Feature Scaling

Feature scaling is a technique for bringing the independent features of the data into a fixed, predetermined range. It is performed during the pre-processing of the data.

Working: Consider a data set of 5,000 individuals with the features Age, Salary, and BHK Apartment, each of which is an independent data feature.

Each data point is marked as follows:

Class 1 - YES (meaning that, with the given Age, Salary, and BHK Apartment feature values, one can buy the property)

Class 2 - NO (meaning that, with the given Age, Salary, and BHK Apartment feature values, one cannot buy the property)

To develop a model that can predict whether a person can buy a property with given feature values, the data set is used to train the model. Once the model has been trained, the data points of the data set can be plotted on an N-dimensional graph, where N is the number of features in the data set. In the picture, the points shown as stars carry the Class 1 - YES label and the circles carry the Class 2 - NO label. Now a new data point arrives (the diamond in the picture) with its own values for the three features (Age, Salary, and BHK Apartment) described earlier, and the model has to predict whether this data point belongs to YES or NO.

Predicting the class of the new data point: The model computes the distance between this data point and the centroid of each class group. The data point is then assigned to the class whose centroid lies at the minimum distance from it. The following techniques can be used to calculate the distance between the centroid and the data point.
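As a rough sketch of this nearest-centroid prediction step (assuming NumPy; the centroid for the NO class and the other numeric values here are illustrative placeholders, not taken from the actual data set):

```python
import numpy as np

# Hypothetical class centroids in the order [Age, Salary, BHK Apartment].
# The values are illustrative placeholders, not computed from real data.
centroid_yes = np.array([40, 2_200_000, 3])
centroid_no = np.array([25, 800_000, 1])

# New data point whose class we want to predict.
new_point = np.array([57, 3_300_000, 2])

# Distance from the new point to each centroid (Euclidean here);
# the point is assigned to the class with the nearest centroid.
d_yes = np.linalg.norm(new_point - centroid_yes)
d_no = np.linalg.norm(new_point - centroid_no)

print("Predicted class:", "YES" if d_yes < d_no else "NO")
```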

Euclidean Distance: It is the square root of the sum of the squares of the differences between the coordinates (feature values: Age, Salary, and BHK Apartment) of the data point and the centroid of each class. The formula follows from the Pythagorean theorem.

d(x, y)=\sqrt{\sum_{k=1}^{n}\left(x_{k}-y_{k}\right)^{2}}

where n is the number of features, x_k is the data point's value for feature k, and y_k is the centroid's value for that feature; for the example data set, n = 3.

Manhattan Distance: It is calculated as the sum of the absolute differences between the coordinates (feature values) of the data point and the centroid of each class.

d(x, y)=\sum_{k=1}^{n}\left|x_{k}-y_{k}\right|

Minkowski Distance: It is a generalization of the above two techniques. Different choices of r give different distance measures:

d(x, y)=\sqrt[r]{\sum_{k=1}^{n}\left|x_{k}-y_{k}\right|^{r}}

With r = 2 it reduces to the Euclidean distance, and with r = 1 to the Manhattan distance.
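The three distance measures can be written as small helper functions. A minimal sketch in plain Python (no external libraries assumed; the example point and centroid are the ones used later in this post):

```python
def minkowski(x, y, r):
    # Minkowski distance: r-th root of the sum of |x_k - y_k|^r.
    return sum(abs(xk - yk) ** r for xk, yk in zip(x, y)) ** (1 / r)

def euclidean(x, y):
    # Euclidean distance is the Minkowski distance with r = 2.
    return minkowski(x, y, 2)

def manhattan(x, y):
    # Manhattan distance is the Minkowski distance with r = 1.
    return minkowski(x, y, 1)

point = [57, 3_300_000, 2]       # [Age, Salary, BHK Apartment]
centroid = [40, 2_200_000, 3]    # centroid of Class 1 from the example below

print(euclidean(point, centroid))
print(manhattan(point, centroid))
```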

Need for feature scaling

Age, Salary, and BHK Apartment are the three attributes in the given data set. Consider a range of 10–60 for Age, 1–40 Lacs for Salary, and 1–5 for the number of bedrooms (BHK) in a home. These features are independent of one another. Assume the data point to be predicted is [57, 33 Lacs, 2] and the centroid of Class 1 is [40, 22 Lacs, 3].

Applying the Manhattan Method,

Distance = |40 - 57| + |2200000 - 3300000| + |3 - 2| = 17 + 1100000 + 1 = 1100018

Since all the features are independent of one another (for example, a person's salary has nothing to do with their age or with their requirements for a flat), the Salary feature will dominate all the other features when predicting the class of the given data point. As a result, the model will consistently make incorrect predictions.
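To see this dominance concretely, here is a small sketch that breaks the Manhattan distance above into per-feature contributions (the numbers are the example point and centroid from this section):

```python
point = [57, 3_300_000, 2]       # [Age, Salary, BHK Apartment]
centroid = [40, 2_200_000, 3]    # centroid of Class 1
features = ["Age", "Salary", "BHK Apartment"]

contributions = [abs(p - c) for p, c in zip(point, centroid)]
total = sum(contributions)       # 17 + 1100000 + 1 = 1100018

for name, value in zip(features, contributions):
    # Salary accounts for more than 99.99% of the total distance.
    print(f"{name}: {value} ({value / total:.4%} of the distance)")
```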

So, feature scaling is the straightforward answer to this issue. Feature scaling algorithms bring Age, Salary, and BHK into a fixed range, such as [-1, 1] or [0, 1], so that no feature can dominate the others.
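As one way this could be done in practice, here is a sketch using scikit-learn's MinMaxScaler to map each feature to [0, 1]; the sample rows are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up sample rows in the order [Age, Salary, BHK Apartment].
X = np.array([
    [25,   800_000, 1],
    [40, 2_200_000, 3],
    [57, 3_300_000, 2],
    [60, 4_000_000, 5],
])

# MinMaxScaler rescales each feature independently to the range [0, 1].
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)
print(X_scaled)

# After scaling, the Manhattan distance between two rows is no longer
# dominated by the Salary feature.
print(np.abs(X_scaled[1] - X_scaled[2]).sum())
```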

Some points remain to be covered in this topic, so in the next blog we will focus on those and cover new topics as well. Thank you for reading.