K-Nearest Neighbors (K-NN): Classification & Regression

Alok Choudhary
6 min read · Sep 7, 2024


K-Nearest Neighbors is a non-parametric, instance-based learning algorithm that can be used for both classification and regression. Intuitively, the algorithm predicts the category or value of a new data point based on its nearest neighbors in feature space. In this article, we will cover the basic mathematics and techniques behind both classification and regression with K-NN.

1. Classification in K-NN

Classification is a type of supervised learning in which the goal is to predict the class or category of a given data point. K-NN does this by finding the data points nearest to the input and using their labels to predict the class of the new data point.

Steps:

1. Features (input data): The input features are represented as a series of feature vectors:

f1, f2, …, fn → y

Here, f1, f2, …, fn are the input features, and y is the output category, which is a discrete variable: usually 0 or 1 (for binary classification), or one of a set of predefined categories.

2. Output (o/p): The algorithm outputs one of a fixed set of categories, chosen by majority vote among the closest points. For binary classification, the output y is either 0 or 1.

3. Classification: Consider two categories: category A (circles) and category B (crosses). When a new data point is introduced, K-NN evaluates its nearest neighbors, and the category that is most frequent among the k nearest neighbors (in this case k = 5) is assigned as the predicted category of the new data point. A minimal sketch of this procedure is shown below.
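To make the voting step concrete, here is a minimal NumPy sketch of K-NN classification. The toy data, the function name knn_classify, and the choice k = 5 are illustrative assumptions, not anything prescribed by the algorithm itself:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=5):
    """Predict the class of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: category A (label 0) and category B (label 1)
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 0.8], [5.0, 8.0], [6.0, 9.0], [5.5, 8.5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(X_train, y_train, np.array([1.4, 1.5]), k=5))  # expected: 0 (category A)
```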

2. Distance Metrics

K-NN relies heavily on the concept of distance to decide which points are most similar to the new data point. The following are four of the most common distance metrics used to calculate how close points are in feature space.

1. Euclidean Distance: Euclidean distance is the most common distance metric used in K-NN. It is the straight-line distance between two points in the feature space. For two points (x1, y1) and (x2, y2):

d = √((x2 − x1)² + (y2 − y1)²)

2. Manhattan Distance (L1 norm): Manhattan distance measures the distance between two points along axes at right angles. It is the sum of the absolute differences of their coordinates. For points (x1, y1) and (x2, y2) the formula is:

d = |x2 − x1| + |y2 − y1|

3. Hamming Distance: Hamming distance is used for categorical variables. It is simply the number of feature positions where the corresponding values of two data points differ. If the strings differ at n positions, the Hamming distance is n. For binary strings, this is a common metric.

4. Minkowski Distance: Minkowski distance is a generalized distance metric that reduces to either Euclidean or Manhattan distance depending on the parameter p. For points (x1, y1) and (x2, y2):

d = (|x2 − x1|^p + |y2 − y1|^p)^(1/p)

When p=1, Minkowski distance becomes Manhattan distance.

When p=2, Minkowski distance becomes Euclidean distance.
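As a rough sketch, all four metrics can be written in a few lines of NumPy. The function names and example points are my own, chosen only to show that p = 1 and p = 2 in the Minkowski formula reproduce the Manhattan and Euclidean results:

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def hamming(a, b):
    # Number of positions where the two vectors differ (categorical/binary features)
    return np.sum(a != b)

def minkowski(a, b, p=2):
    # p = 1 gives Manhattan distance, p = 2 gives Euclidean distance
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b))       # 5.0
print(manhattan(a, b))       # 7.0
print(minkowski(a, b, p=1))  # 7.0, matches Manhattan
print(minkowski(a, b, p=2))  # 5.0, matches Euclidean
```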

3. Regression in K-NN

In regression tasks, K-NN can be used to predict continuous values rather than categories. Instead of majority voting, regression predictions are made by averaging the values of the k-nearest neighbors.

Steps:

1. Features (input data): The input features are continuous variables. For example, size and the number of rooms are used as features to predict the price of a house.

2. Output (o/p): The output in regression is continuous, unlike classification. In this case, we predict the price, which is a real number.

3. Regression: Consider data points plotted with size on the X-axis and number of rooms on the Y-axis. The predicted value for a new data point is obtained by averaging the values of its k = 5 nearest neighbors. An outlier (a data point far from the others) can pull this average away from the true value and hurt prediction accuracy.

The formula used in regression is the mean of the k nearest neighbors' target values:

ŷ = (1/k) · (y1 + y2 + … + yk)
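Here is a minimal sketch of that averaging step on a hypothetical housing dataset; the feature values, prices, and the function name knn_regress are all assumptions made for illustration:

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=5):
    """Predict a continuous value as the mean of the k nearest neighbors' targets."""
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Hypothetical data: [size in square metres, number of rooms] -> price
X_train = np.array([[100, 2], [120, 3], [150, 3], [180, 4], [200, 4], [300, 6]])
y_train = np.array([200_000, 240_000, 280_000, 330_000, 360_000, 550_000])
print(knn_regress(X_train, y_train, np.array([160, 3]), k=5))
```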

4. Key Considerations for KNN

Choosing the Right Value of k:

Small values of k make the model sensitive to noise in the data, leading to high variance (overfitting).

Large values of k smooth the decision boundary, leading to underfitting.

A common strategy is to use cross-validation to select the optimal k, as sketched below.
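For instance, with scikit-learn one can scan a range of k values and keep the one with the best cross-validated accuracy. The iris dataset and the range 1–20 here are stand-ins for your own data and search range:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Mean 5-fold cross-validation accuracy for each candidate k
scores_by_k = {}
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    scores_by_k[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores_by_k, key=scores_by_k.get)
print(best_k, scores_by_k[best_k])
```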

Feature Scaling:

Since KNN is distance-based, it’s important to ensure that all features are on the same scale. This is typically done by standardizing or normalizing the feature values.

Standardization rescales the data to have a mean of 0 and a standard deviation of 1:

z = (x − μ) / σ

Normalization rescales the values to lie between 0 and 1:

x' = (x − min(x)) / (max(x) − min(x))
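Both transformations are available in scikit-learn. A quick sketch follows; the feature values are made up, and in practice the scaler should be fit on the training data only and then applied to new points:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1500.0, 3], [2400.0, 4], [900.0, 2]])  # e.g. size in sq ft, number of rooms

X_std = StandardScaler().fit_transform(X)   # each feature: mean 0, standard deviation 1
X_norm = MinMaxScaler().fit_transform(X)    # each feature rescaled to [0, 1]
```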

Curse of Dimensionality:

K-NN also suffers in high-dimensional spaces, because the distance between points becomes less informative as the number of dimensions increases; this is known as the curse of dimensionality. A common remedy is to reduce the number of dimensions with techniques such as PCA or t-SNE before applying K-NN.
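One way to do this is to chain a PCA projection and a K-NN classifier in a scikit-learn pipeline. The digits dataset and the choice of 20 components below are assumptions made purely for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)  # 64-dimensional inputs

# Project onto 20 principal components before the K-NN step
pipeline = make_pipeline(PCA(n_components=20), KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(pipeline, X, y, cv=5).mean())
```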

5. Improving KNN

There are several ways to enhance the performance of KNN:

  • KD-Trees and Ball Trees can be used to speed up the nearest neighbor search (see the sketch after this list).
  • Dimensionality Reduction techniques like PCA help mitigate the curse of dimensionality.
  • Distance Metric Learning allows the model to learn a more appropriate distance metric for the task at hand.
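As an example of the first point, scikit-learn exposes a KD-tree index directly (and KNeighborsClassifier accepts algorithm="kd_tree" or algorithm="ball_tree"); the random data below is only there to make the snippet self-contained:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_train = rng.random((10_000, 3))   # 10,000 points in 3 dimensions

tree = KDTree(X_train)              # build the spatial index once
dist, idx = tree.query(rng.random((1, 3)), k=5)  # query the 5 nearest neighbors
print(idx)
```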

6. Limitations of K-NN

K-NN has some notable limitations, especially when dealing with real-world data:

Huge Dataset:

K-NN is computationally expensive on large datasets: to make a prediction, it must compute the distance between the new point and every other point in the dataset, which does not scale well as the dataset grows.

Sensitive to Outliers:

K-NN is sensitive to outliers. If an outlier ends up among the nearest neighbors, it can bias the predicted value even though it is not representative of the majority of the data; this is especially visible in regression, where a single extreme neighbor can pull the average.

Sensitive to Missing Values:

The algorithm is sensitive to missing values. If features have missing values, it cannot compute reasonable distances or make sensible predictions, so missing values should be imputed before applying K-NN.
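A simple way to handle this is to impute missing entries before fitting, for example with scikit-learn's SimpleImputer (the small array below is a made-up example):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1500.0, 3.0], [np.nan, 4.0], [900.0, np.nan]])

# Replace each missing entry with the mean of its column before running K-NN
X_filled = SimpleImputer(strategy="mean").fit_transform(X)
print(X_filled)
```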

Conclusion

K-NN is a simple, intuitive algorithm that can be applied to both classification and regression tasks. The choice of distance metric plays a crucial role in its performance, and careful consideration is needed when dealing with large datasets, outliers, or missing data. Despite these limitations, K-NN is often used as a baseline model due to its simplicity and effectiveness, especially for small and well-structured datasets.

Follow me on LinkedIn to explore more about AI and data science innovations.
