# K Nearest Neighbors (KNN)

K Nearest Neighbors (KNN) is one of the most popular and intuitive supervised machine learning algorithms. It is available in Excel using the XLSTAT software.

## What is K Nearest Neighbors (KNN) machine learning?

The **K Nearest Neighbors** (**KNN**) **algorithm** is a non-parametric method used in both classification and regression. It assumes that similar objects lie close to one another: objects that are close (in terms of a given distance metric) are expected to belong to the same class or to share similar properties.

The **K Nearest Neighbors method** (**KNN**) assigns a class to query points whose class is unknown, based on their distances to points in a learning set (i.e. points whose class is known a priori). It is one of the most popular supervised **machine learning** tools and can be used for both regression and classification.

### What is K Nearest Neighbors (KNN) Classification?

A simple version of the **KNN** classification algorithm can be regarded as an extension of the nearest neighbor (NN) method, which is the special case of KNN with k = 1. The nearest neighbor method assigns to an object the class of its nearest neighbor.

The **KNN classification** approach assumes that each example in the learning set is a random vector in $\mathbb{R}^n$. Each point is described as $x = \langle a_1(x), a_2(x), \ldots, a_n(x) \rangle$, where $a_r(x)$ denotes the value of the $r$-th attribute. $a_r(x)$ can be either a quantitative or a qualitative variable.

To determine the class of the query point $x_q$, each of its k nearest points $x_1, \ldots, x_k$ casts a vote for its own class. The class assigned to $x_q$ is the one that receives the majority of the votes.
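The voting step can be illustrated with a minimal Python sketch. It is not XLSTAT's implementation: the function names, the toy data, and the choice of Euclidean distance are assumptions made for illustration, and ties are simply resolved in favor of the class encountered first among the nearest neighbors.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_classify(train_points, train_labels, query, k=3):
    """Return the majority class among the k training points nearest to query."""
    # Rank training indices by increasing distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: euclidean(train_points[i], query))
    # Each of the k nearest neighbors votes for its own class.
    votes = Counter(train_labels[i] for i in order[:k])
    # most_common keeps first-encountered order for equal counts,
    # so a tie goes to the class of the closer neighbor.
    return votes.most_common(1)[0][0]


# Hypothetical learning set: two well-separated groups in the plane.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
                (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

predicted = knn_classify(train_points, train_labels, (2.0, 2.0), k=3)
print(predicted)
```

Setting `k=1` reduces this sketch to the nearest neighbor method described above.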

### What is K Nearest Neighbors (KNN) Regression?

The goal of the **K Nearest Neighbors (KNN) regression** **algorithm**, on the other hand, is to predict a numerical dependent variable for a query point $x_q$, based on the mean or the median of its values for the k nearest points $x_1, \ldots, x_k$.
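As a rough illustration of the regression case, the sketch below predicts a value as the mean or the median of the targets of the k nearest points. The function name `knn_regress`, the toy data, and the use of Euclidean distance (`math.dist`, Python 3.8+) are assumptions for the example, not a description of XLSTAT's internals.

```python
import math
from statistics import mean, median


def knn_regress(train_points, train_targets, query, k=3, aggregate=mean):
    """Predict a numeric value from the k training points nearest to query."""
    # Rank training indices by increasing Euclidean distance to the query.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    # Aggregate the dependent variable over the k nearest neighbors.
    return aggregate([train_targets[i] for i in order[:k]])


# Hypothetical one-dimensional learning set.
train_points = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_targets = [10.0, 20.0, 60.0, 100.0]

mean_prediction = knn_regress(train_points, train_targets, (2.2,), k=3)
median_prediction = knn_regress(train_points, train_targets, (2.2,), k=3,
                                aggregate=median)
print(mean_prediction, median_prediction)
```

The median is less sensitive than the mean to a single neighbor with an extreme value, which is one reason to offer both.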

## K Nearest Neighbors in XLSTAT: options

**Distances**: Several distance metrics can be used in XLSTAT to compute **similarities** in the **K Nearest Neighbors** algorithm. Options vary according to the type of variables characterizing the observations (qualitative or quantitative).

- Distances available for quantitative data (metrics): Euclidean, Minkowski, Manhattan, Chebyshev, Canberra
- Distances available for quantitative data (kernels): linear, sigmoid, logarithmic, power, Gaussian, Laplacian
- Distances available for qualitative data: Overlap Metric (OM), Value Difference Metric (VDM)
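To make some of the listed distances concrete, here is a short Python sketch using their usual textbook definitions. XLSTAT's exact formulas may differ in detail; the function names are illustrative, the overlap metric is taken here as the count of attributes whose values differ, and the kernels and VDM are left out.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def euclidean(a, b):
    """Straight-line distance, i.e. Minkowski distance with p = 2."""
    return minkowski(a, b, 2)


def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))


def canberra(a, b):
    """Sum of |a_i - b_i| / (|a_i| + |b_i|), skipping 0/0 terms."""
    total = 0.0
    for ai, bi in zip(a, b):
        denom = abs(ai) + abs(bi)
        if denom > 0:
            total += abs(ai - bi) / denom
    return total


def overlap(a, b):
    """Overlap metric for qualitative data: number of differing attributes."""
    return sum(1 for ai, bi in zip(a, b) if ai != bi)


u, v = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(manhattan(u, v), euclidean(u, v), chebyshev(u, v), canberra(u, v))
print(overlap(("red", "small", "round"), ("red", "large", "round")))
```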

**Validation**: XLSTAT proposes a **K-fold cross validation** technique to quantify the quality of the classifier. The data is partitioned into *k* subsamples of equal size. Among the *k* subsamples, a single subsample is retained as the validation data to test the model, and the remaining *k* − 1 subsamples are used as training data. The process is repeated *k* times so that each subsample serves once as validation data. This technique can be used to find the number of neighbors for which the **algorithm** performs best on the considered dataset.
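The following Python sketch shows one way cross validation could be used to compare candidate numbers of neighbors. It is a simplified illustration under stated assumptions: the names `cross_val_accuracy` and `best_n_neighbors` are hypothetical, folds are built by interleaving indices without shuffling, and accuracy is used as the quality measure. To avoid confusion between the two meanings of k, the code uses `n_folds` for the number of subsamples and `n_neighbors` for the number of neighbors.

```python
import math
from collections import Counter


def knn_classify(points, labels, query, n_neighbors):
    """Majority vote among the n_neighbors points nearest to query."""
    order = sorted(range(len(points)),
                   key=lambda i: math.dist(points[i], query))
    votes = Counter(labels[i] for i in order[:n_neighbors])
    return votes.most_common(1)[0][0]


def cross_val_accuracy(points, labels, n_neighbors, n_folds):
    """Fraction of points classified correctly when each fold is held out once."""
    n = len(points)
    correct = 0
    for fold in range(n_folds):
        # Fold `fold` holds indices fold, fold + n_folds, fold + 2 * n_folds, ...
        val_idx = set(range(fold, n, n_folds))
        train_pts = [points[i] for i in range(n) if i not in val_idx]
        train_lbl = [labels[i] for i in range(n) if i not in val_idx]
        for i in val_idx:
            if knn_classify(train_pts, train_lbl, points[i], n_neighbors) == labels[i]:
                correct += 1
    return correct / n


def best_n_neighbors(points, labels, candidates, n_folds):
    """Return the candidate with the highest cross-validated accuracy and all scores."""
    scores = {c: cross_val_accuracy(points, labels, c, n_folds) for c in candidates}
    # On equal scores, max keeps the first candidate in the list.
    return max(scores, key=scores.get), scores


# Hypothetical dataset: two well-separated groups of six points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (1.0, 0.5),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (10.5, 10.5), (11.0, 10.5)]
labels = ["A"] * 6 + ["B"] * 6

best, scores = best_n_neighbors(points, labels, candidates=[1, 3, 9], n_folds=4)
print(best, scores)
```

On real data the observations would normally be shuffled before being split into folds, so that each fold reflects the overall class distribution.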

Other options available in the XLSTAT **K Nearest Neighbors** feature include **observation tracking** as well as **vote weighting**.

## K Nearest Neighbors in XLSTAT: results

The **K Nearest Neighbors (KNN)** feature in XLSTAT can display results by class or by object (observation).

Although the **K Nearest Neighbors** (**KNN**) algorithm is one of the most popular machine learning techniques for classification, other algorithms such as **classification and regression random forests** might be considered instead when working with larger datasets.

