Complete Tutorial of kNN Classification Algorithm using R Programming

In the previous tutorials on R programming, I showed how to perform Twitter analysis, sentiment analysis, reading files in R, cleaning data for text mining, and more. In this post you will learn about the very popular kNN classification algorithm through a case study in R.

What is kNN Algorithm?

kNN stands for k-Nearest Neighbors, and it is a very simple and effective classification algorithm. Because kNN is a lazy learner, training is fast: the "model" simply stores the training data, and all the work happens at prediction time. A few points about the kNN algorithm:

  • Supervised Learning Algorithm
  • Easy to Learn and Apply

The effectiveness of the kNN algorithm depends largely on the value of k, i.e., the number of nearest neighbors considered. A large value of k reduces the impact of noisy data, but it has the side effect of smoothing over small, genuine patterns; a small value of k is more sensitive to noise.

Calculating Distance

The most commonly used distance measure in the kNN algorithm is Euclidean distance, the straight-line distance between two points. For two points A = (a1, a2) and B = (b1, b2), it is calculated as:

dist(A, B) = sqrt((a1 - b1)^2 + (a2 - b2)^2)

For example, to calculate the distance between two points A (fac1 = 6, fac2 = 4) and B (fac1 = 3, fac2 = 7), we can apply the formula above:

dist(A, B) = sqrt((6 - 3)^2 + (4 - 7)^2) = sqrt(9 + 9) = sqrt(18) ≈ 4.24
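As a sketch, the same calculation can be reproduced in R (the coordinates are taken from the example above):

```r
# Two points, each with two feature values (fac1, fac2)
A <- c(6, 4)
B <- c(3, 7)

# Euclidean distance: square root of the sum of squared differences
euclidean <- function(p, q) sqrt(sum((p - q)^2))

euclidean(A, B)  # sqrt(18), approximately 4.2426
```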

kNN Classification Using R Programming

Before starting with the kNN implementation in R, I would like to discuss two data normalization/standardization techniques.

Min-Max Normalization:

This process transforms a feature so that all of its values fall in the range 0 to 1:

x_new = (x - min(x)) / (max(x) - min(x))

Z-Score Standardization:

This process subtracts the mean of feature x from each value and divides by the standard deviation of x:

z = (x - mean(x)) / sd(x)
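As a sketch, both techniques can be expressed directly in R (the vector x here is just example data):

```r
x <- c(10, 20, 30, 40, 50)  # example feature values

# Min-max normalization: rescales values into the range [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Z-score standardization: center on the mean, scale by the standard deviation
z_score <- function(x) (x - mean(x)) / sd(x)

min_max(x)  # 0.00 0.25 0.50 0.75 1.00
z_score(x)  # mean 0, standard deviation 1
```

Note that base R's built-in scale() function performs the same z-score standardization.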

Case Study: Classify the Species in the Iris Dataset.

# Download the iris dataset from the UCI repository
iris_data <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE, stringsAsFactors = FALSE)
names(iris_data) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
str(iris_data)            # inspect the structure of the data
table(iris_data$Species)  # count observations per species

The output of table() shows 50 observations for each of the three species: Iris-setosa, Iris-versicolor, and Iris-virginica.

To normalize the data, use the following function.

normalize <- function(x) {
  # rescale x into the range [0, 1]
  return((x - min(x)) / (max(x) - min(x)))
}
iris_n <- as.data.frame(lapply(iris_data[1:4], normalize))

Now create the training and test sets. The features come from the normalized data (iris_n), and the labels come from the original data frame (iris_data):

set.seed(1234)
ind <- sample(2, nrow(iris_data), replace = TRUE, prob = c(0.67, 0.33))
iris_train <- iris_n[ind == 1, ]
iris_test <- iris_n[ind == 2, ]
iris_train_labels <- iris_data[ind == 1, 5]
iris_test_labels <- iris_data[ind == 2, 5]

Load the class package and call the knn() function to build the model. Since kNN is a lazy learner, fitting and predicting on the test set happen in a single call:

library(class)
iris_test_pred <- knn(train = iris_train, test = iris_test, cl = iris_train_labels, k = 3)

To evaluate the model, use the CrossTable() function from the gmodels package.

library(gmodels)
CrossTable(x=iris_test_labels, y=iris_test_pred, prop.chisq = FALSE)

Output: a cross table of actual versus predicted species for the test set.

As you can see from the table above, the model makes only one mistake: one virginica observation is predicted as versicolor, while all the other test observations are classified correctly.
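If gmodels is not installed, base R's table() gives a comparable confusion matrix, and overall accuracy follows directly from the predictions. Here is a minimal self-contained sketch using the built-in iris dataset (so it runs without the download and normalization steps above; the split mirrors the one in this tutorial):

```r
library(class)  # recommended package shipped with R; provides knn()

set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.67, 0.33))
train <- iris[ind == 1, 1:4]
test  <- iris[ind == 2, 1:4]
train_labels <- iris[ind == 1, 5]
test_labels  <- iris[ind == 2, 5]

pred <- knn(train = train, test = test, cl = train_labels, k = 3)

# Confusion matrix: rows = actual species, columns = predicted species
print(table(test_labels, pred))

# Overall accuracy: fraction of test observations predicted correctly
accuracy <- mean(pred == test_labels)
print(accuracy)
```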

Since the data contains only 150 observations, the model makes just one mistake; on larger datasets there may be more errors. In that case, try a different approach: for example, use z-score standardization instead of min-max normalization, and vary the value of k to see the impact on prediction accuracy.
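One way to explore the effect of k is to evaluate the model across a range of values, sketched here with the built-in iris dataset (the range 1 to 15 is an arbitrary choice for illustration):

```r
library(class)  # provides knn()

set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.67, 0.33))
train <- iris[ind == 1, 1:4]
test  <- iris[ind == 2, 1:4]
train_labels <- iris[ind == 1, 5]
test_labels  <- iris[ind == 2, 5]

# Compute test-set accuracy for each value of k from 1 to 15
acc <- sapply(1:15, function(k) {
  pred <- knn(train = train, test = test, cl = train_labels, k = k)
  mean(pred == test_labels)
})
data.frame(k = 1:15, accuracy = acc)
```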

I hope you enjoyed this tutorial on the kNN classification algorithm in R. In the next tutorial I will compare kNN with different values of k and different normalization techniques.
