In the previous tutorials on R programming, I showed how to perform Twitter analysis, sentiment analysis, reading files in R, cleaning data for text mining and more. In this post you will learn about the popular kNN classification algorithm through a case study in R.
What is kNN Algorithm?
kNN stands for k-Nearest Neighbors, and it is a very simple and effective classification algorithm. A nice property of kNN is its fast training time: as a lazy learner it essentially just stores the training data, and the real work happens at prediction time. A few points about the kNN algorithm:
- Supervised Learning Algorithm
- Easy to Learn and Apply
The performance of the kNN algorithm depends largely on the value of k, i.e. the number of nearest neighbors considered. A large value of k reduces the influence of noisy data points, but it also smooths over small local patterns and can blur the boundaries between classes; a common rule of thumb is to start with k near the square root of the number of training observations.
Calculating Distance
The most commonly used distance measure in the kNN algorithm is the Euclidean distance. For two points A = (a1, a2) and B = (b1, b2), it is calculated as d(A, B) = sqrt((a1 - b1)^2 + (a2 - b2)^2).
For example, to calculate the distance between two points A (fac1 = 6, fac2 = 4) and B (fac1 = 3, fac2 = 7), applying the formula above gives sqrt((6 - 3)^2 + (4 - 7)^2) = sqrt(9 + 9) = sqrt(18) ≈ 4.24.
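As a quick check, the same calculation can be done in R. This is a minimal sketch; the euclidean_distance helper is only for illustration and is not used in the case study later.

euclidean_distance <- function(p, q) {
  # straight-line (Euclidean) distance between two numeric vectors of equal length
  sqrt(sum((p - q)^2))
}

A <- c(6, 4)   # fac1 = 6, fac2 = 4
B <- c(3, 7)   # fac1 = 3, fac2 = 7
euclidean_distance(A, B)   # sqrt(18) ≈ 4.24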
kNN Classification Using R Programming
Before starting with the kNN implementation using R, I would like to discuss two Data Normalization/Standardization techniques.
Min-Max Normalization:
This process transforms a feature so that all of its values fall in the range 0 to 1. The formula is x_new = (x - min(x)) / (max(x) - min(x)).
Z-Score Standardization:
This process subtracts the mean of feature x from each value and divides by the standard deviation of x, so the rescaled feature has mean 0 and standard deviation 1. The formula is z = (x - mean(x)) / sd(x).
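Both rescalings are one-liners in R. The snippet below is a minimal sketch on a made-up vector x, not part of the case study that follows:

# toy feature vector, for illustration only
x <- c(10, 20, 30, 40, 50)

# min-max normalization: values end up between 0 and 1
min_max <- (x - min(x)) / (max(x) - min(x))

# z-score standardization: mean 0, standard deviation 1
z_score <- (x - mean(x)) / sd(x)   # equivalent to as.numeric(scale(x))

min_max   # 0.00 0.25 0.50 0.75 1.00
z_score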
Case Study: Classifying the species in the Iris dataset.
# read the Iris data from the UCI repository and name the columns
iris_data <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"),
                      header = FALSE, stringsAsFactors = FALSE)
names(iris_data) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")

str(iris_data)               # inspect the structure of the data frame
table(iris_data$Species)     # count observations per species
Output of the table() function: 50 observations each of Iris-setosa, Iris-versicolor and Iris-virginica.
To normalize the data, use the following function.
normalize <- function(x) {
  # rescale x to the 0-1 range (min-max normalization)
  return((x - min(x)) / (max(x) - min(x)))
}

iris_n <- as.data.frame(lapply(iris_data[1:4], normalize))
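As an optional sanity check (not in the original walkthrough), you can confirm that every normalized column now runs from 0 to 1:

summary(iris_n)   # all four columns should show Min. 0 and Max. 1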
Now create the training and test data sets. Note that the features come from the normalized data frame iris_n, while the class labels come from the Species column of iris_data.

set.seed(1234)
# randomly assign each row to group 1 (about 67%, training) or group 2 (about 33%, test)
ind <- sample(2, nrow(iris_n), replace = TRUE, prob = c(0.67, 0.33))

iris_train <- iris_n[ind == 1, ]              # normalized features, training set
iris_test  <- iris_n[ind == 2, ]              # normalized features, test set
iris_train_labels <- iris_data[ind == 1, 5]   # Species labels for the training rows
iris_test_labels  <- iris_data[ind == 2, 5]   # Species labels for the test rows
Load the class package and call the knn() function to build the model. With k = 3, each test observation is assigned the majority class among its three nearest training neighbors.
library(class)
iris_test_pred <- knn(train = iris_train, test = iris_test,
                      cl = iris_train_labels, k = 3)
To evaluate the model, use the CrossTable() function from the "gmodels" package.
library(gmodels)
CrossTable(x = iris_test_labels, y = iris_test_pred, prop.chisq = FALSE)
As the cross-table output shows, the model makes only one error, predicting versicolor where the true species is virginica; all the other test observations are classified correctly.
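Instead of reading the cross table by eye, you can also compute the accuracy directly. This is a small optional check rather than part of the original walkthrough:

# proportion of test observations that were classified correctly
accuracy <- mean(iris_test_pred == iris_test_labels)
accuracy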
Since the data contains only 150 observations, the model makes just one mistake here; on larger datasets there may be more. In that case it is worth trying a different approach, for example using z-score standardization instead of min-max normalization, and varying the value of k to see how it affects the accuracy of the predictions.
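A minimal sketch of that experiment is shown below. It reuses the ind split and label vectors created above and swaps in z-scored features via R's built-in scale() function; the iris_z, iris_train_z and iris_test_z names are just illustrative.

# z-score standardization of the four numeric features
iris_z <- as.data.frame(scale(iris_data[1:4]))

iris_train_z <- iris_z[ind == 1, ]
iris_test_z  <- iris_z[ind == 2, ]

# try several values of k and compare test accuracy
for (k in c(1, 3, 5, 7, 11)) {
  pred <- knn(train = iris_train_z, test = iris_test_z,
              cl = iris_train_labels, k = k)
  cat("k =", k, "accuracy =", mean(pred == iris_test_labels), "\n")
}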
I hope you liked this tutorial on the kNN classification algorithm in R programming. In the next tutorial I will show a comparison of kNN with different values of k and different normalization techniques.