Machine learning is the use of computer algorithms to identify patterns in data and then classify new data according to those patterns. K-Nearest Neighbors (k-NN) is a simple and intuitive machine learning algorithm. Using just basic algebra, k-NN is a powerful tool for many data classification tasks. The following examples demonstrate how a simple k-NN implementation works to classify colors, identify handwritten digits, and predict the topics of articles from BBC News. R code for each example is also included.

Note: The code to replicate these examples can be found on my GitHub. To run the provided examples yourself, you will need R – a free program for statistical computing. Also, R is excellent and there’s no good reason not to have it.

### Colors: An Intuitive Example

The millions of colors on your computer screen are created by combining different levels of Red, Green, and Blue. Each component takes a value from 0 (no color) to 255 (full color). (255,0,0) produces red, because the red component is maxed out and the other two primary colors are not included. (255,255,255) is white and (0,0,0) is black. Combinations like (242,104,12) and (126,91,164) produce all of the exciting colors on your monitor. So, if we randomly draw 1000 values between 0 and 255 for red, green, and blue, we can create 1000 random colors. These colors are shown above.

Let’s imagine that we’d like to use k-NN to classify these colors into categories: red, green, and blue. Start with an intuitive way to determine whether a given color, say (197,179,83), is closer to red or green. If we think of each color’s definition as a point (x,y,z) in three-dimensional space, then one solution would be to use euclidean distance. The distance, then, from (197,179,83) to (255,0,0) is:

√((197 − 255)² + (179 − 0)² + (83 − 0)²) = √42294 ≈ 205.7

The distance from (197,179,83) to (0,255,0) is:

√((197 − 0)² + (179 − 255)² + (83 − 0)²) = √51474 ≈ 226.9
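These two distances are easy to verify in R. A quick sketch (the function name `color.dist` is mine, not part of the example code below):

```r
# Euclidean distance between two colors, treated as points in 3-D space
color.dist <- function(a, b) sqrt(sum((a - b)^2))

mystery <- c(197, 179, 83)
color.dist(mystery, c(255, 0, 0))  # distance to pure red, about 205.7
color.dist(mystery, c(0, 255, 0))  # distance to pure green, about 226.9
```

Since the distance to red is smaller, a 1-nearest-neighbor rule would label this color "red".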

Given this metric, we conclude that the color in question is closer to red than to green. Now let’s repeat that comparison for the entire set of random colors we have created. For each of the 1000 colors, we calculate the distance from that color to each of our three category colors: red, green, and blue. The category with the shortest distance is then assigned to that color. That is the foundation of k-NN machine learning. Here are our original 1000 random colors, grouped by k-NN:

The code for this is pretty simple. I’ve written it up as the two functions below:

```r
guess.knn <- function(x, train, trainlabels, k) {
  # Calculate the euclidean distance from x to all training set observations:
  xmatrix <- matrix(as.numeric(x), nrow=nrow(train), ncol=length(x), byrow=T)
  xmatrix <- (as.matrix(train) - xmatrix)^2
  diffs <- sqrt(rowSums(xmatrix))

  # Sort the category labels by their euclidean distances:
  diffs <- data.frame(dist=diffs, label=trainlabels)
  diffs <- diffs[order(diffs$dist),]
  diffs <- diffs[1:k,]

  # Find the most commonly-matched category among the top k results:
  guess <- names(sort(-table(diffs$label)))[1]
  return(guess)
}

knn <- function(train, test, trainlabels, k=1) {
  results <- apply(test, 1, function(x) guess.knn(x, train, trainlabels, k))
  return(results)
}
```

After loading the above functions, the color example can be run using this code:

```r
# Create random colors dataframe:
colors <- data.frame(matrix(round(runif(3000,0,255)), nrow=1000, ncol=3))
names(colors) <- c("red","green","blue")

# Create hex-valued colors for plotting purposes only
# (zero-padded so components below 16 don't break the "#RRGGBB" string):
colors$composite <- sprintf("#%02x%02x%02x",
  as.integer(colors$red), as.integer(colors$green), as.integer(colors$blue))

# Plot uncategorized data:
plot(0, 0, type="n", xlim=c(1,1000), ylim=c(0,1),
     xlab="Random Colors", ylab="", yaxt="n", frame=F)
rect(0:999, rep(0,1000), 1:1000, rep(1,1000), col=colors$composite, border=NA)

# Create training set of three colors:
traincolors <- data.frame(rbind(c(255,0,0), c(0,255,0), c(0,0,255)))
trainlabels <- c("red","green","blue")

# Run kNN algorithm:
colorresults <- knn(traincolors, colors[,1:3], trainlabels, k=1)

# Sort results by predicted category and plot them:
colors$result <- colorresults
colors <- colors[order(colors$result),]
plot(0, 0, type="n", xlim=c(1,1000), ylim=c(0,1),
     xlab="Sorted Colors", ylab="", yaxt="n", frame=F)
rect(0:999, rep(0,1000), 1:1000, rep(1,1000), col=colors$composite, border=NA)
```

### Character Recognition

Now that we’ve established the basic process of k-NN, let’s use it for a more difficult task. Given a training set of handwritten digits 0 through 9, correctly label the digits in an unlabeled test set. The data for this project come from MNIST. The handwritten digits have been scanned and centered in 28×28 pixel images. Therefore, there are 784 pixels per digit. Every digit can then be represented as a vector of length 784, with each element of the vector taking a value from 0 (white) to 255 (black). So we can think of these as points in 784-dimensional space: (x, y,…, z). Calculating the euclidean distance between these points is the same as above:

√((x1 − y1)² + (x2 − y2)² + … + (x784 − y784)²)

So, to identify handwritten digits, all we need is a training set of already-labeled digits and a test set of digits that we would like to identify. For every digit in the test set, we calculate the euclidean distance from that digit’s point representation to every digit in the training set. We then find the closest, or most similar, image from the training set and apply its label to the unidentified digit in question.
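Here is a toy sketch of that nearest-neighbor step, with random vectors standing in for real MNIST images (the digits and labels are made up purely for illustration):

```r
# Each 28x28 image is flattened to a vector of 784 pixel intensities (0-255).
# Toy example: three random "training" digits with made-up labels.
set.seed(1)
train.digits <- matrix(sample(0:255, 3 * 784, replace = TRUE), nrow = 3)
train.labels <- c("3", "7", "9")

# The "test" digit is a lightly perturbed copy of the second training image:
test.digit <- train.digits[2, ] + sample(-5:5, 784, replace = TRUE)

# Euclidean distance from the test digit to each training digit:
dists <- sqrt(rowSums((train.digits - matrix(test.digit, 3, 784, byrow = TRUE))^2))
train.labels[which.min(dists)]  # the noisy copy is closest to its original: "7"
```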

Despite its simplicity, this method performs with 95% accuracy on the dataset from MNIST. This improves to roughly 98% if, instead of the typical euclidean distance, we use the l3-Norm for distance:

(|x1 − y1|³ + |x2 − y2|³ + … + |x784 − y784|³)^(1/3)
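Swapping the distance metric is a one-line change in R. A sketch (the function names `l3.dist` and `lp.dist` are mine):

```r
# l3-Norm distance: cube the absolute differences, sum them, take the cube root
l3.dist <- function(a, b) sum(abs(a - b)^3)^(1/3)

# Euclidean distance is just the p = 2 case of the same recipe:
lp.dist <- function(a, b, p = 2) sum(abs(a - b)^p)^(1/p)

l3.dist(c(197, 179, 83), c(255, 0, 0))  # about 186.6
```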

### Classifying Newspaper Articles

BBC News provides a dataset of 2225 articles from 5 sections of its website: business, entertainment, politics, sport, and tech. All stop words have already been filtered out, so this is an ideal dataset for testing k-NN on machine-coding news articles. Instead of providing the actual text of each article, the BBC provides a count of how many times each of 9636 words appears in each article. Clearly, most articles contain only a few of these words. The goal is to determine, using just these word counts and a training set, which section of the news each article is from. For this test, I have iterated over all 2225 articles, predicting each article’s category given all of the other articles. Using the l3-Norm, 86.5% of articles are categorized correctly. Not bad considering we would expect roughly 20% success if we guessed randomly. By category, the success rate is:

Category | Percent Correct | n
---|---|---
Business | 85.7% | 510
Entertainment | 80.3% | 386
Politics | 79.9% | 417
Sports | 99.0% | 511
Technology | 84.3% | 401

Pretty neat that a single algorithm can tackle such seemingly different challenges so well, huh? If you want to try any of these examples yourself, you can find all of the relevant code on my GitHub. This includes expanded versions of the functions above that allow for different distance norms to be used. Don’t sweat finding the data, either; it will be downloaded automatically when the code runs. And if you have any ideas for cool data to categorize with k-NN, suggest them in the comments.
