Bayes Classification in Ruby

The Family Guy saga continues. A few days ago the editors of the fan site decided to add a new section: favorite quotes. The users responded with enthusiasm, and began submitting hundreds of their favorite gems. Needless to say, the editors were overwhelmed and decided to invite the engineers to pitch in and help sort through the submissions. Of course, after about five minutes of manual labor, the engineers, who were versed in the intricacies of latent semantic indexing and other machine learning techniques, promptly gave up and automated the process - they built a Bayes classifier.

Naive Bayes and its uses

Bayes classifiers are a family of simple probabilistic algorithms that apply Bayes' theorem to estimate, from training data, how likely an item is to belong to each class. The 'Naive' comes from a simplifying assumption: every feature is treated as independent of the others given the class. In our case, each stemmed word in a quote is treated as a separate variable in the model, and the goal is to estimate the probability that the word, and consequently the quote itself, belongs to a certain class: Funny vs. Not Funny. As you may have already guessed, Bayes classifiers have found extensive use in spam filtering. In fact, one of the most popular packages (SpamAssassin) includes a Naive Bayes-style classifier among its filters. It is an amazingly simple approach, and also a surprisingly powerful one.
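To make the idea concrete, here is a minimal from-scratch sketch of the technique in plain Ruby. It is not the code of any particular library: the TinyBayes class, its method names, and the toy training sentences are all made up for illustration. It also skips stemming and uses add-one (Laplace) smoothing, which is one common choice rather than the only one.

```ruby
# Illustrative sketch only: TinyBayes is a hypothetical class, not a library API.
class TinyBayes
  def initialize(*categories)
    @categories  = categories
    # @word_counts[category][word] => times the word appeared in that category
    @word_counts = Hash.new { |hash, key| hash[key] = Hash.new(0) }
    @doc_counts  = Hash.new(0) # training documents seen per category
    @vocabulary  = {}          # every distinct word seen during training
  end

  def train(category, text)
    @doc_counts[category] += 1
    tokenize(text).each do |word|
      @word_counts[category][word] += 1
      @vocabulary[word] = true
    end
  end

  # log P(category) + sum of log P(word | category), with add-one smoothing.
  # Logs are summed instead of multiplying raw probabilities to avoid underflow.
  def score(category, text)
    total_docs  = @doc_counts.values.sum
    total_words = @word_counts[category].values.sum
    log_prior   = Math.log((@doc_counts[category] + 1.0) / (total_docs + @categories.size))
    tokenize(text).reduce(log_prior) do |sum, word|
      count = @word_counts[category][word]
      sum + Math.log((count + 1.0) / (total_words + @vocabulary.size))
    end
  end

  # Pick the category with the highest score
  def classify(text)
    @categories.max_by { |category| score(category, text) }
  end

  private

  # Lowercase and split into words; a real setup would also stem each word
  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

# Toy training data, invented for this example
bayes = TinyBayes.new(:funny, :not_funny)
bayes.train(:funny, "Damn you ice cream, come to my mouth")
bayes.train(:funny, "It could even be a boat")
bayes.train(:not_funny, "The meeting is scheduled for Monday morning")
bayes.train(:not_funny, "Please submit the quarterly report")

puts bayes.classify("a boat made of ice cream")
```

The 'naive' part is visible in score: each word contributes its own independent term, with no attention paid to word order or to which words appear together.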

Building the classifier

Our machine learning guru didn't last very long when he was asked to handpick the quotes. He managed to classify about ten, which he promptly stored in YAML format on his disk and decided to use as a training set for his algorithm. Of course, as with any machine learning technique, the more examples the better, and the developer knew that he needed a much bigger training set (100+), but here he went with what was available. After all, this was just a proof of concept:

require 'rubygems'
require 'yaml'
require 'stemmer'
require 'classifier'

# Load previous classifications
funny     = YAML::load_file('funny.yml')
not_funny = YAML::load_file('not_funny.yml')

# Create our Bayes classifier with two categories
classifier = Classifier::Bayes.new('Funny', 'Not Funny')

# Train the classifier
not_funny.each { |boo| classifier.train_not_funny boo }
funny.each { |good_one| classifier.train_funny good_one }

# Let's classify some new quotes
puts classifier.classify "Peter: A boat's a boat but a box could be anything! It could even be a boat!"
puts classifier.classify "Stewie: Damn you ice cream, come to my mouth! How dare you disobey me!"
puts classifier.classify "Brian: I could take my sweater off too, but I think it's attached to my skin. "
puts classifier.classify "Peter: Hey, anybody got a quarter? Bill Gates: What's a quarter? "
puts classifier.classify "Peter: I had such a crush on her. Until I met you Lois. You're my silver medal. "
puts classifier.classify "Meg: Excuse me, Mayor West? Adam West: How do you know my language? "
puts classifier.classify "Meg: You could kill all the girls who are prettier than me. Death: Well, that would just leave England. "
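The script above reads its training data from funny.yml and not_funny.yml, whose contents are not shown. Since each loaded object is iterated with each and every element is passed to a train method, a reasonable assumption is that each file holds a plain YAML list of quote strings. A minimal sketch of producing and reading back such a file, with placeholder quotes and a temporary directory standing in for the real files:

```ruby
require 'yaml'
require 'tmpdir'

# Placeholder entries; the author's actual training quotes are not shown
funny_quotes = [
  "Example quote the editors marked as funny",
  "Another example quote marked as funny"
]

# Write the list as a YAML sequence, then load it back the way the script does
round_tripped = Dir.mktmpdir do |dir|
  path = File.join(dir, 'funny.yml')
  File.write(path, funny_quotes.to_yaml)
  YAML.load_file(path) # an Array of Strings, ready for classifier training
end

puts round_tripped.inspect
```

A flat list keeps the training loop trivial: one string per quote, one train call per string.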

It really is that simple. Install Lucas Carlson's classifier gem (gem install classifier), and you are ready to roll. As part of the package, you get the Naive Bayes implementation, complemented by LSI (Latent Semantic Indexing), a more advanced technique which we covered in a previous iteration. Executing the code produces:

Not funny
Not funny
Not funny
Not funny

The results are definitely up for debate, but the classification process works and is bound to improve as the number of training examples grows. More importantly, the developers quickly realized that they could offer this on an individual basis; after all, we all have different opinions on what's funny and what's not. They reasoned it would be a great addition to the season recommendation algorithm they deployed earlier, and so it was done. Can't blame them: all it took was 10 lines of Ruby!

In previous iterations: SVD Recommendation System, Decision Tree Learning, Support Vector Machines

Ilya Grigorik is a web performance engineer at Google, co-chair of the W3C Web Performance working group, and author of High Performance Browser Networking (O'Reilly).