Support Vector Machines (SVM) in Ruby

Your Family Guy fan-site is riding a wave of viral referrals, and the community has grown tenfold in the last month alone! First you deployed an SVD recommendation system, then you optimized the site content and layout with the help of decision trees. Of course, that wasn't enough, so you also added a Bayes classifier to help you filter and rank the content - no wonder the site is doing so well! The community is buzzing with action, but as with any high-traffic honey pot, the spam bots have also arrived on the scene. No problem, you think to yourself, SVMs will be perfect for this one.

History of Support Vector Machines

The Support Vector Machine (SVM) is a supervised learning algorithm developed by Vladimir Vapnik and his co-workers at AT&T Bell Labs in the mid-1990s. Since their inception, SVMs have repeatedly been shown to outperform many earlier learning algorithms in both classification and regression applications. In fact, their elegance and their rigorous mathematical foundations in optimization and statistical learning theory have propelled SVMs to the very forefront of the machine learning field within the last decade.

At their core, SVMs are a method for creating a predictor function from a set of training data, where the function itself can be a binary, a multi-category, or even a general regression predictor. To accomplish this mathematical feat, SVMs find a hyperplane (a line in 2D, a plane in 3D) which attempts to split the positive and negative examples with the largest possible margin on both sides of the hyperplane.

Thus, as you can see from the diagram, SVMs make an implicit assumption that the larger the margin (the distance between the hyperplane and the examples closest to it), the better the classifier will perform on unseen data - arguably a leap of faith, but in practice this assumption has held up extremely well. Certainly within the context of text classification (spam, or not spam, for example) SVMs have become the weapon of choice for many ML/AI researchers!
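To make the idea of a margin concrete, here is a minimal sketch in plain Ruby (no LIBSVM involved) that measures the margin of two hand-picked separating lines over a toy 2D data set. The method names, the toy points and the two candidate lines are all made up for illustration, and the sketch assumes the +1 / -1 label convention from the SVM literature rather than the 1 / 0 labels used later in this article. An SVM searches for the hyperplane that maximizes this quantity; the sketch only evaluates it for given candidates.

```ruby
# Signed distance from point x to the hyperplane w . x + b = 0.
def signed_distance(w, b, x)
  dot  = w.zip(x).sum { |wi, xi| wi * xi }
  norm = Math.sqrt(w.sum { |wi| wi * wi })
  (dot + b) / norm
end

# Margin of a hyperplane over a labeled set: the smallest distance from any
# example to the hyperplane. It is negative if an example is misclassified.
def margin(w, b, examples, classes)
  examples.zip(classes).map { |x, y| y * signed_distance(w, b, x) }.min
end

# Hypothetical toy data: two positive and two negative examples.
examples = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
classes  = [1, 1, -1, -1]

margin_a = margin([1.0, 1.0], -2.0, examples, classes) # line x + y = 2
margin_b = margin([1.0, 0.0], -1.0, examples, classes) # line x = 1

puts "Margin of x + y = 2: #{margin_a.round(3)}" # approximately 1.414
puts "Margin of x = 1:     #{margin_b.round(3)}" # 1.0
```

Both lines separate the two classes, but the first one leaves more room around the closest examples, so it is the one a maximum-margin classifier would prefer.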

Installing and Configuring LIBSVM with Ruby

There is a plethora of available SVM implementations, but we will choose LIBSVM for our purposes. Aside from being one of the most popular libraries, it also happens to have a set of Ruby bindings to make our life much more enjoyable! To get yourself up and running:

# Install LIBSVM
$ sudo apt-get install libsvm2 libsvm-dev libsvm-tools

# Install RubySVM bindings
$ wget http://debian.cilibrar.com/debian/pool/main/libs/libsvm-ruby/libsvm-ruby_2.8.4.orig.tar.gz
$ tar zxvf libsvm*
$ cd libsvm*
$ ./configure
$ make && make install

Preparing the Data - Building Document Vectors

To perform text classification with SVMs we first have to convert our documents to a vector space model. In this representation, instead of working with raw sentences, the text is broken down into individual words, a unique id is assigned to each unique word, and the text is then reconstructed as a sequence of unique word ids. As usual, a picture is worth a thousand words:

Thus, if we use the global dictionary for each document, and mark all present words as '1', and missing words as '0', Document A can be represented as [1, 1, 0, 1, 1] - indices 1, 2, 4, and 5 are marked as 1, and index 3 (Ilya) is missing from this document. In similar fashion, Document B would become: [0, 1, 1, 1, 1]. Thankfully, this process is easily automated with a few Ruby one-liners:

# Sample training set ...
# ----------------------------------------------------------
  # Labels for each document in the training set
  #    1 = Spam, 0 = Not-Spam
  labels = [1, 1, 0, 1, 1, 0, 0]

  documents = [
    %w[FREE NATIONAL TREASURE],                            # Spam
    %w[FREE TV for EVERY visitor],                         # Spam
    %w[Peter and Stewie are hilarious],                    # OK
    %w[AS SEEN ON NATIONAL TV],                            # Spam
    %w[FREE drugs],                                        # Spam
    %w[New episode rocks, Peter and Stewie are hilarious], # OK
    %w[Peter is my fav!]                                   # OK
    # ...
  ]

# Test set ...
# ----------------------------------------------------------
  test_labels = [1, 0, 0]

  test_documents = [
    %w[FREE lotterry for the NATIONAL TREASURE !!!], # Spam
    %w[Stewie is hilarious],     # OK
    %w[Poor Peter ... hilarious],    # OK
    # ...
  ]

# Build a global dictionary of all possible words
dictionary = (documents+test_documents).flatten.uniq
puts "Global dictionary: \n #{dictionary.inspect}\n\n"

# Build binary feature vectors for each document
#  - If a word is present in document, it is marked as '1', otherwise '0'
#  - Each word has a unique ID as defined by 'dictionary'
feature_vectors = documents.map { |doc| dictionary.map{|x| doc.include?(x) ? 1 : 0} }
test_vectors = test_documents.map { |doc| dictionary.map{|x| doc.include?(x) ? 1 : 0} }

puts "First training vector: #{feature_vectors.first.inspect}\n"
puts "First test vector: #{test_vectors.first.inspect}\n"

For the sake of an example we'll keep the training set nice and short - in production, we would use hundreds of examples to train our classifier. Nonetheless, executing our code returns the following (notice that the word 'FREE', which corresponds to index 0 in the dictionary below, is marked as present in both the first training and first test documents, just as expected):

Global dictionary:
["FREE", "NATIONAL", "TREASURE", "TV", "for", "EVERY", "visitor", "Peter", "and", "Stewie", "are", "hilarious", "AS", "SEEN", "ON", "drugs", "New", "episode", "rocks,", "is", "my", "fav!", "lotterry", "the", "!!!", "Poor", "..."]

First training vector: [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
First test vector: [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
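Most entries in these vectors are zero, which is why LIBSVM's own text format for data files is sparse: each line is a label followed by index:value pairs, with 1-based indices in ascending order and zero-valued features left out. The Ruby bindings used in this article take dense arrays, so this step is not required here. The following is only a sketch of how the same vectors could be written out for the command-line tools from the libsvm-tools package; to_libsvm_line is a hypothetical helper, not part of any library.

```ruby
# Convert a label and a dense feature vector into one line of LIBSVM's
# sparse text format: "<label> <index>:<value> <index>:<value> ..."
# Indices are 1-based and ascending; zero-valued features are omitted.
def to_libsvm_line(label, vector)
  pairs = vector.each_with_index
                .reject { |value, _index| value.zero? }
                .map { |value, index| "#{index + 1}:#{value}" }
  ([label] + pairs).join(' ')
end

# Two short, made-up vectors standing in for feature_vectors above.
sample_labels  = [1, 0]
sample_vectors = [[1, 1, 1, 0, 0], [0, 0, 0, 1, 1]]

sample_labels.zip(sample_vectors).each do |label, vector|
  puts to_libsvm_line(label, vector)
end
# Prints:
#   1 1:1 2:1 3:1
#   0 4:1 5:1
```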

Training the Support Vector Machine

With the grunt work behind us, we're finally ready to train our spam classifier. LIBSVM comes with a selection of kernels which implicitly map our vectors into a higher-dimensional space. For text classification a simple linear kernel is often all you need (note that LIBSVM's own default is the RBF kernel, not the linear one), but for the sake of experiment, let's try a few different ones:

require 'rubygems'
require 'SVM'
include SVM

puts "Spam filtering test with LIBSVM"
puts "-------------------------------"

# ... insert svm-documents.rb code

# Define SVM parameters -- note that svm_type and C are set explicitly
# here rather than left at LIBSVM's defaults
pa = Parameter.new
pa.C = 100
pa.svm_type = NU_SVC
pa.degree = 1
pa.coef0 = 0
pa.eps= 0.001

sp = Problem.new

# Add documents to the training set
labels.each_index { |i| sp.addExample(labels[i], feature_vectors[i]) }

# We're not sure which Kernel will perform best, so let's give each a try
kernels = [ LINEAR, POLY, RBF, SIGMOID ]
kernel_names = [ 'Linear', 'Polynomial', 'Radial basis function', 'Sigmoid' ]

kernels.each_index { |j|
  # Iterate over each kernel type
  pa.kernel_type = kernels[j]
  m = Model.new(sp, pa)
  errors = 0

  # Test kernel performance on the training set
  labels.each_index { |i|
    pred, probs = m.predict_probability(feature_vectors[i])
    puts "Prediction: #{pred}, True label: #{labels[i]}, Kernel: #{kernel_names[j]}"
    errors += 1 if labels[i] != pred
  }
  puts "Kernel #{kernel_names[j]} made #{errors} errors on the training set"

  # Test kernel performance on the test set
  errors = 0
  test_labels.each_index { |i|
    pred, probs = m.predict_probability(test_vectors[i])
    puts "\t Prediction: #{pred}, True label: #{test_labels[i]}"
    errors += 1 if test_labels[i] != pred
  }

  puts "Kernel #{kernel_names[j]} made #{errors} errors on the test set \n\n"
}
svm.rb - SVM Classification Code
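For reference, the four kernel types tried in the listing above correspond to the following similarity functions, written out here as a plain Ruby sketch based on the formulas in the LIBSVM documentation. The method names and the default values for gamma, coef0 and degree are chosen for illustration only; inside LIBSVM these values come from the parameter object and are not necessarily the ones shown.

```ruby
# Dot product of two equal-length vectors.
def dot(u, v)
  u.zip(v).sum { |a, b| a * b }
end

# Linear: u . v
def linear_kernel(u, v)
  dot(u, v)
end

# Polynomial: (gamma * u . v + coef0) ^ degree
def polynomial_kernel(u, v, gamma: 1.0, coef0: 0.0, degree: 3)
  (gamma * dot(u, v) + coef0)**degree
end

# Radial basis function: exp(-gamma * |u - v|^2)
def rbf_kernel(u, v, gamma: 1.0)
  squared_distance = u.zip(v).sum { |a, b| (a - b)**2 }
  Math.exp(-gamma * squared_distance)
end

# Sigmoid: tanh(gamma * u . v + coef0)
def sigmoid_kernel(u, v, gamma: 1.0, coef0: 0.0)
  Math.tanh(gamma * dot(u, v) + coef0)
end

# Two made-up binary document vectors sharing two words.
first  = [1, 1, 1, 0, 0]
second = [1, 0, 1, 1, 0]

puts linear_kernel(first, second)                # 2
puts polynomial_kernel(first, second, degree: 2) # 4.0
puts rbf_kernel(first, first)                    # 1.0 (identical vectors)
puts sigmoid_kernel(first, second)               # tanh(2.0)
```

With degree set to 1 and coef0 set to 0, as in the parameters above, the polynomial kernel reduces to a scaled linear kernel.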

Running our SVM produces:

Global dictionary:
["FREE", "NATIONAL", "TREASURE", "TV", "for", "EVERY", "visitor", "Peter", "and", "Stewie", "are", "hilarious", "AS", "SEEN", "ON", "drugs", "New", "episode", "rocks,", "is", "my", "fav!", "lotterry", "the", "!!!", "Poor", "..."]

First training vector: [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
First test vector: [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

And there you have it, our Polynomial and Radial basis function kernels correctly identified the spam messages in the system. Over time, as we accumulate more and more training examples, the performance of our classifier should improve, and spam messages should become far less of a problem!
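One practical wrinkle: the example builds its dictionary from the training and test documents together, but on a live site new messages will contain words the classifier has never seen. A minimal sketch of one way to handle this, assuming the dictionary is frozen at training time and unknown words are simply ignored (vectorize and the sample data below are hypothetical, not part of the RubySVM bindings):

```ruby
# Map a tokenized message onto a fixed dictionary. Words that are not in
# the dictionary have no slot in the vector and are silently dropped.
def vectorize(words, dictionary)
  dictionary.map { |term| words.include?(term) ? 1 : 0 }
end

# A made-up dictionary, as it might look after training.
training_dictionary = %w[FREE NATIONAL TREASURE Peter Stewie hilarious]

# 'plushies' never appeared in the training data.
incoming = %w[FREE Stewie plushies]

vector  = vectorize(incoming, training_dictionary)
unknown = incoming - training_dictionary

puts "Vector: #{vector.inspect}"         # [1, 0, 0, 0, 1, 0]
puts "Ignored words: #{unknown.inspect}" # ["plushies"]
```

The resulting vector has the same length as the training vectors, so it can be passed to the trained model unchanged. Tracking the ignored words is one way to decide when the dictionary is stale enough to justify retraining.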

Previous iterations: SVD Recommendation System, Decision Tree Learning and Bayes Classification


Ilya Grigorik