Intuition & Data-Driven Machine Learning

By Ilya Grigorik on April 20, 2011

Ask someone to describe the fields of AI and Machine Learning, and the usual associations are clever algorithms and pages of mathematical formulas filled with probability and optimization theory. Granted, there is definitely an abundance of both, but this mental picture also tends to obscure some of the more interesting recent developments in these fields: data-driven learning, and the fact that you are often better off developing simple, intuitive insights than complicated domain models meant to represent every attribute of the problem.

Back in early 2001, Eric Brill and Michele Banko published an incredibly interesting paper (Mitigating the Paucity of Data Problem), which showed that simply increasing the training set by orders of magnitude yielded significant improvements in the performance of common Machine Learning algorithms. In fact, in certain cases, you are simply better off working on getting more data than spending your time on improving the algorithm - think about that for a minute! We can even take this model to an extreme: data is the algorithm. How does Google Translate support 60+ languages? They don't employ any linguists. Instead, the translation is done through pure data-driven learning.

GoGaRuCo 2010: Machine Learning Trends & Patterns

The goal for my presentation at GoGaRuCo 2010 (slides) was to highlight the following themes: first, the algorithm is important, but we tend to overemphasize its importance; second, simple, intuitive insights are usually the underpinnings of much more (seemingly) complicated algorithms; third, data can be an algorithm in and of itself. None of these ideas are novel or my own. In fact, much of the recent credit goes to Peter Norvig for popularizing these ideas and driving their adoption at Google (Translate, Sets, Spelling correction, and so on) and throughout the industry at large.

"It's really quite simple"

Perhaps the largest mental barrier for most developers is that Machine Learning "is hard". It doesn't have to be. In fact, having an intuitive mental model is often more helpful than having a "tight proof of the error boundary" when it comes to implementing one of these systems in real life.

One of my favorite examples, and one that I shared in my presentation, is that of clustering. To cluster a set of objects, we need a function that measures the "pairwise distance" between any two of them. Can you think of a generic way to do so? How about using your favorite Zlib library?

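require 'zlib'
require 'pp'

# Only regular files: reading a subdirectory would raise an error.
files = Dir[File.join(ARGV[0], '*')].select { |f| File.file?(f) }

# Compressed size, in bytes, of the given files joined together.
def deflate(*files)
  data = files.collect { |f| File.binread(f) }.join("\n")
  Zlib::Deflate.deflate(data).bytesize
end

# Score every pair by the bytes saved when the two files are compressed
# together instead of separately: more shared data, higher score.
pairwise = files.combination(2).collect do |f1, f2|
  a, b = deflate(f1), deflate(f2)
  both = deflate(f1, f2)

  {:files => [f1, f2], :score => (a+b)-both}
end

# Print the 20 most similar pairs, highest score first.
pp pairwise.sort {|a,b| b[:score] <=> a[:score]}[0,20]

# > ruby clusterer.rb /path/to/dir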

The insight is simple: if any two objects, whatever data they may contain, have overlapping data inside, then the compressed size of the two files joined together should be smaller than the sum of their individually compressed sizes. Work through a few examples in your head. Once you prove it to yourself, it will seem remarkably simple.

Best of all, this is a generic insight which allows you to cluster virtually any type of data. In fact, the example above will faithfully detect and cluster different languages, separate PNG and MP3 files, and so on.
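If working through examples in your head is not convincing enough, the same score can be checked on in-memory strings. The following is only a minimal sketch: the helper names (compressed_size, shared_bytes) and the three sample inputs are made up for illustration, and the score is the same (a+b)-both quantity computed by the script above.

```ruby
require 'zlib'

# Size in bytes of the deflate-compressed string.
def compressed_size(data)
  Zlib::Deflate.deflate(data).bytesize
end

# Bytes saved by compressing a and b together rather than separately.
# This is the (a+b)-both score from the clustering script.
def shared_bytes(a, b)
  compressed_size(a) + compressed_size(b) - compressed_size(a + "\n" + b)
end

# Hypothetical sample inputs, chosen only for illustration.
ORIGINAL = "Clever algorithms and pages of mathematical formulas are " \
           "usually what people picture when asked to describe machine " \
           "learning. That picture tends to obscure a simpler idea: with " \
           "enough data, a plain and intuitive model can often match a " \
           "far more complicated one."

# Mostly the same text with one phrase swapped and a sentence appended.
NEAR_COPY = ORIGINAL.sub("machine learning", "statistics") +
            " Think simple and get more data."

# Seeded random lowercase letters: no real overlap with ORIGINAL.
rng = Random.new(42)
NOISE = Array.new(ORIGINAL.size) { rng.rand(97..122).chr }.join

puts "original vs near copy: #{shared_bytes(ORIGINAL, NEAR_COPY)}"
puts "original vs noise:     #{shared_bytes(ORIGINAL, NOISE)}"
```

The near copy should score far higher than the noise, because the compressor can encode most of the second text as references back to the first. Exact byte counts depend on the zlib build, so only the ordering of the two scores is meaningful here.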

Ensembles & Collaborative learning

That is not to say that having a solid understanding of the fundamentals of ML and AI fields is not important - it absolutely is. However, given a choice between a more complicated algorithm and more data, try more data first. Further, instead of trying to develop a complicated algorithm, which encompasses many attributes of your problem, try developing an ensemble method which builds on dozens of simple, intuitive insights.
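One way to picture an ensemble of simple insights is a majority vote over a handful of tiny rules. The sketch below is an illustration only, not a method from the presentation: the spam-versus-ham task, the three rules, their thresholds, and the names INSIGHTS and classify are all hypothetical.

```ruby
# Each "insight" is a tiny rule that votes :spam or :ham on a message.
# The rules and thresholds are hypothetical, picked only for illustration.
INSIGHTS = [
  # Many exclamation marks.
  ->(text) { text.count('!') >= 3 ? :spam : :ham },
  # Common bait words.
  ->(text) { text.match?(/free|winner|prize/i) ? :spam : :ham },
  # More than half of the characters are upper-case letters.
  ->(text) { text.scan(/[A-Z]/).size > text.size / 2 ? :spam : :ham }
].freeze

# Majority vote: the message is spam only if more than half the rules say so.
def classify(text, insights = INSIGHTS)
  votes = insights.map { |rule| rule.call(text) }
  votes.count(:spam) > insights.size / 2.0 ? :spam : :ham
end

puts classify("WIN A FREE PRIZE NOW!!!")
puts classify("Lunch at noon tomorrow?")
puts classify("You are the winner of our raffle")
```

No single rule is reliable on its own (the third message trips the bait-word rule but is outvoted by the other two), which is the point: each insight stays trivial to reason about, and the combination does the work. A real system would weight the votes using data rather than count them equally.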

Think simple, get more data, and see where it takes you - chances are, you will be surprised.

Ilya Grigorik is a web ecosystem engineer, author of High Performance Browser Networking (O'Reilly), and Principal Engineer at Shopify.
