Ruby Algorithms: Sorting, Trie & Heaps

A good choice of an algorithm or a data structure can make or break an application. By and large, Ruby provides enough native primitives such as Array, Hash, and Set, to get you by in most cases, but we've all ran into situations where the performance or the memory footprint of these structures left us wanting for more.

Sensing a nice opportunity, Kanwei Li and Austin Ziegler partnered up for the '08 Google Summer of Code (GSoC) to translate some of the most well known algorithms and data structures into Ruby, resulting in implementations of nine different containers and ten different search and sorting algorithms! And while I would encourage you to explore the entire project, a few aspects deserve special attention: first, there is the question of performance of sorting algorithms in Ruby, and second, Trie, Heap and Priority Queues are data structures worth keeping in mind.

Sorting Performance in Ruby

Turns out, the default sorting implementation in Ruby is very hard to beat. As part of the GSoC project, Kanwei Li implemented the naive (bubble sort) and all the way up to some of the best in the trade (quicksort), but the performance of all of them is well below the native implementation.

A pass through the MRI source code reveals that the the Array.sort! is implemented via quicksort. And while the benchmarks do show a four times performance hit when compared to the raw C++ version sorting an array of integers, you have to remember that an equivalent implementation in Ruby is operating on array of integer objects! In other words, native Array.sort! is a highly optimized piece of code: use it, embrace it, and don't worry about it!

Trie: Ordered Tree Data Structure

Trie is both an incredibly useful and an often overlooked data structure. More commonly known as a 'prefix tree', it can be an very compact way to store and pattern match on a collection of objects. A good example is working with a set of strings which are used as keys to a value, but the same pattern could also be used to match on any sequence of objects. For example, a Trie can be used to implement a route pattern matcher in Rails. Of course, an example is worth a thousand words:

t = Containers::Trie.new

#       key     value
t.push("Hello", "World")
t.push("Hilly", "Billy")
t.push("Hello Bob", "Heya")

# regular lookup
t["Hilly"] # => "Billy"

# find longest matching prefix
t.longest_prefix("Hello stranger") # => "Hello"

# find all keys that start with H*llo (* is a wildcard)
t.wildcard("H*ll*") # => ["Hello", "Hilly"]

Heaps and Priority Queues

Heaps) and Priority Queues are often a drop in performance replacement for a hand sorted hash. Unlike a simple associative container, heaps and priority queues build up an internal sorted tree, which allows you to have constant time access to the highest (or lowest) priority object without having to sort or manually groom the container. Best of all, the underlying Ruby implementation uses a Fibonacci heap, which means constant time lookup and inserts, and O(log n) removal:

pq = Containers::PriorityQueue.new

pq.push("Job A", 15)
pq.push("Job B", 12)
pq.push("Job C", 33)

pq.pop # => "Job C"
pq.pop # => "Job A"
pq.size # => 1
algorithms.git - Ruby Algorithms

Exploring data structures & GSoC '09

The project also contains implementations of the KD-Tree, Deque, Stack), Suffix Array, and a few others, all of which are definitely worth exploring for educational value. For those who are curious, take a look at Red-Black Trees and Bloomfilters as well.

Kudos to Kanwei Li and Austin Ziegler for their great work and it is also worth highlighting that the student applications for the '09 projects are due the end of the end of next week! There is a great lineup of project ideas for Rails and some great mentors have recruited their time to help with the cause. If you're a student, or have a friend who is into Ruby and qualifies, do let them know! It is an amazing opportunity to learn from the best and make a great contribution at the same time.


Ilya Grigorik

Ilya Grigorik is a web performance engineer and developer advocate at Google, where his focus is on making the web fast and driving adoption of performance best practices at Google and beyond.