Testing RDDB: RESTful Ruby Database

The release of RDDB generated a lot of buzz in the community. Another creation by Anthony Eden, a prolific Ruby coder, RDDB is a document-oriented, RESTful database. The codebase is still young, and yet it already provides three different storage engines, support for partitioning, caching, and custom views. I couldn't resist and decided to take it for a test-drive.

Selecting a storage engine

Anthony provided three different models for data storage and a Mongrel wrapper to abstract the engines, which makes the setup a breeze. The only decision you have to make is which underlying data engine to use: memory, file document, or Amazon S3. Depending on your application, you will have to weigh the benefits and tradeoffs of each of these engines. Memory is fast, the file document store provides persistence at the cost of speed, and Amazon S3 is slow as a general-purpose database but can provide 'infinite scalability'.

In addition, both the file document and Amazon S3 engines provide support for data partitioning and server-side, in-memory caching. By default, the cache is a simple Ruby hash, but it can easily be replaced with a 'smarter' structure such as an LRU cache, as long as the hash semantics are preserved in the process. This is a great feature as it allows you to minimize the cost of expensive S3 and local filesystem lookups.
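For illustration, here is what such a drop-in replacement might look like: a minimal LRU cache that keeps the `[]`/`[]=` hash semantics intact. The class name, capacity parameter, and eviction policy below are my own sketch, not part of RDDB.

```ruby
# A minimal LRU cache that preserves Hash-like semantics ([] and []=),
# so it could stand in for the default cache. Illustrative only.
class LruCache
  def initialize(max_size)
    @max_size = max_size
    @store = {} # Ruby hashes preserve insertion order (1.9+)
  end

  def [](key)
    return nil unless @store.key?(key)
    # Re-insert to mark the entry as most recently used
    value = @store.delete(key)
    @store[key] = value
  end

  def []=(key, value)
    @store.delete(key)
    @store[key] = value
    # Evict the least recently used entry once over capacity
    @store.delete(@store.keys.first) if @store.size > @max_size
    value
  end
end
```

Anything that reads and writes the cache through `[]` and `[]=` would be none the wiser, while hot documents stay in memory.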

RAM (memory) document store

# In-memory store; no configuration options required
options = {}
document_store ||= Rddb::DocumentStore::RamDocumentStore.new(options)

As you would expect, the implementation of the memory engine relies on an in-memory Ruby hash. At the moment, it provides no support for persistence, and behaves very much like the dearly beloved Memcached server, albeit without the distributed caching layer. The good: in this configuration RDDB is a pure-Ruby cache; it does not require any additional software packages, and it is very easy to set up. With clean RESTful semantics, this could be an appealing solution for some applications.

The bad: Ruby hashes are fast, but Memcached is still a lot faster. In addition, Memcached provides a distributed caching layer and a stable codebase with ongoing development, interface libraries in virtually every language, and a number of custom engines built on top.

File document store

# Partition database by author
ps = Proc.new { |document| document.author }
@document_store = Rddb::DocumentStore::PartitionedFileDocumentStore.new(:partition_strategy => ps)

The file document engine takes a very simple approach: concatenate all data into one file and keep a separate index that maps a document id to the start and end offsets of the Marshalled document within the file. When a GET request is received, the file handle is retrieved, a seek and a read are executed, data is un-Marshalled, and finally sent back to the user.

The good: easy to setup and provides data persistence. Data partitioning provides an additional layer of control and caching helps to alleviate the costly lookups.

The bad: concurrent lookups are a no-go. The current implementation executes a seek and a read on a shared file handle, which works great when all requests are serialized, but any time two or more Mongrel threads attempt this same sequence, the engine comes tumbling down. In addition, the number of data partitions is limited by the maximum number of open file handles in a Ruby process, which is not very high. These are serious flaws, and I hope Anthony will address them in future releases.

S3 document store

# Partition database by author
ps = Proc.new { |document| document.author }
@document_store = Rddb::DocumentStore::S3DocumentStore.new('bucket', {
  :s3 => {:access_key_id => 'access', :secret_access_key => 'secret!'},
  :partition_strategy => ps
})

This one acts as a wrapper around the popular aws-s3 gem, but otherwise behaves just like the file document store. All data is stored in a single object by default, and support for partitioning and local in-memory caching is present. In effect, RDDB acts as a proxy to Amazon's S3.

The good: like the file document store, this engine is easy to set up and follows the same semantics for caching, data partitioning, and index accesses.

The bad: concurrency is less of an issue for this engine, but the implementation of the data store is very inefficient. Every write to S3 results in a full download, concatenation, and a write-back to S3. Likewise, accessing a document requires downloading an entire database partition, followed by a range read. Unfortunately, this means that RDDB must act as a proxy for both reads and writes, and many of the advantages of S3 are lost in the process.
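To see why this hurts, here is a deliberately naive sketch of that read-modify-write cycle. The `s3` collaborator is a stand-in I made up for illustration (anything responding to `get`/`put`), not the aws-s3 gem's actual interface:

```ruby
# Illustrates the cost of the single-object layout: every write pulls
# down the whole partition, appends, and pushes it all back up.
# The s3 object is a hypothetical stand-in, not the aws-s3 gem's API.
class NaiveS3Store
  def initialize(s3, bucket)
    @s3 = s3         # must respond to get(bucket, key) and put(bucket, key, data)
    @bucket = bucket
    @index = {}      # id => [offset, length] within the partition object
  end

  def put(id, document)
    data = Marshal.dump(document)
    blob = @s3.get(@bucket, 'partition') || ''.b  # full download...
    @index[id] = [blob.bytesize, data.bytesize]
    @s3.put(@bucket, 'partition', blob + data)    # ...and full write-back
  end

  def get(id)
    offset, length = @index[id]
    return nil unless offset
    blob = @s3.get(@bucket, 'partition')          # whole partition fetched
    Marshal.load(blob[offset, length])            # just for a range read
  end
end
```

Storing each document (or at least each partition) as its own S3 object would let S3 serve reads directly and turn writes into single PUTs.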

Great start, and it should get better

In all, RDDB is a great example of how much can be accomplished with some thoughtfully laid out Ruby code. The codebase is tiny, and yet the product is amazingly feature-rich. Having said that, there is definitely a lot of work to be done before RDDB can become a production-ready system. I hope that Anthony will continue working on this project and that more developers will sign on to help with the cause!


Ilya Grigorik

Ilya Grigorik is a web performance engineer and developer advocate at Google, where his focus is on making the web fast and driving adoption of performance best practices at Google and beyond.