Efficient Updates & Data Import in Rails

/* Typical ActiveRecord insert * number of records = slow */
mysql> insert into widgets (title, price) values ('gizmo',5);
Query OK, 1 row affected (0.01 sec)

/* Multi-record insert with ar-extensions * 1 insert = fast */
mysql> insert into widgets (title,price) values ('gizmo',5),('super-gizmo',10),('hammer',6);
Query OK, 3 rows affected (0.01 sec)
Records: 3  Duplicates: 0  Warnings: 0

ActiveRecord is a fantastic database abstraction layer and is rightfully one of the core pillars of Rails, but that doesn't mean we can't improve it. In fact, most Rails developers become painfully aware of this fact whenever they try to import or update a large number of records. The process of instantiating each object, validating it, and doing an individual save in a transactional database can be rather painful. Thankfully, Zach Dennis wrote a great gem (ar-extensions), which among other things, gives us access to the native SQL 'import' functions geared at exactly this task: take a large update transaction, and do it all at once - fast.

SQL insert syntax - MySQL

If you look at the MySQL documentation, you'll notice that INSERT is not limited to a single set of values. In fact, we can provide as many objects as we want, as long as our query fits into the 'max_packet_size' of your database. Hence, by default Rails will issue an insert request per object, and our goal is to insert them all at once. It is also worth mentioning the "on duplicate key" directive, you'll see how handy it can be in just a second (hint: updates).

Extending ActiveRecord with ar-extensions

Zach's gem is as un-intrusive as it can get. Install it gem install ar-extensions, add a require statement in your controller and you're virtually done. Let's take a look at some of the available import scenarios:

# Assume a simple Widget model, with three columns: id, title, price

# Import an array of objects - single insert statement
widgets = [ Widget.new(:title => 'gizmo', :price => 5),
            Widget.new(:title => 'super-gizmo', :price => 10)]
Widget.import widgets

# Perhaps you don't want to instantiante the objects, no problem..
# - The order of values within the 'values' array should match the order of columns
columns = [:title, :price]
values = [['gizmo', 5], ['super-gizmo', 6]]
Widget.import columns, values

And now for bonus points: an update is really an insert, but with a duplicate condition. For example, if we make the title column unique, then inserting an object with an existing title will raise a 'duplicate key error', that is, unless we provide the 'on duplicate key' clause in our original query (very clever, aren't we!):

# Assume a simple Widget model, with three columns: id, unique(title), price

# Increment the price of widgets by one (from previous insert)
widgets = [ Widget.new(:title => 'gizmo', :price => 6),
            Widget.new(:title => 'super-gizmo', :price => 11)]
Widget.import widgets, :on_duplicate_key_update => [:price]

# Likewise, we can perform the same task without instantiating the objects
columns = [:title, :price]
values = [['gizmo', 6], ['super-gizmo', 11]]
Widget.import columns, values, :on_duplicate_key_update => [:price]

If you're going for speed, you can also omit ActiveRecord validations by providing a :validate => false option on your import query. At the very least, you should see a 10x performance improvement, which is something definitely worth working towards. For more great ar-extensions recipes, be sure to check out Zach's post on the RubyInside advent calendar, and the RDoc documentation itself.


Ilya Grigorik

Ilya Grigorik is a web performance engineer and developer advocate at Google, where his focus is on making the web fast and driving adoption of performance best practices at Google and beyond.