Subqueries and SQL Syntax Sugar Diet

In theory, theory and practice are the same, in practice they are usually not. Likewise, the syntactic sugar provided by modern SQL databases comes with many caveats. In MySQL a great example appears to be the subquery sugar of: SOME, IN and ANY. In particular, a query with an IN clause is theoretically equivalent to a UNION of several queries, albeit in a shorter format. And in fact, if you were to execute a UNION and an IN, you would get the same results. But here is the catch, don't expect the execution time to be identical also!

Putting your SQL on a diet

As any programmer, I tend to prefer the shorthand over the complete, and hence my SQL queries are often populated with neat syntactic tricks. Most recently, this syntax sugar-tooth addiction has led me to create the following query:

# Select widgets from a supplied list (often 500+ ids in the list)
SELECT name,price FROM widgets WHERE item_id in (1,2,3, ... , x)

The query is a dead simple union across multiple supplied ids, but as I found out, the query optimizer does not necessarily recognize it as such (MySQL 5.0.38). The execution plan returns the expected results, but the execution time is significantly slower when compared to a hand-crafted UNION. Let's see how we can reconstruct this query with vanilla SQL syntax:

# Generate SQL for individual queries
sql = widget_ids.collect { |w| "(SELECT name,price FROM widgets id = #{w})" }
  
# Take the union of all subqueries
sql = sql.join(" UNION ")

# Return all results and sort by price
result = Widget.find_by_sql("#{sql} order by price")

Sure enough, the returned result sets are identical, but for a query containing several hundred ids, the execution time appears to be dramatically different (826 unique ids):

require 'rubygems'
require 'active_record'

ActiveRecord::Base.establish_connection(
    :adapter => "mysql",
    :database => "yourdb",
    :username => "login",
    :password => "pass",
    :host => "localhost"
)

class Widget < ActiveRecord::Base;end

def time_query(&block)
  start = Time.now
  block.call
  puts "#{Time.now - start} seconds"
end

widgets = [1,2, ... , x] # (826 ids for test case)

# Syntax sugar galore : 8.565 seconds
time_query { 
    puts "IN query"
    user_stories = Entry.find_by_sql("SELECT name,price FROM widgets WHERE id in (#{widgets.join(',')})")
}

# SQL on a diet: vanilla UNION way : 0.773 seconds
time_query { 
    puts "UNION query"
    # Generate SQL for individual queries
    sql = widget_ids.collect { |w| "(SELECT name,price FROM widgets WHERE id = #{w})" }

    # Take the union of all subqueries
    sql = sql.join(" UNION ")
    
    # Return all results and sort by price
    result = Widget.find_by_sql("#{sql} order by price")
}

In the end, even with an additional ORDER BY clause, the UNION query is about 10 times faster (8.565s vs 0.773s). That's nothing to sneeze at! Perhaps it's time you put your SQL syntax on a diet as well?

Ilya GrigorikIlya Grigorik is a web ecosystem engineer, author of High Performance Browser Networking (O'Reilly), and Principal Engineer at Shopify — follow on Twitter.