Masking Latency & Failures with Squid

By Ilya Grigorik on August 05, 2009

Latency matters. Watching the interviews with the Yahoo developers on the launch of their new homepage, I was once again reminded of the great amount of effort they have put in to make it all work. In fact, I've written about Mark Nottingham's proposals for stale-while-revalidate and stale-if-error in the past (with a home-baked implementation), but a more realistic deployment scenario is to use a caching server like Squid or Varnish. Let's take a look at how these extensions can help your specific deployment.

HTTP Caching Extensions: stale-*

Both stale-while-revalidate and stale-if-error are the direct results of Mark Nottingham's work at Yahoo. It is clear that Squid plays a big role in the Y! infrastructure and these extensions are undoubtedly deployed in their data centers. Stale-while-revalidate addresses a simple problem: when a record becomes stale, instead of allowing the request to hit the application server, serve the stale data to the client and create an asynchronous request to update the cache (read more & spec). The benefit? All of your customers see consistent performance because the data is always served out of the cache (think RSS feeds, search results, etc).

The second extension (stale-if-error) can help you mask downtime by returning stale data while your ops team resolves the problem. How does it work? You specify a Cache-Control header (Cache-Control: max-age=600, stale-if-error=1200) which indicates how long after the record has expired the cache server may keep serving it if the application server is down. Of course, nothing stops you from setting this to a high value like a day or longer! This way, if your server goes down, at least the clients won't time out on their requests while you're resolving the problem!
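To make the header above concrete, here is a minimal sketch of how its directives could be read and turned into a time window. The helper name parse_cache_control is hypothetical (it is not part of Squid or Rack), and the parser is deliberately naive: it splits on commas and does not handle quoted values that contain commas.

```ruby
# Hypothetical helper: parse a Cache-Control value into a Hash of directives.
# Valueless directives (e.g. "no-cache") map to true; numeric values become Integers.
def parse_cache_control(header)
  header.to_s.split(",").each_with_object({}) do |part, directives|
    name, value = part.strip.split("=", 2)
    next if name.nil? || name.empty?
    value = value.delete('"') if value
    directives[name.downcase] =
      if value.nil? then true
      elsif value =~ /\A\d+\z/ then value.to_i
      else value
      end
  end
end

directives = parse_cache_control("max-age=600, stale-if-error=1200")

# The stale-if-error window starts once max-age has elapsed, so the object
# may still be served on errors until it is max-age + stale-if-error seconds old.
error_window_ends = directives["max-age"] + directives["stale-if-error"]
puts "serve stale on error until the object is #{error_window_ends}s old"
```

With the example header, the object is fresh for the first 10 minutes and may then be served for another 20 minutes if the application server is failing.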

Setting up Squid as a Reverse Proxy

Varnish and Squid are the two top choices when it comes to caching reverse proxies. Varnish has been gaining a lot of momentum, but unfortunately the features we're looking for are still in the works. There is an open ticket for stale-if-error support, and its stale-while-revalidate implementation does not provide the full asynchronous refresh model. For that reason, we're going to use Squid 2.7 (note: Squid 3.0 is a full rewrite of the 2.x branch, and while stale-* patches exist, they haven't officially made it into the tree). A minimal Squid config (if there is such a thing) to get us up and running:

# http_port public_ip:port accel defaultsite= default hostname, if not provided
http_port 0.0.0.0:80 accel defaultsite=yourdomain.com

# IP and port of your main application server (or multiple)
cache_peer 192.168.0.1 parent 80 0 no-query originserver name=main
cache_peer_domain main yourdomain.com

# Do not tell the world which Squid version we're running
httpd_suppress_version_string on

# Remove the Cache-Control header for upstream servers
header_access Cache-Control deny all

# log all incoming traffic in Apache format
logformat combined %>a %ui %un [%tl] "%rm %ru HTTP/%rv" %Hs %<st "%{Referer}>h" "%{User-Agent}>h" %Ss:%Sh
access_log /usr/local/squid/var/logs/squid.log combined all

squid.conf - Full Squid 2.7 Reverse Accelerator config
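Once the proxy is running, one quick way to check that responses are really coming out of the cache is to look at the X-Cache response header, which Squid normally adds (for example "HIT from proxy-hostname"). The sketch below assumes that header is present and not stripped by your configuration; cache_status and probe are hypothetical helper names, and the host is the placeholder domain from the config above.

```ruby
require "net/http"

# Classify an X-Cache header value such as "HIT from proxy.example".
# Returns :hit, :miss, or :unknown when the header is absent or unrecognized.
def cache_status(x_cache)
  case x_cache.to_s
  when /\AHIT\b/i  then :hit
  when /\AMISS\b/i then :miss
  else :unknown
  end
end

# Issue a GET through the proxy and report whether it was a cache hit.
# Only call this against a live deployment.
def probe(host, path = "/", port = 80)
  response = Net::HTTP.start(host, port) { |http| http.get(path) }
  cache_status(response["X-Cache"])
end

# Example (requires a running Squid in front of your application):
#   probe("yourdomain.com")
```

Calling probe twice in a row within the max-age window should typically report a miss followed by a hit.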

Connecting Squid and your Application

Once Squid is deployed as part of your request chain, taking advantage of its caching mechanism is a trivial matter, just add the "Cache-Control" header! A simple Rack application to make it all work:

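require "rack"
require "time" # for Time#httpdate

# A Rack app is called with the request environment
app = lambda { |env|
  p [:new_request, Time.now]

  headers = {
    # cache for 10 seconds, serve stale for up to 20s, and for up to 30s on errors
    "Cache-Control" => "max-age=10, stale-while-revalidate=10, stale-if-error=20",
    # Last-Modified must be an HTTP-date
    "Last-Modified" => Time.now.httpdate
  }

  # the body must respond to #each, so wrap the string in an array
  [200, headers, ["Hello World @ #{Time.now}"]]
}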

Rack::Handler::Mongrel.run(app, {:Host => "127.0.0.1", :Port => 80})

The only requirements are that we set a Last-Modified time and provide our preferred caching intervals. Max-age specifies the general lifetime of the object (TTL), while stale-while-revalidate indicates how long after max-age has expired the cache server can keep serving the stale data (total lifetime is max-age + stale-while-revalidate). Finally, stale-if-error indicates how long the cache server can serve the stale data if the application server is down; it is not unusual to set this higher than stale-while-revalidate. That's it!
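The interplay of the three intervals can be sketched as a small decision function. This is a simplified model for illustration, not Squid's actual logic: cache_action and its symbols are hypothetical names, and the origin_ok flag stands in for whether a fetch from the application server would succeed.

```ruby
# Simplified model: what may a cache do with a stored response of a given age?
# All durations are in seconds; origin_ok says whether the origin is healthy.
def cache_action(age, max_age:, stale_while_revalidate: 0, stale_if_error: 0, origin_ok: true)
  return :serve_fresh if age < max_age

  staleness = age - max_age # how long the object has been stale
  if origin_ok
    staleness < stale_while_revalidate ? :serve_stale_and_revalidate : :fetch_from_origin
  else
    staleness < stale_if_error ? :serve_stale_on_error : :return_error
  end
end

# The values from the Rack application above.
policy = { max_age: 10, stale_while_revalidate: 10, stale_if_error: 20 }

[5, 15, 25, 35].each do |age|
  up   = cache_action(age, **policy, origin_ok: true)
  down = cache_action(age, **policy, origin_ok: false)
  puts "age=#{age}s origin up: #{up}, origin down: #{down}"
end
```

For the example policy, an object that is 15 seconds old is served stale while a refresh happens in the background, and an object that is 25 seconds old is still served if the application server is down, but triggers a normal fetch if it is up.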

Masking Latency and Failures

Of course, these extensions do not excuse any of us from building fast and reliable web services. In theory, we would have no need for these extensions, but as the saying goes: "in theory, theory and practice are the same, in practice, they are not." If you have data that can be served slightly stale (frankly, the majority of data falls into this category), then Squid can make all the difference the next time your server decides to take a break at 4AM.

Ilya Grigorik is a web ecosystem engineer, author of High Performance Browser Networking (O'Reilly), and Principal Engineer at Shopify — follow on Twitter.