HTTP Archive + BigQuery = Web Performance Answers
HTTP Archive is a treasure trove of web performance data. Launched in late 2010, the project crawls over 300,000 most popular sites twice a month and records how the web is built: number and types of resources, size of each resource, whether the resources are compressed or marked as cacheable, times to render the page, time to first paint, and so on - you get the point.
The HTTP Archive site itself provides a number interesting stats and aggregate trends, but the data on the site only scratches the surface of the kinds of questions you can ask! To satisfy your curiousity, all you need to do is download and import ~400GB of raw SQL/CSV data. Easy, right? Yeah, not really. Instead, wouldn’t it be nice if we had the full dataset of all the HTTP Archive data to query on demand, and with ad-hoc questions?
Google BigQuery + HTTP Archive
Well, good news, now you can satisfy your curiosity in minutes (or seconds, even). The full HTTP Archive dataset is now available on BigQuery! To get started, signup for BigQuery and head to bigquery.cloud.google.com and click the down arrow beside “API Project”:
Switch to project > Display project > enter "httparchive"
Once the project is imported, expand the project in left navigation and you’ll see a collection of tables for individual pages and request data for each run of the HTTP Archive crawler. From there, pull up the SQL console, and go nuts! For example, let’s start simple: what is the median time to first render?
SELECT NTH(50, quantiles(renderStart,100)) median, NTH(75, quantiles(renderStart,100)) seventy_fifth, NTH(90, quantiles(renderStart,100)) ninetieth FROM [httparchive:runs.2013_06_01_pages]
And the answer is: 2.2s median, 3.3s for 75th percentile, and 4.7s for 90th percentile. BigQuery provides a number of convenience functions, such as quantiles, statistical approximations, extended regular expression matching and extraction, and a lot more. Check out the query reference and query cookbook to learn more.
One JS framework is not enough!
SELECT pages.pageid, url, cnt, libs, pages.rank rank FROM [httparchive:runs.2013_06_01_pages] as pages JOIN ( SELECT pageid, count(distinct(type)) cnt, GROUP_CONCAT(type) libs FROM ( SELECT REGEXP_EXTRACT(url, r'(jquery|dojo|angular|prototype|backbone|emberjs|sencha|scriptaculous).*\.js') type, pageid FROM [httparchive:runs.2013_06_01_requests] WHERE REGEXP_MATCH(url, r'jquery|dojo|angular|prototype|backbone|emberjs|sencha|scriptaculous.*\.js') GROUP BY pageid, type ) GROUP BY pageid HAVING cnt >= 2 ) as lib ON lib.pageid = pages.pageid WHERE rank IS NOT NULL ORDER BY rank asc
JQuery is the clear crowd favorite, but prototype (despite its age) is holding strong. Worse, there are many sites with three or four different frameworks, and even different versions of each one on the same page!