HTTP Archive and Google BigQuery


The HTTP Archive's mission is to record the performance and makeup of the web.

In addition to the content of web pages, it’s important to record how this digitized content is constructed and served. The HTTP Archive provides this record. It is a permanent repository of web performance information such as size of pages, failed requests, and technologies utilized. This performance information allows us to see trends in how the Web is built and provides a common data set from which to conduct web performance research.

It does not track every domain; it limits itself to the Alexa Top 1,000,000 sites.

Starting in November 2011, the list of URLs is based solely on the Alexa Top 1,000,000 Sites (zip). Use the HTTP Archive URLs page to see the list of the top 10,000 URLs used in the most recent crawl.

Each URL is loaded 3 times via IE9 and iPhone 4, and the data from the median run is used to populate the HTTP Archive database.

  • The test agents are located in Redwood City, CA.
  • The default WebPagetest connection speed is used.
  • The cache is empty (“first view”).
  • Data is collected via a HAR file.
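
To make the "three runs, keep the median" idea concrete, here is a minimal Python sketch. It assumes each run is available as a HAR 1.2 style dictionary, that the median is chosen by the page's onLoad time, and that a request counts as failed when its status is 0 or at least 400. Those choices, the helper names (make_har, summarize_har, median_run), and the sample numbers are illustrative assumptions, not the HTTP Archive's actual code or data.

```python
import statistics


def summarize_har(har):
    """Return onLoad time, total response bytes, and failed-request count for one HAR dict."""
    log = har["log"]
    entries = log["entries"]
    # HAR uses -1 for an unknown body size, so clamp those to zero.
    total_bytes = sum(max(e["response"]["bodySize"], 0) for e in entries)
    # Assumption: status 0 (no response) or >= 400 counts as a failed request.
    failed = sum(
        1 for e in entries
        if e["response"]["status"] == 0 or e["response"]["status"] >= 400
    )
    on_load = log["pages"][0]["pageTimings"]["onLoad"]
    return {"on_load_ms": on_load, "total_bytes": total_bytes, "failed_requests": failed}


def median_run(hars):
    """Pick the run whose onLoad time is the median (expects an odd number of runs)."""
    summaries = [summarize_har(h) for h in hars]
    target = statistics.median(s["on_load_ms"] for s in summaries)
    return next(s for s in summaries if s["on_load_ms"] == target)


def make_har(on_load_ms, responses):
    """Build a minimal HAR-shaped dict from (status, bodySize) pairs; sample data only."""
    return {
        "log": {
            "pages": [{"pageTimings": {"onLoad": on_load_ms}}],
            "entries": [
                {"response": {"status": status, "bodySize": size}}
                for status, size in responses
            ],
        }
    }


# Three made-up runs of the same URL.
runs = [
    make_har(2400, [(200, 15000), (200, 40000)]),
    make_har(1900, [(200, 15000), (404, 300), (200, 40000)]),
    make_har(3100, [(200, 15000), (200, -1)]),
]
picked = median_run(runs)
print(picked)
```

A real HAR file could be read with json.load and passed to summarize_har in the same way, since the fields used here (pages, pageTimings.onLoad, entries, response.status, response.bodySize) are part of the HAR format.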

Data is available as a CSV or MySQL dump and is generated on the 1st and 15th of every month. Full instructions and files are available via their download page [http://httparchive.org/downloads.php].

The results from each crawl are saved as MySQL dump files in both MySQL format and CSV format. Dumps are made for both the desktop and mobile crawls.

Even easier: in 2013 Ilya Grigorik started pushing the data into Google BigQuery. His post has the information you need if you want to start running the queries yourself: https://www.igvita.com/2013/06/20/http-archive-bigquery-web-performance-answers/

Well, good news, now you can satisfy your curiosity in minutes (or seconds, even). The full HTTP Archive dataset is now available on BigQuery! To get started, signup for BigQuery and head to bigquery.cloud.google.com and …

I don’t know much about the history of bigqueri.es beyond its ‘about’ page. The value, however, comes from people asking questions about the HTTP Archive data set and many others providing example queries as starting points for your own curiosity: http://bigqueri.es/c/http-archive

Community for curious minds exploring big data with the help of Big Query!

So what can we do with this?

