What does the HTTP Archive do?

The HTTP Archive tracks how the web is built. It provides historical data to quantitatively illustrate how the web is evolving. People who use the HTTP Archive data are members of the web community, scholars, and industry leaders:

  • The web community uses this data to learn more about the state of the web. You may see it come up in blog posts, presentations, or social media.
  • Scholars cite this data to support their research in major publications like ACM and IEEE.
  • Industry leaders use this data to calibrate their tools to accurately represent how the web is built. For example, a tool might warn a developer when their JavaScript bundle is too big, as defined by exceeding some percentile of all websites (see the sketch after this list).
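As a rough illustration of that last point, here is a minimal Python sketch of how such a percentile-based threshold could be derived. The bundle sizes and the choice of the 75th percentile are made up for illustration and are not taken from HTTP Archive data or any particular tool.

```python
# Hypothetical sketch: deriving a "JS bundle too big" warning threshold
# from a percentile of observed bundle sizes. The sizes below and the
# choice of the 75th percentile are illustrative, not HTTP Archive policy.
from statistics import quantiles

# Pretend these are JavaScript transfer sizes (in KB) measured across many sites.
js_bundle_sizes_kb = [120, 340, 95, 810, 450, 260, 1300, 75, 530, 220]

# quantiles(n=4) returns the 25th, 50th, and 75th percentile cut points.
p75_kb = quantiles(js_bundle_sizes_kb, n=4)[2]

def warn_if_too_big(bundle_kb: float) -> None:
    """Warn when a bundle exceeds the chosen percentile threshold."""
    if bundle_kb > p75_kb:
        print(f"Warning: {bundle_kb} KB exceeds the 75th percentile ({p75_kb:.0f} KB)")

warn_if_too_big(900)
```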

How does the HTTP Archive decide which URLs to test?

The HTTP Archive crawls millions of URLs on both desktop and mobile monthly. The URLs come from the Chrome User Experience Report (CrUX), a dataset of real-user performance data for the most popular websites.

How is the data gathered?

The list of URLs is fed to our private instance of WebPageTest on the 1st of each month.

As of March 1, 2016, the tests are performed on Chrome for desktop and emulated Android (on Chrome) for mobile.

The test agents are run from Google Cloud regions across the US. Each URL is loaded once with an empty cache ("first view") for normal metrics collection, and again in a clean browser profile using Lighthouse. The results of each test are saved as a HAR file. The HTTP Archive collects these HAR files, parses them, and populates various tables in BigQuery.
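A HAR file is a JSON document following the HAR 1.2 spec, with a log.entries array describing each request/response pair. The sketch below is not the actual pipeline code, only a minimal illustration of the kind of parsing involved; the file path and the summarized fields are assumptions.

```python
# Minimal sketch of parsing a HAR file, assuming the HAR 1.2 structure
# (log.entries, entry.request, entry.response). This is an illustration,
# not the HTTP Archive's actual processing pipeline.
import json

def summarize_har(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        har = json.load(f)

    entries = har["log"]["entries"]
    total_bytes = sum(
        e["response"].get("bodySize", 0)
        for e in entries
        if e["response"].get("bodySize", -1) > 0  # -1 means "unknown" in HAR
    )
    return {
        "requests": len(entries),
        "total_response_bytes": total_bytes,
        "first_request_url": entries[0]["request"]["url"] if entries else None,
    }

# "example.har" is a hypothetical local file exported from WebPageTest.
print(summarize_har("example.har"))
```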

How accurate is the data, in particular the time measurements?

Metrics such as the number of bytes and HTTP headers are accurate as of the time the test was performed. It's entirely possible that the web page has changed since it was tested. The tests are performed using a single browser; if a page's content varies by browser, this could be a source of differences.

The time measurements are gathered in a test environment, and thus have all the potential biases that come with that:

  • Browser - All tests are performed using a single browser. Page load times can vary depending on browser.
  • Location - The HAR files are generated from various datacenters in the US. The distance to the site's servers can affect time measurements.
  • Sample size - Each URL is loaded only once.
  • Internet connection - The connection speed, latency, and packet loss from the test location are additional variables that affect time measurements.

Given these conditions, it's virtually impossible to compare the HTTP Archive's time measurements with those gathered in other browsers, locations, or connection speeds. They are best used for comparisons within the HTTP Archive dataset only.

How do I use BigQuery to write custom queries over the data?

The HTTP Archive dataset is available on BigQuery. Be aware that, as a consequence of collecting so much metadata from millions of websites each month, the dataset is extremely large: multiple petabytes. Set up cost controls to avoid unexpected bills, and see our guide to minimizing query costs for tips on staying under the 1 TB per month free quota.
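One way to enforce a cost control is to cap the bytes a query may bill before it runs. The sketch below uses the google-cloud-bigquery Python client; the project ID is a placeholder and the table name is only an example of the dataset's naming, so verify the tables that exist in the current schema before running it.

```python
# Sketch: querying the HTTP Archive dataset with a hard cap on billed bytes.
# The project ID is a placeholder and the table name is an example only;
# confirm table names against the httparchive dataset in BigQuery.
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")  # placeholder project

query = """
SELECT COUNT(0) AS pages
FROM `httparchive.summary_pages.2022_06_01_desktop`  -- example table name
"""

# A dry run estimates bytes scanned without running (or billing) the query.
dry_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
dry_job = client.query(query, job_config=dry_config)
print(f"Estimated bytes processed: {dry_job.total_bytes_processed}")

# Abort the real job (instead of billing it) if it would scan more than ~10 GB.
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 1024**3)
for row in client.query(query, job_config=job_config).result():
    print(row.pages)
```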

Check out Getting Started Accessing the HTTP Archive with BigQuery, a guide for first-time users written by Paul Calvano.

For a guided walkthrough of the project, watch this in-depth 30-minute video featuring HTTP Archive maintainer Rick Viscomi.

If you have any questions about using BigQuery, reach out to the HTTP Archive community at discuss.httparchive.org.

What changes have been made to the test environment that might affect the data?

Oct 1, 2024: IPv6 Support added
IPv6 support was added to the crawler, allowing it to crawl sites over IPv4 or IPv6 based on Chrome's heuristics.
Aug 1, 2024: DNS detection unavailable for August 2024
DNS detection was unavailable for the August 2024 crawl, leading to a temporary drop in some detections.
Jan 1, 2023: Upgrade agents to Ubuntu 22.04
Upgraded the test agents from Ubuntu 18.04 to 22.04.
Jun 1, 2022: Mobile network throttling changed from 3G to 4G (faster performance)
Note: Lighthouse still runs at Fast 3G, so this change only affects the HTTP Archive crawl.
May 1, 2022: Response bodies unavailable for May 2022
Due to a crawl config error, we don't have response bodies in the May 2022 dataset.
May 1, 2022: Lighthouse desktop integration
Lighthouse testing was added to the desktop crawl, in addition to the mobile crawl (added in 2017).
Aug 1, 2020: Mobile CPU throttling changed (faster performance)
WebPageTest changed its mobile CPU throttling from cgroups to Chrome's built-in throttling. The cgroups had more overhead than expected when running in VMs; switching to Chrome's built-in throttling eliminated the extra overhead and resulted in more accurate throttling, which led to slightly faster performance.

See also: wptagent#366

Dec 15, 2018: Synced URLs with entire CrUX corpus
Jul 1, 2018: Switched to 1.3M CrUX URLs
Jun 1, 2017: Lighthouse integration
Lighthouse testing was enabled on mobile websites, which gives us access to many new modern web metrics like time to interactive and first meaningful paint, as well as audit results in categories like PWA, SEO, mobile, and more.
May 1, 2017: Switch to Linux test agents

See also: #98

Jan 1, 2017: Desktop data loss

See also: #74

Mar 1, 2016: Change agents from IE to Chrome
As of March 1, 2016, the tests are performed on Chrome for desktop and emulated Android (on Chrome) for mobile. Prior to that, IE 8 and 9 were used for desktop, and an iPhone was used for mobile.
Nov 15, 2013: Accidental IE 11 update
Normally we use IE 9 as our test agent, but sometime around Nov 15, 2013 our test agents started auto-updating to IE 11. In the Nov 15, 2013 crawl about 7.5% of websites were tested with IE 11. Although this affected some results, the impact was judged to be minimal, so we left the data intact. In the Dec 1, 2013 crawl about 47% of websites were tested with IE 11. This had a dramatic effect on results, with a 10x increase in failures and a drop from 71% to 34% in responses gzipped. (We are not sure what produced that change.) Because the results changed so significantly, the data from the Dec 1, 2013 crawl was removed.
Jun 24, 2013: Mobile speed decreased to 3G
The default connection speed for mobile was decreased to an emulated 3G network.
Mar 19, 2013: Desktop speed increased to Cable
The default connection speed was increased from DSL (1.5 Mbps) to Cable (5.0 Mbps). This only affects IE (not iPhone).
Oct 1, 2012: Stop tests at network idle
The web10 parameter in the WebPageTest API determines whether the test should stop at document complete (window.onload) rather than later, once network activity has subsided. Prior to Oct 1, 2012 the tests were configured to stop at document complete. However, lazy loading of resources (loading them dynamically after window.onload) has grown in popularity, so this setting was changed so that post-onload requests would be captured. This resulted in more HTTP requests being recorded, with a corresponding bump in transfer size.
Sep 1, 2012: Increase number of URLs
The number of URLs tested increased from 200K to 300K for IE, and from 2K to 5K for iPhone.
Jul 1, 2012: Use WPT for mobile testing
The HTTP Archive Mobile switched from running on the Blaze.io framework to WebPageTest. This involved several changes, including new iPhone hardware (although both used iOS 5) and a new geographic location (from Toronto to Redwood City).
May 1, 2012: Increase number of URLs
The number of URLs tested increased from 100K to 200K for IE.
Mar 15, 2012: Change agents from IE 8 to IE 9
The desktop test agents were switched from IE 8 to IE 9.

What are the limitations of this testing methodology?

The HTTP Archive examines each URL in the list, but does not crawl the website's other pages. Although the list of websites is well known, an entire website doesn't necessarily map well to a single URL.

Most websites consist of many separate web pages. The landing page may not be representative of the overall site. Some websites, such as facebook.com, require logging in to see typical content. Some websites, such as googleusercontent.com, don't have a landing page. Instead, they are used for hosting other URLs and resources. In this case, googleusercontent.com is the domain used for resources inserted by users into Google documents and similar services. Because of these issues and more, it's possible that the actual HTML document analyzed is not representative of the website.

What is a lens?

A lens focuses on a specific subset of websites. Through a lens, you'll see data about those particular websites only. For example, the WordPress lens focuses only on websites that are detected as being built with WordPress. We use Wappalyzer to detect over 1,000 web technologies and choose a few interesting ones to become lenses.
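To reproduce a lens-like subset in a custom query, one approach is to filter pages by detected technology. The sketch below is an assumption-laden illustration: the table (httparchive.technologies.*) and column (app) names follow an older schema and should be verified against the current dataset before use.

```python
# Hedged sketch: counting pages detected as WordPress, the same kind of
# filtering a lens applies. Table and column names are assumptions based on
# the legacy technologies tables; verify them against the current schema.
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")  # placeholder project

query = """
SELECT COUNT(DISTINCT url) AS wordpress_pages
FROM `httparchive.technologies.2022_06_01_mobile`  -- assumed table name
WHERE app = 'WordPress'                            -- assumed column name
"""

job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 1024**3)
for row in client.query(query, job_config=job_config).result():
    print(row.wordpress_pages)
```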

Lenses can be enabled at the top of any report, or by visiting the respective subdomain, for example wordpress.httparchive.org.

Who sponsors the HTTP Archive?

The HTTP Archive is sponsored by companies large and small in the web industry who are dedicated to moving the web forward. Our sponsors make it possible for this non-profit project to continue operating and tracking how the web is built.

See the full list of HTTP Archive sponsors.

How do I make a donation to support the HTTP Archive?

Donations in support of the HTTP Archive can be made through the Open Collective.

Who maintains the HTTP Archive?

The current core maintainers are Pat Meenan, Rick Viscomi, Paul Calvano, Barry Pollard, and Max Ostapenko.

Many people have contributed (1, 2) and helped make the HTTP Archive successful over the years. Special thanks to Steve Souders, who started the project in 2010, Pat Meenan who built the WebPageTest infrastructure powering the HTTP Archive, Ilya Grigorik, a long-time core maintainer, and Guy Leech and Stephen Hay for design help along the way.

Who do I contact for more information?

Please go to Discuss HTTP Archive and start a new topic.