Concerns over the privacy and security of web users have long surrounded the web advertising industry. Ad agencies have used increasingly sophisticated methods to profile and target users, including taking advantage of social media platforms like Facebook to serve ads to the audience perceived as most receptive. This has raised questions about the ethics of the industry and its collaborators, and has also motivated the development of ad-blocking technology as a countermeasure. At the same time, online content creators and businesses rely on ads as a basis for providing free services sustainably.
To investigate the spread of web advertising, an experiment was performed to estimate roughly what percentage of websites serve advertisements.
The sample data used as the basis for this experiment is sourced from the Common Crawl project. A total of 22,298 web crawl archives from May 2018 (the latest at the time of writing) were processed.
To determine whether a page serves advertisements, an Easylist ruleset file was used. The rules were transformed into valid regular expressions, and any options specific to ad-block extensions were stripped.
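The transformation of Easylist rules into regular expressions can be sketched as follows. This is a minimal illustration and not the code used in the experiment: the function name `easylist_rule_to_regex`, the choice to skip comments, element-hiding rules and exception rules, and the exact regex fragments substituted for `||`, `^` and `*` are all assumptions made for this example.

```python
import re
from typing import Optional

# Assumed regex stand-in for the filter "separator" character (^).
SEPARATOR = r"(?:[^\w.%-]|$)"


def easylist_rule_to_regex(rule: str) -> Optional[str]:
    """Convert one Adblock-style network filter into a regex string.

    Returns None for lines this sketch does not handle:
    comments, list headers, exception rules and element-hiding rules.
    """
    rule = rule.strip()
    if not rule or rule.startswith(("!", "[", "@@")):
        return None
    if "##" in rule or "#@#" in rule:
        return None

    # Strip extension-specific options such as "$third-party".
    # Simplification: everything after the first "$" is treated as options.
    rule = rule.split("$", 1)[0]
    if not rule:
        return None

    prefix = suffix = ""
    if rule.startswith("||"):
        # Domain anchor: approximate it as "start, slash or dot before the host".
        prefix = r"(?:^|[/.])"
        rule = rule[2:]
    elif rule.startswith("|"):
        prefix = "^"
        rule = rule[1:]
    if rule.endswith("|"):
        suffix = "$"
        rule = rule[:-1]

    # Escape everything, then re-enable the two filter wildcards.
    body = re.escape(rule)
    body = body.replace(r"\*", ".*").replace(r"\^", SEPARATOR)
    return prefix + body + suffix


pattern = easylist_rule_to_regex("||ads.example.com^$third-party")
print(pattern)
```

The resulting string can be compiled with any regex engine that accepts this syntax; the hostname `ads.example.com` is a placeholder.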
A utility was developed to parse the Common Crawl project’s WARC archives and evaluate the ruleset against web pages. The code for it is available here.
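For readers unfamiliar with the WARC format, the following is a rough, standard-library-only sketch of how response records can be pulled out of an archive. It is not the linked utility; the function names `iter_warc_records` and `html_pages` are hypothetical, and the parser assumes well-formed records with the header capitalisation Common Crawl uses (`WARC-Type`, `WARC-Target-URI`, `Content-Length`). A dedicated WARC library would be more robust in practice.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, content block) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separating records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line[:20])

        # Read "Name: value" header lines up to the blank line.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()

        length = int(headers.get("Content-Length", "0"))
        yield headers, stream.read(length)


def html_pages(stream: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (target URI, HTTP body) for every 'response' record."""
    for headers, block in iter_warc_records(stream):
        if headers.get("WARC-Type") != "response":
            continue
        # The block is a raw HTTP response: drop the status line and headers.
        _, _, body = block.partition(b"\r\n\r\n")
        yield headers.get("WARC-Target-URI", ""), body


# Tiny in-memory archive with one metadata record and one response record.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
sample = (
    b"WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Length: 3\r\n\r\nabc\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: " + str(len(payload)).encode() + b"\r\n\r\n"
    + payload + b"\r\n\r\n"
)
for uri, body in html_pages(io.BytesIO(sample)):
    print(uri, len(body))
```

Common Crawl archives are gzip-compressed, so a real file would first be opened with something like `gzip.open(path, "rb")`, which returns a stream that can be passed to `html_pages`.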
As the archives are located on a US East S3 bucket, executing the parser on an AWS US East EC2 instance is recommended due to the significantly higher bandwidth possible.
Despite this, for convenience of access, it was executed on a Digital Ocean droplet (sign-up via the link to receive $10 credit) with 8 cores and 16 GB of memory.
Of the 22,310,889 domains processed, 52.63% (11,742,112) were found to serve ads.
Ad-serving websites outnumber ad-free ones by 1,173,335, a gap of roughly 5 percentage points of all domains processed (about 11% more in relative terms).
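The figures above can be re-derived from the two reported counts. The short check below uses only the totals stated in the text; the variable names are chosen for this example.

```python
total = 22_310_889       # domains processed
ad_serving = 11_742_112  # domains found to serve ads

ad_free = total - ad_serving
share = ad_serving / total * 100           # share of domains serving ads
difference = ad_serving - ad_free          # absolute gap between the groups
gap_points = difference / total * 100      # gap as a share of all domains
relative_gap = difference / ad_free * 100  # gap relative to the ad-free group

print(f"ad-free domains: {ad_free:,}")
print(f"ad-serving share: {share:.2f}%")
print(f"difference: {difference:,} ({gap_points:.2f} points, {relative_gap:.1f}% relative)")
```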
A margin of error should be accounted for, as:
- Advertisements could be undetected on webpages, due to a lack of a suitable ad-matching pattern.
- False positives could arise from a regex rule incorrectly matching text that is assumed to be an ad.
- Considering the webgraph, the websites with which the crawler was initially seeded may be located in a neighbourhood that tends to serve advertisements.
- Web pages that are blank, under construction or under maintenance typically feature no content, so they are designated as ad-free. Under regular operation, however, the pages could feature ads.
The table below lists the top-level domains (TLDs) encountered during analysis, along with the number of sites registered under each TLD that serve ads versus those that do not, in descending order. The rankings of the TLDs closely align with the TLD popularity statistics provided by Statista. Only the top 50 TLDs are displayed. A complete list can be downloaded here.
|Domain||Ad Serving||Ad Free|
Intel Hyperscan Library Performance
The utility developed for this experiment, which matches advertisement patterns against the Common Crawl archives, is backed by Intel’s Hyperscan regex engine. The Hyperscan project team claims to offer the fastest performance compared with other PCRE-compatible engines. Performance tests contrasting Hyperscan with Google’s RE2 engine can be found on the official project website.
The time taken to match the set of patterns against each web page encountered was recorded throughout the experiment to offer some independent insight into performance. None of the recommended flags appropriate for the use case were enabled during pattern compilation. From local testing, the performance was satisfactory for the experiment. As can be seen from the statistics detailed below, the majority of web pages were matched in less than 10 milliseconds, with a right-skewed asymmetric distribution of matching times (a long tail of slow matches).
The following frequency table (with inclusive interval classes) shows the recorded pattern matching times per page.
|Time (ms)||Frequency||Cumulative Frequency|
|0 - 10||772,719,908||775,192,565|
|10 - 20||5,446,585||779,903,838|
|20 - 30||783,810||780,649,003|
|30 - 40||284,638||780,919,652|
|40 - 50||136,559||781,068,239|
|50 - 60||126,263||781,203,324|
|60 - 70||88,727||781,295,903|
|70 - 80||55,125||781,352,287|
|80 - 90||31,656||781,386,068|
|90 - 100||16,036||781,400,194|
|100 - 110||2,921||781,402,828|
|110 - 120||725||781,403,492|
|120 - 130||170||781,403,642|
|130 - 140||42||781,403,687|
|140 - 150||32||781,403,717|
|150 - 160||11||781,403,729|
|160 - 170||14||781,403,745|
|170 - 180||10||781,403,755|
|180 - 190||10||781,403,767|
|190 - 200||5||781,403,775|
|200 - 210||4||781,403,779|
|210 - 220||2||781,403,781|
|220 - 230||5||781,403,787|
|240 - 250||1||781,403,789|
|250 - 260||4||781,403,795|
|260 - 270||8||781,403,802|
|270 - 280||8||781,403,810|
|280 - 290||5||781,403,815|
|290 - 300||4||781,403,820|
|300 - 310||3||781,403,822|
|310 - 320||2||781,403,824|
|330 - 340||1||781,403,825|
|350 - 360||1||781,403,826|
|450 - 460||1||781,403,827|
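A table of this shape can be produced from raw per-page timings roughly as sketched below. This is an illustration only: Python's `re` module stands in for Hyperscan, the helper names `time_match_ms` and `frequency_table` are hypothetical, and the bins here are half-open (`[low, low + width)`), which is an assumption and may differ from the inclusive classes used for the table above. As in that table, empty intervals are simply omitted.

```python
import re
import time
from collections import Counter
from typing import Iterable, List, Pattern, Tuple


def time_match_ms(patterns: Iterable[Pattern[str]], page: str) -> float:
    """Return the wall-clock time, in milliseconds, to run every pattern over one page."""
    start = time.perf_counter()
    for pattern in patterns:
        pattern.search(page)
    return (time.perf_counter() - start) * 1000.0


def frequency_table(times_ms: Iterable[float], width: int = 10) -> List[Tuple[int, int, int, int]]:
    """Group timings into bins of `width` ms.

    Each row is (low, high, frequency, cumulative frequency).
    """
    counts = Counter(int(t // width) for t in times_ms)
    rows = []
    running = 0
    for bin_index in sorted(counts):
        running += counts[bin_index]
        rows.append((bin_index * width, (bin_index + 1) * width, counts[bin_index], running))
    return rows


# Made-up timings, purely to show the output format.
for low, high, freq, cumulative in frequency_table([1.5, 9.9, 10.0, 25, 452]):
    print(f"|{low} - {high}||{freq}||{cumulative}|")
```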
Header image derived from and courtesy of Wikimedia.