The prevalence of Web advertising

Concerns over the privacy and security of web users have long surrounded the web advertising industry. Ad agencies use increasingly sophisticated methods to profile and target users, including social media platforms such as Facebook, to serve ads to the audience perceived as most receptive. This has raised questions about the ethics of the industry and its collaborators, and has motivated the development of ad-blocking technology as a countermeasure. At the same time, online content creators and businesses rely on ads to provide free services sustainably.

To investigate the spread of web advertising, an experiment was performed to estimate roughly what percentage of websites serve advertisements.

Method

The sample data used as the basis for this experiment is sourced from the Common Crawl project. A total of 22,298 May 2018 web crawl archives (latest at time of writing) were processed.

To determine whether a page serves advertisements, an EasyList ruleset file was used. The rules were transformed into valid regular expressions, and any options specific to ad-block extensions were stripped.
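The transformation can be illustrated with a minimal sketch. It is not the code used in the experiment: the function name `easylist_rule_to_regex` is hypothetical, only the most common network-filter syntax is handled (`||`, `|`, `^`, `*` and `$options`), and it assumes the resulting patterns are matched against URLs extracted from a page.

```python
import re

def easylist_rule_to_regex(rule):
    """Convert one EasyList network filter to a regex string (None if skipped)."""
    rule = rule.strip()
    # Skip blanks, comments, the list header, exception rules and element-hiding rules.
    if not rule or rule.startswith(("!", "[", "@@")):
        return None
    if "##" in rule or "#@#" in rule or "#?#" in rule:
        return None
    # Strip extension-specific options such as "$third-party,script".
    rule = rule.split("$", 1)[0]
    if not rule:
        return None
    # Filters written as /.../ are already regular expressions.
    if len(rule) > 2 and rule.startswith("/") and rule.endswith("/"):
        return rule[1:-1]
    prefix = suffix = ""
    if rule.startswith("||"):
        # "||" anchors at the start of a domain name, allowing subdomains.
        prefix, rule = r"^[a-z][a-z0-9+.-]*://(?:[^/?#]*\.)?", rule[2:]
    elif rule.startswith("|"):
        prefix, rule = "^", rule[1:]
    if rule.endswith("|"):
        suffix, rule = "$", rule[:-1]
    body = re.escape(rule)
    body = body.replace(r"\*", ".*")                 # "*" is a wildcard
    body = body.replace(r"\^", r"(?:[^\w.%-]|$)")    # "^" is a separator character
    return prefix + body + suffix

pattern = easylist_rule_to_regex("||ads.example.com^$third-party")
print(bool(re.search(pattern, "http://ads.example.com/banner.js")))
```

Rules that cannot be expressed this way are simply dropped here, which is one source of the undetected advertisements discussed in the results.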

A utility was developed to parse the Common Crawl project’s WARC archives and evaluate the ruleset against web pages. The code for it is available here.
As the archives are located in a US East S3 bucket, executing the parser on an AWS US East EC2 instance is recommended because of the significantly higher bandwidth available.
Despite this, for convenience of access, it was executed on a DigitalOcean droplet with 8 cores and 16 GB of memory.
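As a rough illustration of what such a utility does, the sketch below reads WARC records using only the Python standard library and records, per host, whether any response matched an ad pattern. It is not the author's utility: `iter_warc_records` and `scan_archive` are hypothetical names, Python's `re` module stands in for the regex engine, and hosts are used as a simplification of registered domains. A production parser would more likely rely on a dedicated WARC library.

```python
import gzip
import re
from urllib.parse import urlsplit

def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of archive
        if not line.startswith(b"WARC/"):
            continue                    # blank lines separating records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break                   # end of the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        payload = stream.read(int(headers.get("content-length", 0)))
        yield headers, payload

def scan_archive(stream, ad_patterns):
    """Map each host to True if any of its response records matches an ad pattern."""
    verdicts = {}
    for headers, payload in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        host = urlsplit(headers.get("warc-target-uri", "")).hostname
        if host is None:
            continue
        text = payload.decode("utf-8", "replace")
        hit = any(p.search(text) for p in ad_patterns)
        verdicts[host] = verdicts.get(host, False) or hit
    return verdicts
```

Common Crawl archives are gzip-compressed, so a local file (the name `example.warc.gz` is a placeholder) could be scanned with `scan_archive(gzip.open("example.warc.gz", "rb"), patterns)`, where `patterns` is a list of compiled regular expressions.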

Result

Of the 22,310,889 domains processed, 52.63% (11,742,112) were found to serve ads.

Websites serving advertisements outnumber those that do not by 1,173,335, a gap of roughly 5 percentage points (52.63% versus 47.37%).
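The figures above can be checked with a few lines of arithmetic. The sketch uses only the two counts reported in the result and distinguishes the gap in percentage points from the relative difference between the two groups.

```python
total = 22_310_889        # domains processed
ad_serving = 11_742_112   # domains found to serve ads

ad_free = total - ad_serving
difference = ad_serving - ad_free

share_serving = 100 * ad_serving / total          # percentage of all domains
gap_points = 100 * difference / total             # gap in percentage points
relative_excess = 100 * difference / ad_free      # relative to the ad-free count

print(f"ad-free domains:   {ad_free:,}")
print(f"difference:        {difference:,}")
print(f"ad-serving share:  {share_serving:.2f}%")
print(f"gap:               {gap_points:.2f} percentage points")
print(f"relative excess:   {relative_excess:.2f}%")
```

The gap of about 5 percentage points of all domains corresponds to roughly 11% more ad-serving domains than ad-free ones.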

A margin of error should be accounted for, as:

  • Advertisements could be undetected on webpages, due to a lack of a suitable ad-matching pattern.
  • False positives could arise from a regex rule incorrectly matching text that is assumed to be an ad.
  • Considering the web graph, the websites with which the crawler was initially seeded may lie in a neighbourhood that tends to serve advertisements.
  • Web pages that are blank, under construction, or under maintenance typically feature no content and are therefore designated ad-free. Under regular operation, however, these pages could feature ads.

The table below lists the top-level domains (TLDs) encountered during analysis and compares, for each, the number of sites serving ads with the number that do not, in descending order of ad-serving sites. The rankings of the TLDs closely align with the TLD popularity statistics provided by Statista. Only the top 50 TLDs are displayed. A complete list can be downloaded here.

Domain    Ad Serving      Ad Free
com        6,277,254    4,980,789
net          574,751      497,506
ru           520,743      334,954
org          463,608      449,501
de           395,971      665,355
uk           269,225      275,252
jp           230,345      210,191
cn           212,049      170,373
nl           177,744      204,308
pl           167,707      132,736
fr           152,400      189,921
it           135,943      176,304
br           125,822      115,197
info         113,818      101,620
cz           110,236      105,171
au            95,973       76,878
ca            76,713       73,359
se            74,489       67,533
eu            71,917       77,943
es            71,467       66,812
ua            68,459       48,760
ch            55,169       82,833
be            48,782       64,325
us            43,521       50,860
hu            42,386       38,545
at            42,066       69,047
in            41,621       36,320
biz           37,784       29,397
ro            36,656       26,702
dk            35,412       47,771
co            33,925       23,072
io            31,179       40,979
edu           30,675       35,842
mx            29,409       19,781
tw            28,622       35,577
cc            27,129       20,463
no            26,852       25,680
gr            25,877       27,757
za            25,386       26,240
nz            23,683       22,524
me            23,604       16,266
sk            23,225       25,454
fi            22,498       30,128
ir            22,209       24,827
tv            21,424       11,951
ar            21,402       21,202
kr            20,001       30,802
vn            19,807       13,687
pt            16,796       16,940
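The raw counts are easier to compare as proportions. The sketch below computes the ad-serving share for a handful of rows copied from the table; the selection of TLDs is arbitrary and the variable names are illustrative.

```python
# (ad serving, ad free) counts copied from the table above
tld_counts = {
    "com": (6_277_254, 4_980_789),
    "ru": (520_743, 334_954),
    "de": (395_971, 665_355),
    "tv": (21_424, 11_951),
}

# Fraction of sites under each TLD that were found to serve ads.
shares = {
    tld: serving / (serving + free)
    for tld, (serving, free) in tld_counts.items()
}

for tld, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{tld:<5}{share:6.1%}")
```

Viewed this way, some TLDs with large absolute counts, such as de, have more ad-free sites than ad-serving ones.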

Intel Hyperscan Library Performance

The utility developed for this experiment, which matches advertisement patterns against the Common Crawl archives, is backed by Intel’s Hyperscan regex engine. The Hyperscan project team claims to offer the fastest performance compared with other PCRE-compatible engines. Performance tests contrasting Hyperscan with Google’s RE2 engine can be found on the official project website.

The time taken to match the set of patterns against each web page was recorded throughout the experiment to offer some independent insight into performance. None of the recommended flags appropriate for the use case were enabled during pattern compilation. From local testing, the performance was satisfactory for the experiment. As the statistics below show, the majority of web pages were matched in less than 10 milliseconds, with a right-skewed distribution of matching times, that is, a long tail of slower pages.
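Per-page timing of this kind can be sketched as follows. This is only an illustration of the measurement: Python's `re` module is used as a stand-in for Hyperscan, so the absolute numbers would not be comparable, and `time_page_match` is a hypothetical helper name.

```python
import re
import time

def time_page_match(patterns, page_text):
    """Return (matched, elapsed_ms) for matching all patterns against one page."""
    start = time.perf_counter()
    matched = any(p.search(page_text) for p in patterns)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return matched, elapsed_ms

patterns = [re.compile(r"ads\.example\.net"), re.compile(r"/banner/.*/img")]
page = '<html><script src="http://ads.example.net/a.js"></script></html>'
matched, elapsed_ms = time_page_match(patterns, page)
print(matched, f"{elapsed_ms:.3f} ms")
```

A monotonic high-resolution clock such as `time.perf_counter` is preferable to wall-clock time for intervals this short.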


The following frequency table (with inclusive interval classes) depicts the recorded pattern matching times per page.

Duration (ms)     Frequency    Cumulative
0 - 10          772,719,908   775,192,565
10 - 20           5,446,585   779,903,838
20 - 30             783,810   780,649,003
30 - 40             284,638   780,919,652
40 - 50             136,559   781,068,239
50 - 60             126,263   781,203,324
60 - 70              88,727   781,295,903
70 - 80              55,125   781,352,287
80 - 90              31,656   781,386,068
90 - 100             16,036   781,400,194
100 - 110             2,921   781,402,828
110 - 120               725   781,403,492
120 - 130               170   781,403,642
130 - 140                42   781,403,687
140 - 150                32   781,403,717
150 - 160                11   781,403,729
160 - 170                14   781,403,745
170 - 180                10   781,403,755
180 - 190                10   781,403,767
190 - 200                 5   781,403,775
200 - 210                 4   781,403,779
210 - 220                 2   781,403,781
220 - 230                 5   781,403,787
240 - 250                 1   781,403,789
250 - 260                 4   781,403,795
260 - 270                 8   781,403,802
270 - 280                 8   781,403,810
280 - 290                 5   781,403,815
290 - 300                 4   781,403,820
300 - 310                 3   781,403,822
310 - 320                 2   781,403,824
330 - 340                 1   781,403,825
350 - 360                 1   781,403,826
450 - 460                 1   781,403,827
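A frequency table of this shape can be produced from raw timings with a short routine. The sketch below is an assumption about the method, not the author's code: it uses half-open classes (a value of exactly 10 ms falls into 10 - 20), whereas the table above is described as using inclusive classes, and `frequency_table` is a hypothetical name.

```python
from collections import Counter

def frequency_table(durations_ms, width=10):
    """Return rows of (lower, upper, frequency, cumulative) for non-empty classes."""
    counts = Counter(int(d // width) for d in durations_ms)
    rows = []
    running = 0
    for index in sorted(counts):
        running += counts[index]
        rows.append((index * width, (index + 1) * width, counts[index], running))
    return rows

sample = [1.2, 3, 9.99, 10, 15, 457]   # made-up timings in milliseconds
for lower, upper, freq, cumulative in frequency_table(sample):
    print(f"{lower} - {upper:<6}{freq:>6}{cumulative:>8}")
```

Empty classes are omitted, which matches the gaps in the table above, for example the absence of a 230 - 240 row.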

Addendum

Header image derived from and courtesy of Wikimedia.
