Bitquark

The most popular subdomains on the internet

Fuzzing is fun. But fuzzing is even more fun when you have a solid wordlist to work with. When it comes to hunting down subdomains there are a few lists out there to plug into your fuzzer, but most are small, one-shot affairs. I set out to build a list of popular subdomains which was comprehensive and could be easily kept up-to-date.

For this project I needed to get hold of DNS records. A lot of DNS records. After trying various sources, I settled upon Rapid7's Project Sonar Forward DNS data set, which includes "... regular DNS lookup for all names gathered from the other scan types, such as HTTP data, SSL Certificate names, reverse DNS records, etc". Rapid7's data set uses a really nice mix of real-world sources and is regularly updated. Perfect.

The first challenge was how to handle 1.4 billion (68 GiB) raw DNS records in a reasonable amount of time; my first attempt at processing the data took well over a week to complete on a reasonably beefy server, hardly ideal for updating frequently.

Here's the process I needed to optimise:

  1. Trim non-domain name data and de-dupe.
    The data set contains all DNS record types (mx, txt, cname, etc), so dupes are common.
  2. Remove suffixes using a list built from the Public Suffix List.
    .com, .co.uk, .ninja, etc. need to go so we can properly distinguish subdomains from domains.
  3. Extract subdomains and tally up the number of times each occurs.
    One point per subdomain per domain to keep things fair.
  4. Sort the results by tally.
    And we're done!
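
As a rough illustration of the four steps above, here is a minimal Python sketch. It is not the code from the repo: the "name,type,value" record layout, the three-entry suffix set and all function names are assumptions made for the example, and it tallies the left-most subdomain label, which is the choice the Selection section below settles on.

```python
from collections import Counter

# Tiny hypothetical suffix set; a real one would be built from the Public Suffix List.
SUFFIXES = {"com", "co.uk", "ninja"}


def hostnames(records):
    """Step 1: keep only the name field and de-dupe.

    Assumes "name,type,value" lines, which is a guess at the record layout.
    """
    return {
        line.split(",", 1)[0].strip().lower().rstrip(".")
        for line in records
        if line.strip()
    }


def split_name(name):
    """Step 2: return (subdomain, domain) using the longest matching suffix."""
    labels = name.split(".")
    for i in range(1, len(labels)):  # smaller i means a longer candidate suffix
        if ".".join(labels[i:]) in SUFFIXES:
            return ".".join(labels[:i - 1]), ".".join(labels[i - 1:])
    return None  # no known suffix, or the name is a bare suffix


def popular_subdomains(records):
    """Steps 3 and 4: one point per subdomain per domain, sorted by tally."""
    seen = set()
    for name in hostnames(records):
        parts = split_name(name)
        if parts and parts[0]:
            leftmost = parts[0].split(".")[0]
            seen.add((leftmost, parts[1]))
    tally = Counter(sub for sub, _ in seen)
    # Highest tally first; ties are broken alphabetically for stable output.
    return sorted(tally.items(), key=lambda kv: (-kv[1], kv[0]))


records = [
    "www.example.com,a,192.0.2.1",
    "www.example.com,aaaa,2001:db8::1",  # duplicate name, different record type
    "mail.example.com,mx,mx.example.com",
    "www.test.co.uk,a,192.0.2.2",
    "example.com,a,192.0.2.1",           # no subdomain, so no points
]
print(popular_subdomains(records))  # [('www', 2), ('mail', 1)]
```

Holding everything in memory like this would not scale to 1.4 billion records; the sketch only shows the logic of each step.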

Optimisation

The second step, removing domain suffixes, took the most time: this part alone took days. I spent a good while trying out different solutions, from a parallel sed with a sizeable regex to native bash processing. I eventually settled on a Python script and was able to cut the processing time right down to just under 2 hours.
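
One way to strip suffixes without a giant regex is to load the suffix rules into a set and test each trailing run of labels against it, longest first. The sketch below is an assumption about how such a script could look, not the script from the repo; `load_suffixes` and `strip_suffix` are hypothetical names, and wildcard (`*.`) and exception (`!`) rules from the Public Suffix List are skipped for simplicity.

```python
def load_suffixes(psl_text):
    """Collect plain suffix rules from Public Suffix List formatted text."""
    suffixes = set()
    for line in psl_text.splitlines():
        parts = line.split()
        if not parts:
            continue  # blank line
        rule = parts[0]  # a rule ends at the first whitespace
        # Skip comments and, as a simplification, wildcard and exception rules.
        if rule.startswith("//") or rule[0] in "*!":
            continue
        suffixes.add(rule.lower())
    return suffixes


def strip_suffix(name, suffixes):
    """Return the name minus its longest matching suffix, or None if none match."""
    labels = name.lower().rstrip(".").split(".")
    for i in range(len(labels)):  # i = 0 tests the whole name, then shorter tails
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[:i])
    return None


psl = """
// A few rules in Public Suffix List format
com
uk
co.uk
ninja
*.ck
!www.ck
"""
suffixes = load_suffixes(psl)
print(strip_suffix("www.example.co.uk", suffixes))  # www.example
```

Each name costs only a handful of set lookups, however many suffixes are in the list, whereas a regex alternation has to cope with thousands of alternatives per line.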

Next I trained my eye on the system commands I was using. A considerable amount of time (many hours) was taken up by awk, sort, uniq and company. I did some research and in one fell swoop cut these down to mere minutes by setting the LC_ALL=C environment variable, which makes these tools treat text as raw bytes instead of applying locale-aware character handling and collation. That is a good fit for subdomains, which are almost always plain ASCII. For more information on why this works, see Jacob Nicholson's blog post on the topic.
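
To show what the setting does, this small sketch runs the system `sort` with LC_ALL=C forced from Python and compares the result with a plain byte-wise sort. It assumes a POSIX `sort` is on the PATH; `c_locale_sort` is a hypothetical helper name.

```python
import os
import subprocess


def c_locale_sort(lines):
    """Sort byte strings with the system `sort`, forcing the C locale."""
    env = dict(os.environ, LC_ALL="C")  # LC_ALL overrides every other locale variable
    result = subprocess.run(
        ["sort"],
        input=b"".join(line + b"\n" for line in lines),
        stdout=subprocess.PIPE,
        env=env,
        check=True,
    )
    return result.stdout.splitlines()


names = [b"ab", b"a-b", b"a1", b"www", b"Mail"]
# In the C locale, sort compares raw bytes, the same order Python uses for bytes.
print(c_locale_sort(names) == sorted(names))  # True
```

In many UTF-8 locales, by contrast, `sort` applies collation rules that treat punctuation and case differently, which is both slower and less predictable for hostnames.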

Selection

The next question was which part of the domain to use. Just the left-most subdomain? Explode and tally all subdomain parts? Use all of the subdomain parts as one entry? I tried the last approach and the results were more than a little messy. I didn't find much accord in how people structure their sub-subdomains.
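
To make the three options concrete, here is what each would tally for a single suffix-stripped subdomain. The function name and the dictionary keys are made up for this illustration.

```python
def selection_options(subdomain):
    """Return the entries each option would tally for one subdomain string."""
    labels = subdomain.split(".")
    return {
        "leftmost": [labels[0]],  # just the left-most subdomain
        "exploded": labels,       # every subdomain part counted separately
        "whole": [subdomain],     # all parts kept together as one entry
    }


# For dev.api.example.com the subdomain part is "dev.api".
print(selection_options("dev.api"))
# {'leftmost': ['dev'], 'exploded': ['dev', 'api'], 'whole': ['dev.api']}
```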

Going with the first option and using just the left-most subdomain I got really good results, with a clear list of winners emerging. I also found that using this list recursively covered the majority of sub-subdomain uses. Win-win. I didn't try splitting and using all subdomain parts, but I strongly suspect diminishing returns.
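
Using the list recursively could look something like the sketch below: every name that resolves becomes a new parent to run the same wordlist against. The `resolves` callable, the `max_depth` limit and the function name are assumptions for the example; here `resolves` is faked with a set so the code runs without touching the network.

```python
def enumerate_subdomains(domain, wordlist, resolves, max_depth=2):
    """Apply the wordlist to the domain, then to each hit, up to max_depth levels."""
    found = []
    frontier = [domain]
    for _ in range(max_depth):
        next_frontier = []
        for parent in frontier:
            for word in wordlist:
                candidate = f"{word}.{parent}"
                if resolves(candidate):
                    found.append(candidate)
                    next_frontier.append(candidate)
        frontier = next_frontier  # only hits are explored one level deeper
    return found


# Stand-in for real DNS lookups: a fixed set of names that "exist".
existing = {"www.example.com", "dev.example.com", "api.dev.example.com"}
wordlist = ["www", "dev", "api"]
print(enumerate_subdomains("example.com", wordlist, existing.__contains__))
# ['www.example.com', 'dev.example.com', 'api.dev.example.com']
```

For real use, `resolves` could wrap a lookup such as `socket.gethostbyname` and return False when it raises `OSError`.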

After the above optimisations the whole process takes around 2.5 hours of grunt work on a server with a 10-core (20-thread) Intel Xeon running at 3 GHz, with 48 GB of memory. Performance would likely be improved by feeding data from an SSD.

Subpop and the 2016-02-27 data set

The results

Without further ado, the 50 most popular subdomains as of 27th February 2016:

Count       Subdomain  Notes
20,395,943  www        Obviously!
1,090,647   mail
258,838     remote     Everyone loves remote access
168,575     blog
133,529     webmail
129,202     server     Creative
100,849     ns1
92,737      ns2
73,465      smtp
72,115      secure     If you need a subdomain for this you have issues
68,339      vpn
63,883      m          Lots of mobile sites
62,808      shop
60,777      ftp        Still going strong
58,484      mail2
44,481      test       Well, hello
44,115      portal
43,645      ns
43,624      ww1
42,235      host
40,726      support
40,107      dev        Hello again
37,666      web        When www isn't enough
37,345      bbs        Yes, really. Looking into this, many people equate "BBS" with "Forum"
37,131      ww42       Domains parked with a large domain squatter
37,069      mx
36,876      email
36,870      cloud      Fluffy
35,584      1
35,481      mail1
34,475      2
33,696      forum
31,291      owa        Good old Outlook
31,254      www2
30,392      gw
29,916      admin      Likely a good target
29,763      store
29,251      mx1
29,124      cdn
29,083      api
28,691      exchange
28,475      app
26,728      gov        Uhm
26,459      2tty       Mostly from .asia and .pw TLDs
26,229      vps
24,964      govyty     Mostly from .asia and .pw TLDs
24,951      hgfgdf     Mostly from .asia and .pw TLDs
24,768      news
24,521      1rer       Mostly from .asia and .pw TLDs
24,395      lkjkui     Mostly from .asia and .pw TLDs

There are some predictable subdomains in the list (www, mail, ftp, etc), some more unexpected results (I didn't expect bbs to be so popular), and some results near the bottom (and continuing off the bottom of the list), such as dsasa and hgfgdf, for which I haven't been able to find an origin but which seem to mostly fall under the .asia and .pw top-level domains. If you know anything about these, drop me a comment below.

Code

I've made the code available in the DNSpop GitHub repo in case you want to make any changes or maintain your own list. I've also posted the most popular 1,000, 10,000, 100,000 and 1,000,000 subdomains using the latest data set. The top 1 million list is probably overkill for most fuzzing applications, but the 10k and 100k lists offer pretty good coverage.

Thanks go to Stephen Haywood, who coincidentally maintains an AXFR subdomain list, for discussion around optimising the subdomain stripping process. Additional thanks to Motoko for reminding me that the pv command exists, allowing progress to be shown, and giving hope, during processing.