Fuzzing is fun. But fuzzing is even more fun when you have a solid wordlist to work with. When it comes to hunting down subdomains there are a few lists out there to plug into your fuzzer, but most are small, one-shot affairs. I set out to build a list of popular subdomains which was comprehensive and could be easily kept up-to-date.
For this project I needed to get hold of DNS records. A lot of DNS records. After trying various sources, I settled upon Rapid7's Project Sonar Forward DNS data set, which includes "... regular DNS lookup for all names gathered from the other scan types, such as HTTP data, SSL Certificate names, reverse DNS records, etc". Rapid7's data set uses a really nice mix of real-world sources and is regularly updated. Perfect.
The first challenge was how to handle 1.4 billion (68 GiB) raw DNS records in a reasonable amount of time; my first attempt at processing the data took well over a week to complete on a reasonably beefy server, hardly ideal for updating frequently.
Here's the process I needed to optimise:
- Trim non-domain name data and de-dupe.
The data set contains all DNS record types (mx, txt, cname, etc), so dupes are common.
- Remove suffixes using a list built from the Public Suffix List.
.com, .co.uk, .ninja, etc need to go so we can properly distinguish subdomains from domains.
- Extract subdomains and tally up the number of times each occurs.
One point per subdomain per domain to keep things fair.
- Sort the results by tally.
And we're done!
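The last two steps can be sketched in a few lines of Python. This is a minimal illustration, not the code from the project: the function name `tally_subdomains` is hypothetical, and it assumes the public suffix has already been removed from each name, so that "www.example.com" arrives as "www.example".

```python
from collections import Counter


def tally_subdomains(stripped_names):
    """Count left-most subdomains, one point per (subdomain, domain) pair.

    Each input is assumed to be a name whose public suffix has already
    been removed, e.g. "www.example" for "www.example.com".
    """
    pairs = set()  # the set de-duplicates repeated (subdomain, domain) pairs
    for name in stripped_names:
        labels = name.lower().strip(".").split(".")
        if len(labels) < 2:
            continue  # bare domain: there is no subdomain to count
        # Left-most label is the subdomain, right-most is the domain.
        pairs.add((labels[0], labels[-1]))

    counts = Counter(sub for sub, _ in pairs)
    # Highest tally first; ties are broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    sample = ["www.example", "www.example", "mail.example", "www.other"]
    for subdomain, score in tally_subdomains(sample):
        print(subdomain, score)
```

Collecting pairs in a set before counting is what enforces "one point per subdomain per domain": the duplicate "www.example" entry in the sample contributes only once.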
The second step, removing domain suffixes, was by far the slowest, taking days on its own. I spent a good while trying out different solutions, from a parallel sed with a sizeable regex to native bash processing. I eventually settled on a Python script and was able to cut the processing time right down to just under 2 hours.
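One way suffix stripping can be made cheap in Python is to replace the big regex with set lookups, testing each candidate suffix from longest to shortest. The sketch below is an assumption about how such a script might look, not the script from the project. `SUFFIXES` is a tiny hypothetical stand-in for a list built from the Public Suffix List, and the wildcard and exception rules of the real list are ignored.

```python
# Hypothetical stand-in for a suffix list built from the Public Suffix List.
SUFFIXES = frozenset({"com", "uk", "co.uk", "ninja"})


def strip_suffix(hostname, suffixes=SUFFIXES):
    """Remove the longest matching suffix from hostname.

    Returns None if no suffix matches or if nothing is left once the
    suffix has been removed (the name was itself a bare suffix).
    """
    labels = hostname.lower().rstrip(".").split(".")
    # i = 0 tests the whole name, i = 1 drops the first label, and so on,
    # so the longest candidate ("co.uk") is tried before the shorter "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[:i]) or None
    return None  # unknown suffix


if __name__ == "__main__":
    print(strip_suffix("www.example.co.uk"))
    print(strip_suffix("mail.example.com."))
```

Each lookup is a hash check against the set, so the cost per name depends on the number of labels in the name rather than on the size of the suffix list.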
Next I trained my eye on the system commands I was using. A considerable amount of time (many hours) was taken up by awk, sort, uniq and company. I did some research and in one fell swoop cut these down to mere minutes by setting the LC_ALL=C environment variable, which makes these tools compare raw bytes instead of applying locale-aware collation and is perfect for working with subdomains. For more information on why this works, see Jacob Nicholson's blog post on the topic.
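To illustrate the LC_ALL=C setting, here is a small Python sketch that runs the system `sort -u` with that variable set. The helper name `sort_unique_c_locale` is hypothetical, and the sketch assumes a Unix-like system with `sort` on the PATH; the original pipeline presumably set the variable directly in the shell.

```python
import os
import subprocess


def sort_unique_c_locale(lines):
    """Sort and de-duplicate lines with the system sort, forcing LC_ALL=C."""
    # Copy the current environment and override only the locale setting.
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(
        ["sort", "-u"],
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        env=env,
        check=True,  # raise if sort exits with an error
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    print(sort_unique_c_locale(["mail", "www", "Mail", "ftp", "www"]))
```

In the C locale the ordering is by byte value, so uppercase ASCII letters sort before lowercase ones. That differs from what many other locales produce, but it makes no difference when the goal is only to group identical names.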
The next question was which part of the domain to use. Just the left-most subdomain? Explode and tally all subdomain parts? Use all of the subdomain parts as one entry? I tried the last approach and the results were more than a little messy. I didn't find much accord in how people structure their sub-subdomains.
Going with the first option and using just the left-most subdomain I got really good results, with a clear list of winners emerging. I also found that using this list recursively covered the majority of sub-subdomain uses. Win-win. I didn't try splitting and using all subdomain parts, but I strongly suspect diminishing returns.
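Using the list recursively might look something like the sketch below: every name that is found becomes a new parent to prefix with the same words. The function `enumerate_recursive` and its `resolves` parameter are hypothetical. `resolves` stands in for whatever check your fuzzer performs (a DNS lookup, for instance), and the demonstration uses a fixed set of names rather than real lookups.

```python
def enumerate_recursive(domain, words, resolves, max_depth=2):
    """Try each word against the domain, then against every hit, and so on.

    resolves is a caller-supplied function that returns True when a
    candidate name exists; max_depth limits how many levels are explored.
    """
    found = []
    frontier = [domain]
    for _ in range(max_depth):
        next_frontier = []
        for parent in frontier:
            for word in words:
                candidate = f"{word}.{parent}"
                if resolves(candidate):
                    found.append(candidate)
                    next_frontier.append(candidate)  # explore below this hit
        frontier = next_frontier
    return found


if __name__ == "__main__":
    # Pretend these are the only names that exist.
    known = {"dev.example.com", "www.dev.example.com", "mail.example.com"}
    for name in enumerate_recursive("example.com", ["www", "mail", "dev"], known.__contains__):
        print(name)
```

The number of candidates grows with every hit at each level, so in practice the depth limit and a shorter list for the deeper levels keep the request count manageable.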
After the above optimisations the whole process takes around 2.5 hours of grunt work on a server with a 10-core (20-thread) Intel Xeon running at 3 GHz, with 48 GB of memory. Performance would likely be improved by feeding data from an SSD.
Without further ado, the 50 most popular subdomains as of 27th February 2016:
|Everyone loves remote access
|If you need a subdomain for this you have issues
|Lots of mobile sites
|Still going strong
|When www isn't enough
|Yes, really. Looking into this, many people equate "BBS" with "Forum"
|Domains parked with a large domain squatter
|Good old Outlook
|Likely a good target
|Mostly from .asia and .pw TLDs
There are some predictable subdomains in the list (www, mail, ftp, etc), some more unexpected results (I didn't expect bbs to be so popular), and some results near the bottom (and continuing off the bottom of the list), such as dsasa and hgfgdf, for which I haven't been able to find an origin but which seem to fall mostly under the .asia and .pw top-level domains. If you know anything about these, drop me a comment below.
I've made the code available in the DNSpop GitHub repo in case you want to make any changes or maintain your own list. I've also posted the most popular 1,000, 10,000, 100,000 and 1,000,000 subdomains using the latest data set. The top 1 million subdomains is probably overkill for most fuzzing applications, but the 10k and 100k lists offer pretty good coverage.
Thanks go to Stephen Haywood, who coincidentally maintains an AXFR subdomain list, for discussion around optimising the subdomain stripping process. Additional thanks to Motoko for reminding me that the pv command exists, allowing progress to be shown, and giving hope, during processing.