Consequently, if your site downloads page assets from multiple hosts — often referred to as domain sharding — make sure they all have separate IP addresses.
One of my pet projects at LOVEFiLM is improving the client-side performance of the site, and as part of this effort I recently implemented sharding for page assets. Here's what YSlow has to say about the performance benefits of splitting page components across multiple domains:
Split Components Across Domains
Splitting components allows you to maximize parallel downloads. Make sure you're using not more than 2-4 domains because of the DNS lookup penalty. For example, you can host your HTML and dynamic content on
www.example.org and split static components between
We set up CNAME records for
images4.lovefilm.com, and I implemented a consistent and (mostly) balanced hashing algorithm that determined which server a given asset should be served from. After that it was a simple matter of adding an output filter that converted single-domain static asset URLs into their sharded equivalents.
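To make the idea concrete, here is a minimal sketch of hash-based shard selection plus an output filter. It is not the code that ran on the LOVEFiLM site: the hostnames (`static.example.com`, `static1` to `static4.example.com`), the choice of CRC32 and the function names are all assumptions made for illustration. The one property it does share with the description above is that a given asset always maps to the same host, so browser caches stay warm.

```python
import re
import zlib

# Hypothetical shard hostnames; the real names and count are assumptions.
SHARD_HOSTS = [
    "static1.example.com",
    "static2.example.com",
    "static3.example.com",
    "static4.example.com",
]


def shard_host(asset_path):
    """Pick a shard for an asset path; the same path always maps to the same host."""
    # CRC32 is stable across processes and machines, unlike Python's built-in hash().
    digest = zlib.crc32(asset_path.encode("utf-8"))
    return SHARD_HOSTS[digest % len(SHARD_HOSTS)]


# Matches src/href attributes that point at the single (hypothetical) static host.
ASSET_URL = re.compile(
    r'(?P<attr>(?:src|href)=")(?P<scheme>https?)://static\.example\.com(?P<path>/[^"]*)"'
)


def shard_asset_urls(html):
    """Output filter: rewrite single-domain asset URLs to their sharded equivalents."""
    def replace(match):
        path = match.group("path")
        return f'{match.group("attr")}{match.group("scheme")}://{shard_host(path)}{path}"'

    return ASSET_URL.sub(replace, html)
```

Hashing the path and taking it modulo the number of hosts spreads assets roughly evenly, though how even depends on the paths themselves. URLs on any other host pass through the filter untouched.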
Once we flipped the switch, total page load time for the LOVEFiLM homepage dropped by a third. For that briefest of moments, life was good and I was a hero. Go me!
Timeout woes, SYN Flood to Host
Unfortunately, days later we started to get a steady trickle of customers complaining that they were getting timeout errors when accessing the LOVEFiLM website. They were reporting that the first page loaded, but most (though not all) of the images were broken. Any subsequent page requests all failed with a timeout error. Other than the symptoms, the customers had very little in common; ISPs, operating systems and browsers all seemed to be affected proportionately to our visitor stats.
The problem turned out to be caused by a well-intentioned but ultimately misguided setting baked into the stateful firewall built into certain consumer-grade ADSL routers. These routers track the number of unfinished TCP connections to each IP address, that is, outbound TCP connections where the SYN packet has been sent but the router has yet to see a SYN-ACK response from the server (otherwise known as embryonic connections). If the number of unfinished TCP connections to an individual IP address exceeds a given threshold, all subsequent packets to that IP are silently dropped for a period of 5 minutes. In the user's web browser, this results in timeout errors for any requests that did not make it through before the door was slammed shut.
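The behaviour described above can be modelled in a few lines. This is a toy simulation, not real router firmware: the class name, the default threshold of 10 and the example address `203.0.113.10` (a documentation-range IP) are assumptions, and real devices may count or expire connections differently.

```python
from collections import defaultdict


class EmbryonicConnectionLimiter:
    """Toy model of a per-destination limit on unfinished TCP connections."""

    def __init__(self, max_unfinished=10, block_seconds=300):
        self.max_unfinished = max_unfinished
        self.block_seconds = block_seconds
        self.unfinished = defaultdict(int)  # dest IP -> SYNs still awaiting a SYN-ACK
        self.blocked_until = {}             # dest IP -> time at which the block expires

    def outbound_syn(self, dest_ip, now):
        """Return True if the SYN is forwarded, False if it is silently dropped."""
        if self.blocked_until.get(dest_ip, 0) > now:
            return False
        if self.unfinished[dest_ip] >= self.max_unfinished:
            # This SYN would exceed the threshold: block the IP for the full period.
            self.blocked_until[dest_ip] = now + self.block_seconds
            self.unfinished[dest_ip] = 0
            return False
        self.unfinished[dest_ip] += 1
        return True

    def inbound_syn_ack(self, dest_ip):
        """The handshake progressed, so one connection is no longer embryonic."""
        if self.unfinished[dest_ip] > 0:
            self.unfinished[dest_ip] -= 1


# A browser opening 24 connections at once to a single IP (4 hostnames x 6 each).
router = EmbryonicConnectionLimiter(max_unfinished=10)
forwarded = [router.outbound_syn("203.0.113.10", now=0.0) for _ in range(24)]
print(forwarded.count(True), "forwarded,", forwarded.count(False), "dropped")
```

In this model the first ten SYNs get through and everything after that is dropped until the five minutes are up, which matches the symptom of a page that loads with most of its images broken.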
The setting in question is commonly labelled Maximum unfinished TCP/UDP connections per host. On some devices such as the Belkin F5D7630 this setting is configurable through a hidden page in the router's web-based admin interface, but on others the threshold is simply baked into the firmware and cannot be changed. Worse, some devices ship with a default value as low as 10 for this setting. Modern web browsers make anywhere between 6 and 15 HTTP connections per hostname, so loading static assets from more than one hostname on the same IP address is almost certain to trigger this rule.
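The arithmetic is worth spelling out. The sketch below adds up the worst-case number of simultaneous connections per destination IP for a set of sharded hostnames. The hostnames and the documentation-range IP addresses are hypothetical; the limit of 10 and the figure of 6 connections per hostname are the low-end values mentioned above.

```python
from collections import Counter


def peak_connections_per_ip(host_to_ip, connections_per_host):
    """Worst-case simultaneous connections per destination IP at page load."""
    peak = Counter()
    for ip in host_to_ip.values():
        peak[ip] += connections_per_host
    return dict(peak)


# Hypothetical hostnames and documentation-range addresses.
shared = {f"static{n}.example.com": "203.0.113.10" for n in range(1, 5)}
unique = {f"static{n}.example.com": f"203.0.113.{10 + n}" for n in range(1, 5)}

ROUTER_LIMIT = 10         # the low default value mentioned above
CONNECTIONS_PER_HOST = 6  # low end of the browser range mentioned above

for label, mapping in (("shared IP", shared), ("unique IPs", unique)):
    peaks = peak_connections_per_ip(mapping, CONNECTIONS_PER_HOST)
    over_limit = sorted(ip for ip, count in peaks.items() if count > ROUTER_LIMIT)
    print(label, peaks, "over limit:", over_limit)
```

With four hostnames on one IP, the browser can have 24 handshakes in flight to the same address, well past a limit of 10. Spread across four addresses, each IP sees at most 6.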
The only clue a user would have that their router was causing the connection to be blocked is the SYN Flood to Host entry in their firewall logs:
07/13/2010 21:02:38 **SYN Flood to Host** 192.168.2.4, 55112->> 220.127.116.11, 80 (from ATM1 Outbound)
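If you have a log export from an affected router, entries like the one above can be picked out with a regular expression. This is a sketch based solely on that single sample line, so the exact layout on other firmware is an assumption, and the function and group names are made up for illustration.

```python
import re

# Pattern inferred from the one sample entry above; other devices may differ.
LOG_ENTRY = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\*\*SYN Flood to Host\*\* "
    r"(?P<src_ip>[\d.]+), (?P<src_port>\d+)->>\D*"  # \D* skips spacing or wrap marks
    r"(?P<dst_ip>[\d.]+), (?P<dst_port>\d+)"
)


def parse_syn_flood_entry(line):
    """Return the fields of a 'SYN Flood to Host' entry, or None if no match."""
    match = LOG_ENTRY.search(line)
    return match.groupdict() if match else None
```

The `dst_ip` field is the interesting one: it tells you which server address the router has decided to block.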
I can only assume that this setting is an attempt to lessen the effectiveness of DDoS attacks from the client-side. A noble intention, to be sure, but preventing or lessening the effectiveness of DDoS attacks on websites is not something I would consider to be within the domain of a consumer-grade ADSL modem. By all means protect the user against inbound DoS/DDoS attacks, but blocking outbound traffic based on what the router manufacturer deems to be normal usage seems like a step too far.
Once I knew what I was looking for, a quick search revealed that Google Maps has also been bitten by this issue (though the helpful Google employee didn't seem to realise the full extent of the problem) and that even tech-savvy users don't know what's going on and may blame their web browser.
The solution: use unique IP addresses for sharded asset hosts
Having found the cause, the fix was simple: assign
images4.lovefilm.com their own IP address. I managed to get hold of a couple of the affected devices, and confirmed that making this change solved the issue.
Steve Souders' aforementioned performance recommendation in Even Faster Websites includes this advice:
Browsers enforce the "maximum connections per server" constraint based on the hostname in the URL, not the IP address to which it resolves… This is good news for people who want to split their content across multiple domains. It's not necessary to deploy additional servers. Instead, a CNAME record for the new domain can be used. Even though the domain names point to the same server, the browser still opens the maximum number of connections for each unique hostname
Splitting assets across hostnames is incredibly beneficial for performance, but in light of what we've learned I think the specific passage of advice above should be revisited. You don't need to have more than one server, but you should bind that server to multiple IP addresses and have each hostname use a different IP address.
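A quick way to check whether a set of sharded hostnames follows this advice is to resolve each one and look for addresses that appear more than once. The sketch below is one hedged way to do that. The function name and hostnames are hypothetical, and `socket.gethostbyname` returns only a single IPv4 address per name, so hosts with several A records or IPv6 addresses would need a more thorough check.

```python
import socket
from collections import defaultdict


def find_shared_ips(hostnames, resolve=socket.gethostbyname):
    """Group hostnames by resolved IPv4 address; return only IPs used more than once."""
    by_ip = defaultdict(list)
    for host in hostnames:
        by_ip[resolve(host)].append(host)
    return {ip: hosts for ip, hosts in by_ip.items() if len(hosts) > 1}


# Demonstration with a stand-in resolver so the example runs without DNS access.
# The hostnames and documentation-range addresses are hypothetical.
fake_dns = {
    "static1.example.com": "203.0.113.11",
    "static2.example.com": "203.0.113.12",
    "static3.example.com": "203.0.113.12",  # misconfigured: same IP as static2
}
print(find_shared_ips(fake_dns, resolve=fake_dns.__getitem__))
```

Calling `find_shared_ips` with just a list of hostnames uses real DNS lookups. An empty result means every hostname resolved to its own address.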
Addendum: Affected devices
The devices I've confirmed have this setting are:
- Belkin: F5D82XX, F5D76XX
- Philips SNA5630NS
- SMC Barricade SMC7404BRA
- 3Com: 3CRWE754G72A
All of these devices are fairly long in the tooth, but while they may not be cutting-edge network kit, they are still used by enough of our customers for it to show up as a problem. The Philips router was, once upon a time, rebadged and issued to customers of TalkTalk, the UK's most popular broadband provider, and it is widespread enough to have popped up on our radar.
I've not been able to find any modern devices which exhibit this behaviour. My Billion BiPAC 7800N does have a setting equivalent to the above – in this case labelled Maximum TCP Open Handshaking Count – but it defaults to 100 embryonic connections per second rather than 10 or 15 in total.