Network Explosion

Wired router spontaneously bursts into flames.

Well, that is not entirely true.  There was, in fact, not even any smoke, but one of our wired routers got extremely hot and failed in a worse way.  Instead of a spectacular failure, one that would be immediately identifiable, this router simply began dropping packets randomly, but only on certain paths and for certain protocols.  A complete reboot of the entire network (all servers and equipment) had no effect on the problem.

The issue manifested by almost completely cutting off my office machines from the internet.  I was not able to fetch my email or read web pages from my systems except (annoyingly) every once in a great while when the properly routed packets aligned correctly.  However, I have a rudimentary diagnostic script (pings from a batch file) that I use to identify the source of a failure, and it registered 100% success; I was able to ping any reasonable internet address and get a proper response.  Likewise, I was able to connect directly to the servers without any difficulty; it was only when trying to use them normally (via the internet) that I got no response.
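Since pings succeeded while email and web traffic failed, a check that opens real TCP connections would likely have exposed the problem sooner than ping alone.  What follows is only a minimal sketch of that idea in Python, not the batch file mentioned above; the function names and the default timeout are my own assumptions for illustration.

```python
import socket


def tcp_check(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def run_checks(targets, timeout=3.0):
    """Check each (host, port) pair and return a list of (host, port, ok)."""
    return [(host, port, tcp_check(host, port, timeout)) for host, port in targets]
```

Calling `run_checks([("example.com", 80), ("example.com", 443)])` (placeholder targets) returns one tuple per target, so a failure that only affects certain protocols shows up even when ping reports 100% success.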

Finding the source of failure became a bit more difficult because of the particularly aberrant behavior of this router.  The servers that reside behind that router were able to access the internet without any signs of a problem.  Complicating the matter even more was my own failure to confirm the network topology and, instead, incorrectly assuming that the wireless router (whose clients had no trouble, either) went through the same router.  Since my office seemed to be the only area affected, the obvious suspect was the local switch, or else the cable (or port) connecting it to the rest of the network.  I had, in fact, already sent out for replacements when I was able to determine (with about 80% certainty) that it was actually this odd failure of the router on the main network.

Replacement router serves its purpose, barely.

As usual, the router failure came at a very inopportune time, right in the middle of a big development push.  Instead of any network reconfiguration, I made the call to simply replace like for (almost) like.  In theory, I could just drop in the new router, configure it the same as the old router, and carry on.  The problem was that the old hardware was Linksys, of pre-Cisco vintage, and the available replacement was D-Link.  Most of the settings translated fairly directly, but differences in era and manufacturer meant that it took a little extra time to find everything and figure it all out.

The biggest issue, however, is that the new router has an apparent design problem not present in the old Linksys.  The replacement hardware cannot properly handle loopback connections.  The link explains this in detail, but the gist of a loopback connection error is that a router sends internal packets out to the internet even if they are destined for an address the router itself handles.  In other words, I can reach my servers behind the router using a private address, but if I try to use the public address (say, ‘sophsoft.com’), it sends my packets to Neverneverland.
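The symptom described here (often called NAT loopback or hairpinning) can be recognized from inside the network by comparing a connection to the private address against one to the public name.  The sketch below is one hedged way to express that comparison in Python; the function names and result strings are hypothetical, and any private address or port you pass in would be your own, since the post does not give them.

```python
import socket


def tcp_check(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose_loopback(private_ok, public_ok):
    """Interpret two reachability results gathered from inside the LAN."""
    if private_ok and public_ok:
        return "loopback works"
    if private_ok and not public_ok:
        # Server is up, but the router does not turn the public address around.
        return "possible loopback failure"
    if public_ok:
        return "private address unreachable"
    return "server unreachable"


def check_server(private_addr, public_name, port, timeout=3.0):
    """Run both checks from this machine and return the diagnosis string."""
    return diagnose_loopback(
        tcp_check(private_addr, port, timeout),
        tcp_check(public_name, port, timeout),
    )
```

Run from a machine behind the same router, something like `check_server("192.168.1.10", "sophsoft.com", 80)` (with an assumed private address and port) would report a possible loopback failure in the situation described above.  The same call from outside the network says nothing about loopback, because the private address is not reachable from there anyway.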

Fortunately, the problem only impacts machines behind the same router as the servers, which in practical terms means that it only affects me and my development systems.  I reconfigured a few settings here to work around the limitation in the hardware, and everything seems to be working fine.  The weird thing is that the rest of the world could reach the servers fine, but that was hard to accept when the systems closest to them (both physically and in network terms) could not.  I was able to test from other systems and from external services.  In particular, I found SuperTool from MxToolbox helpful.

In the midst of this, I also dealt with a stupidity problem with Linux, but that tale will have to wait for another day…