Yesterday at lunchtime there were some issues on our network.
I’ll try to explain what happened in simple terms, and also what we are going to do to avoid this type of issue arising in the future.
If anyone has any queries about the explanation, please feel free to ask in the comments or email us directly.
Timeline: 13:55 – 14:18
Affected Customers: Every customer behind the shared firewall was affected during this incident, including clients with dedicated servers, clients with colocated equipment, and all of our shared hosting clients.
What happened?
At around 2pm yesterday, a segment of our main network became sluggish, and customers would have experienced latency and packet loss.
Why?
As you may know, our main network is firewalled. We have a pair of firewalls set up in HA (high availability) to protect the bulk of our clients, which includes all our shared hosting clients on both Windows and Linux, as well as a large number of clients on dedicated servers or with colocated machines.
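For anyone curious what “HA” means in practice, the rough idea is that a standby unit constantly checks the active one and takes over if it stops responding. The toy sketch below is purely illustrative, not our actual setup: the hostname, the ping-based check and the thresholds are all invented, and real firewall pairs use dedicated heartbeat links and protocols rather than a script like this.

# Toy illustration of an HA health-check loop -- not our actual setup.
# Real firewall pairs use dedicated heartbeat links/protocols (e.g. VRRP),
# and the hostname below is made up.
import subprocess
import time

ACTIVE_FW = "fw-primary.example.net"   # hypothetical primary firewall
CHECK_INTERVAL = 2                     # seconds between heartbeats
MISSED_LIMIT = 3                       # fail over after this many missed checks

def is_alive(host):
    # One ICMP echo with a 1-second timeout; True if the host answered.
    result = subprocess.run(["ping", "-c", "1", "-W", "1", host],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0

def main():
    missed = 0
    while True:
        if is_alive(ACTIVE_FW):
            missed = 0
        else:
            missed += 1
            if missed >= MISSED_LIMIT:
                # A real standby would claim the shared (virtual) IP here.
                print("Primary unreachable - standby taking over")
                break
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()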
Firewalls are basically computers: depending on how much money you want to spend on them, you get different capabilities. While our firewalls are perfectly adequate under most conditions, they do have limits.
When a server behind the firewall was compromised and started pumping out large amounts of traffic, the firewalls were pushed to capacity. The network stayed up at all times, but it would have felt slow and unresponsive until our engineering team were able to take action.
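For readers who like the detail: a flood like this shows up clearly if you sample an interface’s byte counters twice and work out the rate, which is the kind of measurement behind the figures in the timeline further down. The sketch below is only an illustration of that idea; it is Linux-only and the interface name "eth0" is just an example.

# Illustrative sketch: estimate an interface's throughput by sampling the
# byte counters in /proc/net/dev twice (Linux only; "eth0" is an example name).
import time

def rx_tx_bytes(iface):
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[0]), int(fields[8])   # received bytes, sent bytes
    raise ValueError("interface not found: " + iface)

def mbit_per_sec(iface, interval=5):
    rx1, tx1 = rx_tx_bytes(iface)
    time.sleep(interval)
    rx2, tx2 = rx_tx_bytes(iface)
    # bytes -> bits -> megabits, divided by the sampling interval
    return ((rx2 - rx1) + (tx2 - tx1)) * 8 / 1_000_000 / interval

if __name__ == "__main__":
    rate = mbit_per_sec("eth0")
    # An extra ~80mbit/s on top of normal lunchtime traffic is the kind of
    # jump that pushed the shared firewalls towards their limit.
    print(f"eth0 is carrying roughly {rate:.1f} mbit/s")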
What action was taken?
The compromised server was disconnected from the network until the malicious code had been removed and the issue resolved.
How can we avoid this in the future?
We had been planning to upgrade the firewalls in any case; that upgrade is now being brought forward. The new firewalls will be able to carry larger amounts of traffic, so this kind of issue will have a lower impact should it arise again.
For the last few months we have also been actively encouraging clients to opt for their own firewall(s).
And now for the more detailed breakdown:
Outage Information with Timeline of Events
13:53 A C program was downloaded onto a customer’s machine via a hole in their application code.
13:55 The code was compiled and executed, generating roughly 80mbit/s of additional traffic heading towards the shared firewall service during peak lunchtime traffic.
14:05 Our engineering team noticed that SSH and terminal services connections to machines behind the firewall were laggy or intermittent.
14:06 Senior onsite engineers began to investigate the issue.
14:08 Engineers found one of our external traffic links carrying approx. 50mbit/s more traffic than normal (some traffic from the affected host never made it past the firewalls) and began checking access switches to identify which equipment cabinet held the infected host (a rough sketch of this kind of check follows the timeline).
14:15 The host responsible for the increase in traffic was identified and its switch port was shut down by a network engineer.
14:16 Services began to return to normal and the load on the firewalls’ CPUs dropped back to acceptable levels.
14:18 All services were back to normal.
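For completeness, the 14:08 to 14:15 steps above boil down to “find the switch port carrying the unusual traffic and shut it down”. Here is a very stripped-down sketch of that idea; the port names, rates and baselines are invented for the example, and on real kit these numbers come from the switches themselves rather than a Python dictionary.

# Illustrative sketch of the 14:08-14:15 hunt: given per-port traffic rates,
# pick out any port carrying far more than its usual baseline.
# The port names, rates and baselines below are invented for the example.

current_mbit = {"cab3-port12": 4.1, "cab3-port13": 86.7, "cab4-port02": 2.3}
baseline_mbit = {"cab3-port12": 3.8, "cab3-port13": 5.2, "cab4-port02": 2.5}
SPIKE_FACTOR = 5   # flag anything running at 5x its normal rate

def suspicious_ports(current, baseline, factor=SPIKE_FACTOR):
    # Return flagged ports sorted by how much traffic they are carrying.
    flagged = [(port, rate) for port, rate in current.items()
               if rate > baseline.get(port, 0.0) * factor]
    return sorted(flagged, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    for port, rate in suspicious_ports(current_mbit, baseline_mbit):
        print(f"{port}: {rate:.1f} mbit/s - candidate for shutdown")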
