Summary:
At 10:00am on Monday 31st December we experienced a DoS (denial of service) attack in
the form of a compromised host on our network in InterXion. This caused serious packet
loss on our core network in DEG. It also caused DNS failures, because the DNS servers
could not communicate with the MySQL cluster that stores the DNS data.
There were 3 main causes of this issue.
1) A compromised host in InterXion flooded the core network with 100 Mbit/s of traffic.
2) Legacy DNS scripts caused the BIND include file generated by our DNS management
system to become empty, which stopped DNS from working.
3) The router upgrade last week left one link (which happens to carry all outbound
traffic) auto-negotiated at 100 Mbit/s instead of the usual 1000 Mbit/s.
These 3 issues compounded each other and caused the 2-hour outage we experienced this
morning.
* To remedy the issue we have checked all core router -> core switch ports to ensure
they are all running at the correct speed; a sketch of how such a check can be
automated follows this list.
* We're examining the scripts that write out DNS changes to ensure they don't break
the BIND includes if they are unable to contact the MySQL cluster that stores this
information; a sketch of a safer approach also follows this list.
* We're going to put in place an OOB (out of band) network access solution for all
core equipment, including all switches in each location where we have core network.
This will allow us to access equipment from another, non-Blacknight network in the
event we lose connectivity or have an issue such as we had today.
* We'll also set up blacknightstatus.com, which we'll host elsewhere, and ensure that
it is kept up to date with information as it becomes available to us regarding outages
and issues. We've had this in the pipeline for some time, but we didn't feel it was
necessary until now.
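The port-speed check from the first point above can be automated so that a
misnegotiated link is caught immediately rather than discovered during an outage.
Below is a minimal sketch of one way to do it, assuming the core devices answer SNMP
queries and using the pysnmp library; the hostnames, community string and interface
indexes are placeholders rather than our real configuration.

# Hypothetical sketch: poll core uplinks over SNMP and flag any port that has
# negotiated below the expected 1000 Mbit/s. All identifiers are placeholders.
from pysnmp.hlapi import (
    SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity, getCmd,
)

EXPECTED_MBPS = 1000

# (device, SNMP ifIndex) pairs for the router -> switch uplinks being checked
UPLINKS = [
    ("core-router-1.example.net", 1),
    ("core-switch-1.example.net", 24),
]

def port_speed_mbps(host, if_index, community="public"):
    """Return the negotiated speed (IF-MIB ifHighSpeed, in Mbit/s) of one port."""
    error_indication, error_status, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData(community),
        UdpTransportTarget((host, 161), timeout=2, retries=1),
        ContextData(),
        ObjectType(ObjectIdentity("IF-MIB", "ifHighSpeed", if_index)),
    ))
    if error_indication or error_status:
        raise RuntimeError("SNMP query to %s failed: %s"
                           % (host, error_indication or error_status.prettyPrint()))
    return int(var_binds[0][1])

if __name__ == "__main__":
    for host, if_index in UPLINKS:
        speed = port_speed_mbps(host, if_index)
        status = "OK" if speed >= EXPECTED_MBPS else "TOO SLOW"
        print("%s ifIndex %s: %s Mbit/s [%s]" % (host, if_index, speed, status))

Run from a monitoring host on a schedule, this turns the manual port check into an
alert whenever a core link drops below the expected speed.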
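For the DNS scripts mentioned in the second point, the behaviour we want is roughly:
never overwrite the BIND include with an empty file when the MySQL cluster is
unreachable, validate the newly generated file, and only then swap it into place
atomically. The sketch below illustrates that approach; the database name, table,
credentials and file path are placeholders, not our actual setup.

# Hypothetical sketch: regenerate the BIND include defensively. All names,
# paths and credentials below are placeholders.
import os
import subprocess
import tempfile

import MySQLdb  # assumes the MySQL-python / mysqlclient library

INCLUDE_PATH = "/etc/bind/zones.include"

def fetch_zone_stanzas():
    """Read zone definitions from the database; raises if it is unreachable."""
    conn = MySQLdb.connect(host="db.example.net", user="dns", passwd="secret", db="dns")
    try:
        cur = conn.cursor()
        cur.execute("SELECT name, zone_file FROM zones")
        rows = cur.fetchall()
    finally:
        conn.close()
    return ['zone "%s" { type master; file "%s"; };' % (name, path)
            for name, path in rows]

def write_include(stanzas):
    if not stanzas:
        # Refuse to clobber a working include with an empty one.
        raise RuntimeError("no zones returned from database; keeping existing include")
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(INCLUDE_PATH))
    with os.fdopen(fd, "w") as tmp:
        tmp.write("\n".join(stanzas) + "\n")
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file mode 0600
    # Only replace the live file once named-checkconf accepts the new one.
    subprocess.check_call(["named-checkconf", tmp_path])
    os.rename(tmp_path, INCLUDE_PATH)  # atomic on the same filesystem

if __name__ == "__main__":
    write_include(fetch_zone_stanzas())

If the database is down or returns no rows, a script written along these lines fails
loudly and BIND keeps serving from the last known-good include instead of an empty one.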
Further updates with fault diagnosis will follow once we return to full staffing
levels on Wednesday the 2nd of January 2008.
We apologise for this issue and the problems it has caused our customers.