This is a special edition of CCC-News with information about why there was a 30 hour outage on email and web hosting.
In this edition:
30 hour server outage – what happened…
LG phone not so good on NextG?
The Computer of Doom!
30 hour server outage – what happened…
On Wednesday evening, the server that handles auzzie.net’s email, websites, and about 30 other websites hosted by CCC, went off the air.
I had received an email shortly before advising that a security upgrade was to be performed, so was not concerned, as this is usually a quick process. Not this time!
It turned out that the server had been infected by a special kind of virus known as a root kit. This makes the virus virtually invisible.
Additionally, this bug appears to be related to a new virus or worm that is doing the rounds, and even has the experts of the world baffled about where it came from and what it does!
The good news for now is that the server has been completely rebuilt, and upgraded to protect itself from this bug as best the admins know how.
Here is the explanation we received from WebHostsAustralia who manage the higher-level administration of the server. We’re on the Sydney server mentioned here…
Now that we have pretty well been able to return to normal services I would like to give a rundown on what has been causing the problems on the server
Just prior to xmas we noticed that the server had ceased to be able to create files or directories that were all numerical and 2 clients reported they were getting alerts from their anti virus programs when visiting their sites
These alerts however were not showing up for all people viewing the sites in fact none of us could see them and I for example had just upgraded to norton 2008 a few days before on this machine
At the same time one of the clients running norton 2005 was getting the alerts
After further investigation we established that the alerts were only showing up for some people using Internet Explorer and not Firefox or other browsers
We ran a full virus scan using a couple of different systems on the server and they showed no problems so we started searching google for anything related
While we did find the odd report of others having the same problem there was no indication of anyone knowing exactly what was causing the problems or of any solution
We therefore decided to instigate the server move in order to have a completely fresh install of all software on new hardware in a different data centre
This was the move at xmas and for 3 days everything ran fine before the problem reappeared
We then contacted CentOS (the operating system which runs on the server), cPanel and a server security company. None of them could identify the problem or suggest a solution
4 days ago we finally found some information as to what was causing the problem
From the links below you will see that even today no one including the top Internet security companies knows how it is getting into servers or how it can effectively be stopped
and that only 3 out of the top 33 anti virus programmes can even pick it up
As you may know we have our own hardware and racks in Brisbane along with servers we lease in 4 data centres in the US while this server in Sydney is leased from a company down there
Overnight on Tuesday we discovered one of the servers in Brisbane and one in the US had been attacked in the same way
By now we had been able to identify a company in Scotland with the ability to remove the infection from the server and we commissioned them to clean the 3 servers
As you may know we have our own hardware and racks in Brisbane and have complete access to them and the data centres in the US are fully and permanently staffed however this server in Sydney is leased from a company down there who do not have personnel in the actual data centre
In order for the Scots to access the server we needed a centos disk installed in the server and access set up for external log in by the techs
With the US and Brisbane servers we were able to set this up right away and the necessary work was completed in under an hour per server with no interruption to service
We submitted a request to our server provider in Sydney at 6am on Wednesday for this access to be set up
For some reason they did not treat this as a priority despite repeated communications to that effect from us and it was only in the late afternoon it was finally actioned
The security company then accessed the server and performed the fix however when they went to restart the server to complete everything it would not come back up
Apparently when the person was setting up the access for them they had somehow managed to delete the contents of the /boot partition on the server
This left us with a dead server and no one in Sydney who could physically go in and reinstall the necessary software
Because remote access was still set up we sourced a company in India (due to the time difference) who swore they could fix it “no problems”
Well they clearly couldn’t and we wasted 15 hours as a result
Our options at this point were to try and find someone else who could access the server and add the software to get it back up or wipe the server completely and reinstall everything from scratch
Because such a reinstall would take 12 – 18 hours I was still holding out hope that we could get it fixed and avoid this
We got the Scottish techs to have a look and they thought they could get it back up but the process required the disk in the server to be removed and inserted at various intervals, which proved to be very time consuming as someone had to be organised each time to do this
In the end I had to make the decision to cut our losses and reinstall everything
Steve has now worked overnight to do this and by now all sites should be back up and running although a few may still require some final tweaking to be at 100%
We originally obtained the server in Sydney as a one off to run our server monitoring system for our US and Brisbane servers from a remote location and as there was a lot of space free on the server we decided to offer Sydney based hosting on it
As a result of this incident I am set to terminate the hosting from this server and move all sites to a new server I am setting up in Brisbane today
The main reason for this is our necessity to rely on third parties should a problem arise with the server in Sydney as opposed to here in Brisbane where we can have one of our own staff in the Data Centre in approximately 15 minutes and logged into a server with a console
We will still be maintaining a server in Sydney to run our monitoring and these forums from but we would like to move all client sites to Brisbane as soon as people are ready
Given the level of connections within Australia there should be absolutely no noticeable difference in performance but there will be the security of greatly increased service in the event of any future issues
However should anyone wish to remain on this server they can do so
Once Steve and Michael have completed the remaining work on the server restore they will be contacting all clients on the server however should anyone not wish to wait please just pop a ticket in the help desk and we will start things rolling
If you have not done so already please have a read of the links I posted above as it is well worth seeing what we were up against
Note: No financial or physical contact information is stored on this server.
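For anyone who runs their own server: the first symptom described in the explanation above (the server refusing to create files or directories with all-numeric names) can be tested for with a short script. The following is only a rough sketch in Python. The function name and the test names "12345" and "67890" are my own invention, not anything supplied by the hosting company, and a failure only means the symptom is present. It could equally be caused by something ordinary such as a permissions problem or a full disk, so it is not proof of infection.

```python
import os
import tempfile


def can_create_numeric_names(base_dir):
    """Return True if an all-numeric directory and file can be created in base_dir.

    Hypothetical helper for illustration only. It checks for the symptom
    described in the newsletter, not for the infection itself.
    """
    dir_path = os.path.join(base_dir, "12345")   # all-numeric directory name
    file_path = os.path.join(base_dir, "67890")  # all-numeric file name
    try:
        os.mkdir(dir_path)
        with open(file_path, "w") as handle:
            handle.write("test")
    except OSError:
        # Creation failed: the symptom is present (or there is some
        # ordinary cause such as missing permissions).
        return False
    finally:
        # Remove anything this check managed to create.
        if os.path.isfile(file_path):
            os.remove(file_path)
        if os.path.isdir(dir_path):
            os.rmdir(dir_path)
    return True


if __name__ == "__main__":
    # Run the check inside a throwaway scratch directory.
    with tempfile.TemporaryDirectory() as scratch:
        if can_create_numeric_names(scratch):
            print("OK: all-numeric names can be created")
        else:
            print("WARNING: could not create all-numeric names")
```

The check works in a temporary directory and cleans up after itself, so it should be safe to run repeatedly. Point it at a directory inside a website's own folder if that is where the problem is suspected.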
During this outage I sent updates to affected customers via Windows Messenger and SMS. If you’d like to be on the list for any future issues, please write back with a mobile number or Messenger address that you would like to have used for this purpose.
In a future Newsletter I’ll talk about Gmail – and how you can use it to kill junk mail from other addresses. For now I’ll just mention that it’s a good idea to set up a Gmail address as a backup that you can use if you’re ever in need of one. It’s free from http://gmail.com/ – it’s a bit like Hotmail, but without the ads, etc.
LG phone not so good on NextG?
Unpleasant surprise number two for this week was the LG TU500 mobile purchased on eBay, in an attempt to get hands free working better in the car. It was not good – no bugger could hear me for a start, and the phonebook could not be transferred either; but that was not the worst of it…
I kept getting missed call messages – you know, the ones you have to pay through the nose to retrieve. I did some ‘scientific’ tests at home using two NextG phones in my pocket at the same time, calling both from a landline. The cheaper ZTE F252 phone worked each time, but the LG didn’t. Yet when on a desk they both showed equivalent coverage.
So, while the LG is maybe the better phone featurewise and possibly for outright coverage strength, the cheaper ZTE is more reliable when you’re teetering on the edge! The ZTE doesn’t wander off as much, or if it does, it is faster to re-register.
Notes: this was only a test with two handsets, so may not be scientifically accurate. Also, the TU500 was superseded by the TU550 recently. Both tested phones were in auto 2G/3G mode.
The Computer of Doom!
One of the newer capabilities of the upgraded Kingswood wiring is the ability to run a desktop computer direct from the battery. This rarely used feature is occasionally useful when some small task needs to be done, but I have no office available and don’t want to drive 15-30km to the home office.
It had its first official use Wednesday when a Glen Innes customer arranged to have a new modem installed after lightning blew up their existing one. (We met up in Guyra.)
So I plugged it in, turned it on and … nothing happened. No power light. Nothing. (Unpleasant surprise number three – they hadn’t at this point told me that it wouldn’t turn on – only that the phone line was dead if they connected the computer.) They’d wandered off for lunch, so I couldn’t reach them; so I left a note on their car that I’d be an hour, and headed back to the home office.
Theirs was an ancient computer bought second hand two years ago when I was still teaching at GALA, so worth ‘bugger all’ – and I have a shed wall of similar machines; but strangely, each one I put their hard drive in – the part of a computer that holds all the information – would not turn on. By the time I was on the third ancient (and a bit dusty) computer it finally dawned on me…
… The lightning had done so much damage that it had shorted out the hard drive! …and the CD drive too.
That drive had almost killed three computers, but fortunately after a break of a few minutes they all recovered.
So, in this case all their stuff is probably gone. They haven’t asked for it to be recovered, which may be possible by swapping parts of the drive, for which purpose I hold a stash of old drives. The in-car power was useful however in setting up their email account on the $15 replacement.