Before I worked here I wasn’t exactly sure what this kind of job consisted of. Of course this made for a very awkward résumé, since I had to get across that I knew a lot about computers without implying I was best suited for a completely different computer job, such as database management. But even for a while after I joined, there were moments when I was surprised, because what I had been expecting wasn’t anything like what actually happens. So I imagine there may be a lot of you who are curious about what kinds of issues can happen in a data center.
All the issues that I have to deal with can be split into three categories: hardware, software and networking. Since we sell only unmanaged servers and colocation, ideally I’d only be responsible for hardware and networking (a fourth category, environment, is important but not in any way under my purview). However, operating systems being what they are, those things break all the time and of course I have to fix them.
Hardware
The hardware issues we see most often are bad RAM and bad hard drives. Very rarely we have bad RAID cards or NICs, and once I had to replace a CPU. These are always fairly easy fixes, once the problems have been identified. The only real issue is when a customer loses data due to a failed hard drive. RAID can sometimes (but not always) prevent that, which is one of a million reasons why backups are so necessary.
Under hardware I’m also going to throw all the scheduled upgrades we do. Fooling with hardware is the easiest part of this job, except for inventory management, which gets tres annoying, but even that isn’t so bad. This is the same monkey stuff you did for your family when you were 14.
Network
I don’t think I’ve seen the network actually break, but customers fall off it all the time. 95% of the time, this is because of Red Hat Linux. Oh man, do I hate Red Hat. Don’t take this personal, if you like, use, work for, or are Red Hat (well, take it personal if you are Red Hat), but the network configuration in this OS is such a mess. So if you use Red Hat, and you reboot, and suddenly you can’t get on the network, it’s because the network scripts, which used to work just fine, thank you, decided they didn’t like where the default gateway was defined, and now expect it to be defined in another of the 735 different network configuration files, which lives in another directory from the file previously used. Haha!
This is, of course, only my opinion.
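If you want to see where your own box currently thinks the default gateway lives, here is a minimal sketch in Python. It assumes the two usual Red Hat locations, /etc/sysconfig/network and the ifcfg-* files under /etc/sysconfig/network-scripts; your release may differ. The function names find_gateways and candidate_files are made up for this example and are not part of any Red Hat tool.

```python
import re
from pathlib import Path

# Matches an uncommented GATEWAY=... assignment, with optional quotes.
GATEWAY_RE = re.compile(r'^\s*GATEWAY\s*=\s*["\']?([^"\'\s#]+)', re.MULTILINE)


def find_gateways(paths):
    """Return {path: gateway} for every readable file that sets GATEWAY."""
    found = {}
    for path in paths:
        p = Path(path)
        if not p.is_file():
            continue  # silently skip files that do not exist
        match = GATEWAY_RE.search(p.read_text(errors="replace"))
        if match:
            found[str(p)] = match.group(1)
    return found


def candidate_files(root="/etc/sysconfig"):
    """Assumed locations: the global file plus every per-interface file."""
    base = Path(root)
    return [base / "network", *sorted((base / "network-scripts").glob("ifcfg-*"))]


if __name__ == "__main__":
    for path, gateway in find_gateways(candidate_files()).items():
        print(f"{path}: GATEWAY={gateway}")
```

If it prints nothing, or prints more than one file with different values, that is probably a good place to start looking after a reboot takes you off the network.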
Usually network upkeep involves protecting our network from customers. If customers get cracked, they tend to become members of zombie networks, and the UDP floods they send out can slow things down for other customers. We tend to head those off by limiting the compromised customer’s connection.
Less often, but not rarely, customers become victims of DoS or DDoS attacks. In fact, there’s one going on right now. If you happen to know 208.185.250.11, tell him I said to knock it off. These are nearly always handled automagically by our network infrastructure, but it’s good to keep an eye on them.
Software
Oh boy. Broken software. Where to start?
Well, there are the usual suspects. By default, Windows will only allow two active Terminal Services sessions at a time (Windows 2003 allows you to connect to the console remotely, which can count as a third session). If you run out of these, and Windows doesn’t reset them for some reason, we have to visit the box to reset them manually.
Control panels have been known to become unstable. This seems to happen when a user tends to be familiar enough with the command line to use that, but also has a control panel installed. The CP can become confused if a file is edited manually. This is why Ensim (for example) changes the motd to inform users that, if they edit files, they have voided their warranty.
Remotely upgrading an OS is also tricky, kernel upgrades in particular.
Then there are the day-to-day surprises, like that time up2date got confused and uninstalled OpenSSH.
So there are a myriad of different software issues that actually crop up, but the best way to classify them would be in two categories: those that break the OS and access to it, and those that break the services the server provides. We probably have a 90/10 split between them. Very rarely will we get involved in customer setups; our customers generally prefer to have their own IT staff take care of it.
In a way it’s almost disappointing that we don’t get to do the real Sysadmin work (that is, configure client servers with actual solutions to actual problems, instead of just making sure they’re online). But that would be impractical for the number of clients we have, and they’d basically be paying for our on-the-job training as we learned about their (unique, sometimes bizarre) setups. So probably it’s just as good we don’t.