The scenario is a familiar one. You’re about to step out for a leisurely day on the links, at the beach or in the backyard hammock, and you couldn’t be more excited about your time off, when the little red light starts blinking on your phone: URGENT. You try to ignore it, but then the phone rings, and you know your afternoon is over.
Something broke, on some system, somewhere. And now you have to direct everyone to clean up the mess.
So how do we set ourselves up to keep those afternoons free? To keep the phone quiet no matter what?
A morbid but important joke among IT professionals is the "bus number": How many employees, and which ones, would leave your business in complete chaos if they were taken out by a bus? In other words, where is your knowledge concentrated; whom can’t you do without? A smart company won’t have a simple answer to this, or at least the answer will be "a whole lot of people."
IT knowledge should never reside in one person’s head; it must be spread among the team and written down as clearly as possible. With the crush of the day-to-day, finding the time to document structures and protocols can seem almost impossible. You’ve got to make that time, even if the documentation and the meetings that follow mean staying a little later during the workweek. Taking that knowledge out of a single head and spreading it among the team will pay off later.
The moral: You likely have redundant systems in place in case of a technical failure; build the same redundancy into your staff.
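One rough way to put a number on this is a skills matrix: list each system and the people who could run it unaided, then count. The sketch below is only an illustration under that assumption. The system names, the people and the helper functions are all hypothetical, and counting heads per system is a simplification of the bus number described above.

```python
# Hypothetical skills matrix: each system maps to the set of people
# who could run it without help. All names are made up for illustration.
coverage = {
    "email": {"ana", "ben"},
    "backups": {"ana"},
    "network": {"ben", "cho", "dee"},
}


def bus_number(coverage):
    """Return, for each system, how many people know it."""
    return {system: len(people) for system, people in coverage.items()}


def single_points_of_failure(coverage):
    """Map each person to the systems that only they know."""
    risks = {}
    for system, people in coverage.items():
        if len(people) == 1:
            (owner,) = people  # unpack the only member of the set
            risks.setdefault(owner, []).append(system)
    return risks


print(bus_number(coverage))
print(single_points_of_failure(coverage))
```

Any system with a count of one is where the documentation and cross-training time should go first.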
Once you’re confident that knowledge is spread sufficiently among team members, test the theory. Devising internal exercises is cheaper and can be effective, but the more successful (and yes, more costly) move is often to hire independent consultants to hack or poke holes in your infrastructure.
Buy everyone pizza, pay double overtime and take a weekend to train your team to handle problems and emergencies. It will pay off for you, your team and, most importantly, your company.
And as you educate your staff, periodically remove yourself and other team members from the drills. If you’re simulating an email issue, hold back a couple of key members of your email support team; do the same across other systems. Highlight the human fail points, whether a team member or a body of knowledge, and keep mapping out how to strengthen them.
Approaching problems with an incomplete team will not only teach the remaining members to cope but also allow those sitting on the sidelines to observe the issue, a perspective that generates fresh angles on how to solve and plan for problems.
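If you want everyone to take a turn on the sidelines, a simple rotation will do. The sketch below is a minimal illustration, not a prescribed method: the function name `drill_rosters`, the team list and the choice of two people held back per drill are all assumptions, and it expects each name to appear only once.

```python
def drill_rosters(team, hold_back):
    """Yield (sidelined, active) pairs, rotating who sits out each drill.

    team is a list of unique names; hold_back is how many people to
    remove from each drill. Both are hypothetical inputs.
    """
    n = len(team)
    if not 0 < hold_back < n:
        raise ValueError("hold_back must leave at least one person active")
    # Step through the team in blocks so everyone sits out at least once.
    for start in range(0, n, hold_back):
        sidelined = [team[(start + i) % n] for i in range(hold_back)]
        active = [person for person in team if person not in sidelined]
        yield sidelined, active


email_team = ["ana", "ben", "cho", "dee", "eli"]
rosters = list(drill_rosters(email_team, hold_back=2))
for number, (sidelined, active) in enumerate(rosters, start=1):
    print(f"Drill {number}: observing {sidelined}, responding {active}")
```

When the team size is not a multiple of the hold-back count, the last drill wraps around and sidelines someone a second time, which is harmless for a practice schedule.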
Often the worst IT disasters don’t come from anything operating on electricity, 1s and 0s, or faulty code—but from "natural causes." A water pipe bursts. The power goes out. Someone drops muffin crumbs through a server vent. Systems can be backed up to alleviate these problems, and often the issues are wholly preventable from the start.
Knowing the design of your buildings is as important as knowing the design of your systems. For small companies, the kitchen can be an appealing place to locate servers; the area is better equipped for higher power demands. The caveat is that the room is also heavily piped with water and people are carting food in and out all day. It’s always better to keep servers away from human traffic.
Going through the blueprints of your building and doing a walk-through with the owner (if it’s not you) will start to reveal the other physical fail points around the office. From there you can move servers or give them better protection. To test power, reserve time for live, full power cuts to your office and see how systems react. Backup generators and batteries have fail points too.
Like any team effort, IT measures require lots of practice. With enough rehearsal, you should be able to ensure that the game can still be played when the captain is on vacation.
By Caleb Garling