Why do so many things in our lives break on holidays, nights, and weekends? Over the years I have had way too many pipes leak, water heaters spring a leak, and furnaces die at the worst times.
The same is true for technology. Most of our clients have limited resources, staff, and budgets, so weekends are generally left on "autopilot" - hoping nothing bad happens and hoping they walk in the door Monday to an uneventful morning.
Unfortunately for one of our larger customers, Saturday came and very bad things started happening to their storage area network (SAN). This storage device contained most of their critical file servers and data. Without it their entire operation was dead - affecting about 12,000 users. The system in question has redundant components for all of its critical parts, along with spare drives that immediately and automatically replace a failed drive without user intervention to maintain maximum redundancy and performance. It is world-class equipment.
On Saturday the first alert came. They lost a drive. It happens. The system immediately grabbed a spare and automatically fixed the issue, as it was designed to do. Unfortunate, but everything was working as designed. Then a second drive failed. Again the system automatically grabbed a spare and fixed the issue. Highly unusual, but again the system was designed to deal with that automatically. Then a third drive died! That is probably a once-in-a-decade (or rarer) failure event. The spares were gone! It was now Sunday morning. Fortunately the system was designed with what is called a RAID set. It now had one less drive than it needed to be completely functional, but it was using basic X x Y = Z algebra (if you know any two of the values, you can solve for the third) to re-create the missing drive's data on the fly and keep the system running until a real drive could be obtained and manually inserted.
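For the technically curious, here is a minimal sketch of that "solve for the missing value" idea. It assumes a single-parity layout in which the parity block is the XOR of the data blocks; the actual RAID level and internals of this customer's SAN are not stated here, and the block contents and the `xor_blocks` helper are purely hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe: three data blocks, each living on a different drive.
data = [b"ACCT", b"MAIL", b"FILE"]

# The parity block is stored on a fourth drive.
parity = xor_blocks(data)

# Simulate losing the drive that held data[1]: only these blocks survive.
survivors = [data[0], data[2], parity]

# XOR-ing everything that is left yields the missing block.
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])
```

The same calculation has to run for every read that touches the missing drive, which is why a degraded array keeps working but has no safety margin left: with single parity, a second loss in the same set leaves too little information to solve for anything.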
That Sunday morning this absolutely vital piece of equipment was one failure away from total collapse. The stark reality was that everything these 12,000 users relied on would simply cease to exist. That meant implementing the real disaster recovery plan and probably a couple of days of downtime to rebuild and restore everything. If anything else went wrong, organizational chaos would ensue as the users struggled to get by without their critical systems.
And while all this carnage was going on, the client was asleep, unaware of how close to destruction they were.
Fortunately our client subscribes to CSI's Paladin Sentinel Remote Monitoring service. This provides them 24x7x365 monitoring coverage of their critical systems. As these events unfolded through the weekend, Paladin began notifying our technical staff. As we watched the events continue into Sunday morning, the alarms started going off and our technical staff quickly realized that this was no routine failure and what was at stake for our client. Since we have emergency after-hours support, phone calls were made to the vendor and to the emergency contact for our client. The SAN was under 4-hour on-site 24x7x365 service. However, this was so unusual a failure that the vendor only had two replacement drives available in the region on a Sunday. At 6pm on Sunday our systems engineer met the customer and the vendor courier on-site and swapped the two drives. The rebuild process started automatically. By 6:30 everyone was going home. Based upon the rebuild rate we estimated another 5 hours to fully re-establish redundancy. We weren't out of the woods yet. If anything happened in the next 5 hours, our client was still going to be dead. Our technical staff continued to use our Paladin Sentinel Remote Monitoring system to remotely monitor the rebuild progress well into the evening until we were absolutely sure that they were completely safe. Then the third and final drive was replaced in a routine manner on Tuesday.
Because the customer used CSI's Paladin Sentinel monitoring, what could have been an absolute disaster of huge proportions was a non-event. The 12,000 users came in on Monday to an uneventful morning, blissfully ignorant of how bad things were over the weekend and how close they came to having a very, very bad and stressful week.
Whether you have 50,000 users or 15, that is what CSI does every day for all our customers, big and small.
To find out more about how CSI can help you, contact us.