If your server screams in the middle of the night and no one listens, does it matter?

I wanted to share one of the most vivid examples of why CSI's Paladin Sentinel approach to remote monitoring of your networks is superior to other solutions.

One of the public school districts using our Paladin Sentinel Monitoring service had its very expensive SAN suddenly throw off a large stream of very, very scary alerts at 10:15 pm one night, all within the span of a few minutes.  These messages effectively said that the SAN volumes and drives had been totally destroyed!  If true, the school district was completely wiped out and would have to implement its disaster recovery plan.

Suddenly the messages stopped and the SAN said it was happy again.  In fact, one of our engineers was on it an hour later working on an unrelated project and had no issues.

At around 6:30 the next morning, we were reviewing the overnight automated alerts and developing a plan for any issues we felt were outstanding.  These horrible alerts had cleared.  The automated system did its job.  Nothing was presently wrong with the SAN.  However, that honestly wasn't good enough for me.  What I saw doesn't simply go away.  The school district should have been completely dead.

I remoted into the district.  It was fine.  I remoted into the SAN.  It was fine.  There were no errors and no warnings directly visible in the console to suggest that something bad had occurred.  However, when I went to the historical event log, it exactly mirrored what Paladin Sentinel had told us overnight.  Something really bad and really scary had occurred, and then it mysteriously went away without explanation.

I picked up the phone and called the vendor.  They were as scared as I was.  What they saw was catastrophic.  Things like that don't self-heal.  Diagnostics were fine.  I heard a lot of commentary like, "we have never seen this happen before" and "you should be really dead right now".  The case was escalated to the next level and then, finally, directly to the engineering team.  The engineering team came back and said, "you should be dead right now" and that a firmware upgrade was required to keep the issue from repeating itself.  They said a feature added in a previous version of the SAN firmware "actually worked", which surprised them because, in their words, "it almost never works".  They were obviously scared and adamant that this work needed to be done ASAP.

We have successfully completed the vendor's recommended firmware upgrades.

If the school district had no monitoring solution at all, they would never have known that the entire district was precariously teetering on the edge of an electronic catastrophe.  And if they had a purely automated monitoring solution, they (best case) might have a self-healed alert on some report they would get in a month and probably never read, or (worst case) still wouldn't know about the impending disaster because their monitoring solution didn't speak to their SAN.

CSI's Paladin Sentinel Monitoring actually saw the pattern, and our 37 years of experience working with these technologies told us that even though everyone and everything said it was fine, it wasn't.

How do you know what you don't know?  If this real scenario happened to you, do you have a plan?  If not, contact CSI about our Paladin Sentinel Monitoring.  Free trials are available.