This morning, a Fibre Channel adapter went flaky in one of the GS160s that now support the production cluster. By the time the system had finished dumping memory and I had diagnosed the cause of the crash, it was too late to replace the card, since doing so requires shutting down the other Galaxy instances on this GS160.
This is the downside of consolidation. In exchange for being able to dynamically reassign CPUs, and for the speed that shared memory brings to cluster communications, you cannot correct a hardware issue without shutting down the entire hard partition, which in this case is the entire machine.
However, with the increased memory and CPU that the machines supporting production now have, our end users haven't even noticed that the machine is down.
Posted at August 24, 2005 1:40 PM