DougieLawson wrote:
Heater wrote:
Now they want to blame some guy for pulling the plug and plugging it back in incorrectly.
There's a somewhat rude word that starts with a "B", ends with a "t" and has "ullshi" in the middle to describe Willie Walsh's attempt to keep his C-level job and to avoid IAG/BA having to pay a €600 penalty to every passenger (into or out of the EU) who was affected by their critical failure.
Every major data centre has a UPS, and every major computing system has a hot standby (or your money-saving efforts are cutting off noses to spite faces). The problem is more often the mass of "stuff" between the data centre and the worker typing on their screen at the airport check-in or baggage drop, which may have to re-establish its network connection to the data centre on failover.
If one lone contractor can cause a critical failure then the problem lies in their hardware planning, their data centre access, their "four-eyes" buddy checking and all of that stuff that should take place to ensure reliability and continuity. That again becomes Willie Walsh's problem if the IAG/BA processes are not fit for purpose.
I completely agree. All critical systems are supposed to be designed to avoid a single point of failure. I can only see two scenarios in which you would lose everything:
Inadequate design.
Deliberate sabotage.
Data centers that I have worked in have Power Distribution Units (PDUs) powering the equipment in each equipment room. (Each piece of equipment would only be fed by one PDU.)
The PDU has two power feeds, mains and UPS.
In the event of mains failure the PDU switches seamlessly to UPS until the generator systems have kicked in and the mains circuit is back up.
Once the incoming power supply is restored, the load can be manually switched from the generator back to mains.
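To make the switching order concrete, here is a minimal Python sketch of the behaviour described above. It assumes the generator backs up the mains feed, so the PDU only ever chooses between its mains input and its UPS input. The function name and the returned labels are made up for illustration; a real PDU or transfer switch does this in hardware, not in software like this.

```python
def pdu_source(utility_ok: bool, generator_running: bool, ups_ok: bool) -> str:
    """Return which PDU input carries the load (illustrative model only)."""
    # Assumption: the generator feeds the mains circuit, so the mains input
    # is live if either the utility supply or the generator is up.
    mains_feed_live = utility_ok or generator_running
    if mains_feed_live:
        return "mains"
    if ups_ok:
        # The UPS bridges the gap while the generator starts.
        return "ups"
    return "none"


# Walk through a utility outage: normal, outage, generator up, utility back.
for state in [(True, False, True), (False, False, True),
              (False, True, True), (True, False, True)]:
    print(state, "->", pdu_source(*state))
```

The case that matters for this thread is the last branch: if the UPS is unavailable at the moment the mains input drops, nothing is left to carry the load.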
The data center would have 2-3 UPS systems, two incoming mains power supplies, and redundant generators (usually 1-2 more generators than required for full building load).
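The "1-2 more than required" rule is simple arithmetic, sketched below in Python. The load and generator ratings are hypothetical figures chosen only to show the calculation, and the function name is my own.

```python
import math


def generators_to_install(building_load_kw: float, generator_kw: float,
                          spares: int = 1) -> int:
    """Generators needed for full building load, plus spare units."""
    needed = math.ceil(building_load_kw / generator_kw)
    return needed + spares


# Hypothetical site: 2400 kW full load, 1000 kW per generator.
print(generators_to_install(2400, 1000, spares=1))  # 3 needed + 1 spare = 4
print(generators_to_install(2400, 1000, spares=2))  # 3 needed + 2 spares = 5
```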
If the equipment is correctly set up, I would expect it to be spread out between multiple data centers. Failing that, it would be split between multiple equipment rooms within the same data center. I would not expect all of the equipment to be in a single equipment room, on a single PDU or on a single UPS system.
If all of the equipment (or enough of it to cause a single point of failure) was in one room then the EPO (emergency power off, the Big Red Button near the door) could take it all down in one hit. This fits the scenario described (contractor turned it off then quickly turned it back on again).
If all of the equipment was on a single UPS system then a power fluctuation while that UPS was in bypass mode (for servicing) would have a similar effect.
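Both scenarios come down to the same check: is there any room, PDU or UPS that every piece of equipment depends on? A rough Python sketch of that check follows. The server, room, PDU and UPS names are invented for illustration and say nothing about how BA's equipment was actually laid out.

```python
def single_points_of_failure(dependencies: dict[str, set[str]]) -> set[str]:
    """Return the components that every piece of equipment depends on."""
    sets = [set(deps) for deps in dependencies.values()]
    return set.intersection(*sets) if sets else set()


# Hypothetical layout 1: separate PDUs, but one room and one UPS.
shared = {
    "server-a": {"room-1", "pdu-1", "ups-1"},
    "server-b": {"room-1", "pdu-2", "ups-1"},
}
# Hypothetical layout 2: nothing in common between the two servers.
split = {
    "server-a": {"room-1", "pdu-1", "ups-1"},
    "server-b": {"room-2", "pdu-3", "ups-2"},
}

print(sorted(single_points_of_failure(shared)))  # ['room-1', 'ups-1']
print(sorted(single_points_of_failure(split)))   # []
```

In the first layout the EPO in room-1 or a glitch on ups-1 takes out both servers, which is exactly the kind of design I would not expect to see.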
Doug.
Building Management Systems Engineer.