BMS Doug
Posts: 3824
Joined: Thu Mar 27, 2014 2:42 pm
Location: London, UK

Re: What happened to British Airways computers?

Fri Jun 09, 2017 7:24 am

Something to look out for: make sure that the EPOs (emergency power off buttons) in any room you work in are fitted with some kind of cover to prevent accidental operation. It's not good if one gets accidentally operated.
Doug.
Building Management Systems Engineer.

rpdom
Posts: 15604
Joined: Sun May 06, 2012 5:17 am
Location: Chelmsford, Essex, UK

Re: What happened to British Airways computers?

Fri Jun 09, 2017 1:58 pm

BMS Doug wrote:Something to look out for: make sure that the EPOs in any room you work in are fitted with some kind of cover to prevent accidental operation. It's not good if one gets accidentally operated.
Yes, we had one that was next to the switch to open the door to get out of the main room. That caused a couple of issues until a cover was put over it.

rpdom
Posts: 15604
Joined: Sun May 06, 2012 5:17 am
Location: Chelmsford, Essex, UK

Re: What happened to British Airways computers?

Fri Jun 09, 2017 2:12 pm

How about this scenario that I witnessed:

Imagine a medium sized datacentre. Most of the equipment is on a pair of UPSes driven by a room full of batteries. There are also backup generators outside. They cut in after the UPS has been running for more than 5 minutes (although we had enough batteries for several hours). If power came back the generators didn't shut down straight away, they kept running for 30 minutes in case power went again.

What actually happened: Power failed. UPS cut in. Generators started. Power came back on. Generator 30 minute timer started. Power went off again. Generator timer kept running. Generators shut down...

Fortunately the UPS cut back in when the generators stopped and everything kept running until power came back properly. Generator software was shortly upgraded.

gordon77
Posts: 4310
Joined: Sun Aug 05, 2012 3:12 pm

Re: What happened to British Airways computers?

Fri Jun 09, 2017 2:28 pm

Nobody did a proper acceptance test when it was installed and noticed this snag, I assume.
rpdom wrote:How about this scenario that I witnessed:

Imagine a medium sized datacentre. Most of the equipment is on a pair of UPSes driven by a room full of batteries. There are also backup generators outside. They cut in after the UPS has been running for more than 5 minutes (although we had enough batteries for several hours). If power came back the generators didn't shut down straight away, they kept running for 30 minutes in case power went again.

What actually happened: Power failed. UPS cut in. Generators started. Power came back on. Generator 30 minute timer started. Power went off again. Generator timer kept running. Generators shut down...

Fortunately the UPS cut back in when the generators stopped and everything kept running until power came back properly. Generator software was shortly upgraded.

BMS Doug
Posts: 3824
Joined: Thu Mar 27, 2014 2:42 pm
Location: London, UK

Re: What happened to British Airways computers?

Fri Jun 09, 2017 3:11 pm

rpdom wrote: What actually happened: Power failed. UPS cut in. Generators started. Power came back on. Generator 30 minute timer started. Power went off again. Generator timer kept running. Generators shut down...

Fortunately the UPS cut back in when the generators stopped and everything kept running until power came back properly. Generator software was shortly upgraded.
Ouch. As gordon77 noted, that's the price of not testing.

I've been somewhere where they over-tested: 15,000 load-failure scenarios were tested over a 3-month period. Equipment started failing due to the number of power interruptions...
Doug.
Building Management Systems Engineer.

rpdom
Posts: 15604
Joined: Sun May 06, 2012 5:17 am
Location: Chelmsford, Essex, UK

Re: What happened to British Airways computers?

Fri Jun 09, 2017 3:15 pm

gordon77 wrote:Nobody did a proper acceptance test when it was installed and noticed this snag, I assume.
I don't know, not my department, but I assume they tested that the UPS worked on power fail, that the generators cut in on time and that the generators cut back out after 30 minutes of the return of stable power. I guess no one thought about making the timer stop if power failed again.

There were other occasions where things did go wrong, like the time there was building work going on in the UPS/battery room and the builders carefully wrapped up the UPS so no debris could get inside it. Wrapped it very tightly, including the EPO button... Management response was "Why is the EPO button so accessible? Can't we put a cover over it?" Answer was "No, because it is easy to use (in that locked room) for a reason. Someone could easily get killed".

TudorJ
Posts: 44
Joined: Thu Jul 09, 2015 8:41 pm
Location: Cockfosters

Re: What happened to British Airways computers?

Fri Jun 09, 2017 3:30 pm

There is an old saying:
The worst problems happen when fail safe systems fail to fail safe.

buja
Posts: 507
Joined: Wed Dec 31, 2014 8:21 am
Location: Netherlands

Re: What happened to British Airways computers?

Fri Jun 09, 2017 4:04 pm

gordon77 wrote:Nobody did a proper acceptance test when it was installed and noticed this snag, I assume.
Was this case recognized during specification? If yes, then it should have been caught during acceptance testing.
If no, the specification failed, and subsequently all other phases did not include it.

Other, simpler points of failure: you have a UPS, but the batteries are dead because of lack of maintenance. The generators run out of fuel, or the fuel lines are blocked (a serious problem with biofuel).

rpdom
Posts: 15604
Joined: Sun May 06, 2012 5:17 am
Location: Chelmsford, Essex, UK

Re: What happened to British Airways computers?

Fri Jun 09, 2017 5:10 pm

buja wrote:
gordon77 wrote:Nobody did a proper acceptance test when it was installed and noticed this snag, I assume.
Was this case recognized during specification? If yes, then it should have been caught during acceptance testing.
If no, the specification failed, and subsequently all other phases did not include it.

Other, simpler points of failure: you have a UPS, but the batteries are dead because of lack of maintenance. The generators run out of fuel, or the fuel lines are blocked (a serious problem with biofuel).
One other incident that happened at another datacentre was where everything suddenly went dead. People asked "Why didn't the UPS cut in?". Upon investigation it was found that the UPS did cut in - 24 hours earlier, but the alerts that the UPS was running didn't sound or notify anyone. The whole site had been running on batteries for 24 hours until they failed (no generators allowed at that site).

Heater
Posts: 13924
Joined: Tue Jul 17, 2012 3:02 pm

Re: What happened to British Airways computers?

Fri Jun 09, 2017 6:58 pm

When I worked for Marconi Communications in Portsmouth they had a big generator in a big building of its own that could supply the entire site in the event of loss of grid power.

One day something exploded in there and one of its steel chimneys took off like a rocket.

The grid was up so no worries.
Memory in C++ is a leaky abstraction .

BMS Doug
Posts: 3824
Joined: Thu Mar 27, 2014 2:42 pm
Location: London, UK

Re: What happened to British Airways computers?

Fri Jun 09, 2017 8:59 pm

(Long ago) One building I was working in fell over when the load-shedding PLC lost its memory. Of course the UPS kept the CER rooms up long enough for a controlled shutdown, but the building had to shut down for a couple of hours.

Once the PLC had been reloaded with the correct program, we examined the revised setup for any remaining single-point-of-failure issues. The PLC engineer assured us that there weren't any. We thought of a possible risk; he assured us that it wasn't one, reached out to test it to prove his point, and was very surprised at our panicked yells stopping him (just in time).
We arranged a time with the tenant to test the scenario, and the load-shedding PLC tripped out the building again. A very close-run thing.
Doug.
Building Management Systems Engineer.

Pi-holeDevDan
Posts: 5
Joined: Wed May 03, 2017 11:52 pm

Re: What happened to British Airways computers?

Fri Jun 09, 2017 9:09 pm

Or fun with users:

UPS installed at every workstation, server and VOIP handset. Power fails, and workstations/servers crash hard. (Takes out a few.)

Questions abound: why did everything fail? The answer: during an earlier, smaller power outage, a worker there didn't like hearing constant beeps coming from some black boxes, so he turned them all off.

DougieLawson
Posts: 36578
Joined: Sun Jun 16, 2013 11:19 pm
Location: Basingstoke, UK

Re: What happened to British Airways computers?

Fri Jun 09, 2017 9:27 pm

BMS Doug wrote:Something to look out for: make sure that the EPOs in any room you work in are fitted with some kind of cover to prevent accidental operation. It's not good if one gets accidentally operated.
That happened when I worked for a bank. There were two machine rooms (2nd floor & 3rd floor) with linked EPOs. The 3rd floor had the mainframe. The 2nd floor had the 3890 Cheque Reader/Sorters. One of the reader/sorter operators threw a chair at a friend (as you do) and that hit their EPO. The big problem was that it made the whole building go quiet. They disconnected the two floors' EPOs after that and added lift-before-hitting covers.
Note: Having anything humorous in your signature is completely banned on this forum. Wear a tin-foil hat and you'll get a ban.

Any DMs sent on Twitter will be answered next month.

This is a doctor free zone.

mikerr
Posts: 2789
Joined: Thu Jan 12, 2012 12:46 pm
Location: UK

Re: What happened to British Airways computers?

Fri Jun 09, 2017 10:04 pm

This (admittedly random anonymous comment) seems most likely to me:
BA has a DR site independent of the primary that suffered the power issue. But volume groups were not being mirrored correctly to the DR site. When they brought the DR site online, they were getting 3 or more destinations when scanning boarding passes. And since the integrity of the DR site was an issue, it could not be used.

Then the only option is to fix the primary DC, which would have involved installing new servers / routers / switches / etc, configuring them, restoring the data to the last known good state and then bringing it back online. Good luck to anyone trying to deploy new/replacement equipment en masse during the chaos of a disaster. And then restoring data!

Takes days, not hours... unlike whatever RTO/RPO they claimed to be able to meet.
https://hardware.slashdot.org/story/17/ ... er-problem
Android app - Raspi Card Imager - download and image SD cards - No PC required !

W. H. Heydt
Posts: 11111
Joined: Fri Mar 09, 2012 7:36 pm
Location: Vallejo, CA (US)

Re: What happened to British Airways computers?

Sat Jun 10, 2017 2:16 am

rpdom wrote: I don't know, not my department, but I assume they tested that the UPS worked on power fail, that the generators cut in on time and that the generators cut back out after 30 minutes of the return of stable power. I guess no one thought about making the timer stop if power failed again.
Seems to me that, if the power goes out again during the 30 minute stabilization run, the timer should reset to 30 minutes when it comes back on.
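
A minimal sketch of that timer logic in Python, assuming a hypothetical controller with made-up event names (this is not the actual generator software, just an illustration of the rule: a second outage cancels the pending shutdown, and every return of mains restarts the full 30 minutes):

```python
from dataclasses import dataclass
from typing import Optional

RUN_ON_SECONDS = 30 * 60  # the 30-minute stabilisation run described above


@dataclass
class GeneratorController:
    """Hypothetical controller; class and method names are illustrative only."""
    running: bool = False
    mains_ok: bool = True
    run_on_deadline: Optional[float] = None  # earliest time the generators may stop

    def start(self) -> None:
        # Generators cut in; no shutdown is pending while mains is down.
        self.running = True
        self.run_on_deadline = None

    def on_mains_lost(self) -> None:
        self.mains_ok = False
        # The fix: a new outage cancels any shutdown that was counting down.
        self.run_on_deadline = None

    def on_mains_restored(self, now: float) -> None:
        self.mains_ok = True
        if self.running:
            # Restart the full run-on period each time mains comes back.
            self.run_on_deadline = now + RUN_ON_SECONDS

    def tick(self, now: float) -> None:
        # Stop only when mains is healthy AND the run-on period has elapsed.
        if (self.running and self.mains_ok
                and self.run_on_deadline is not None
                and now >= self.run_on_deadline):
            self.running = False
            self.run_on_deadline = None
```

With this arrangement the sequence from the story (power back, power gone again, original timer expiring) leaves the generators running, because the deadline was cleared when mains failed the second time.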

W. H. Heydt
Posts: 11111
Joined: Fri Mar 09, 2012 7:36 pm
Location: Vallejo, CA (US)

Re: What happened to British Airways computers?

Sat Jun 10, 2017 2:36 am

A couple of things that happened at one company I worked at...

Our IT shop was on the 13th and 14th floors of One Embarcadero Ctr. in San Francisco, a 45 story building. About half of the 14th was the machine room and K/P section. The mainframes were water cooled IBM.

Incident 1: During construction of a new building across the street, while putting in tie backs for the foundation excavation, a drill operator hit a 16" gas main. There was no Earth Shattering Kaboom! but gas was sucked into the building A/C air intake and we had an explosive concentration in the machine room. Other related problems were from spraying our work spaces to neutralize the odorants added to the gas...which smelled even worse than the stuff in the gas. Residual oils from the pumps got into the lines and the oils were contaminated with PCBs. Part of the environmental clean up was to do a hazmat scrub down of several buildings in the area including the one we were in.

Incident 2: The cooling water for the mainframe ran off the building HVAC. On nights and weekends, when the main HVAC was shut down, there was an auxiliary chiller on the roof, plus a 10,000 gallon tank to hold the extra water. In our machine room, there was a water distribution unit, fed by a 1.5 inch copper line, with fittings soldered (per spec) 1.25 inches deep. The return drain lines were above the acoustical tile ceiling on the 14th floor--so they had drilled holes through the floor on the 14th. One Thursday evening, a joint on the feed line separated (it was a manufacturing defect). No one had ever told the operators where the shutoff valves were (they were under the machine room raised floor...and were marked afterwards). The net result was to drain the 10K gallon tank (*30* floors of pressure head...figure about 12 atm, or about 180 psi). There was water damage for 5 floors below us. Some of the water ran down a stairwell and got into a main power duct and blew out a two story high segment of bus bar (power was a right half/left half arrangement, so that bar could handle half of the power for a 45 story office building). A new bar segment was airfreighted in from the closest location one could be found: Chicago. We were down for 4 days.
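
For anyone wanting to check the pressure-head figure, here is a rough sketch of the arithmetic in Python. The 4 m storey height is an assumption (not something measured for that building), and the function name is made up for illustration. It uses the usual hydrostatic relation, pressure = density x g x height, and gives gauge pressure at the bottom of the column:

```python
RHO_WATER = 1000.0       # kg/m^3, density of water
G = 9.81                 # m/s^2, standard gravity
PA_PER_ATM = 101_325.0   # pascals per standard atmosphere
PA_PER_PSI = 6_894.757   # pascals per pound-force per square inch


def head_pressure(floors: int, floor_height_m: float = 4.0) -> tuple[float, float]:
    """Return (atm, psi) of static head for a water column `floors` storeys tall.

    floor_height_m is an assumed storey height, not a known value.
    """
    height_m = floors * floor_height_m
    pascals = RHO_WATER * G * height_m
    return pascals / PA_PER_ATM, pascals / PA_PER_PSI


atm, psi = head_pressure(30)
print(f"{atm:.1f} atm, {psi:.0f} psi")
```

With the assumed 4 m per floor this lands in the same ballpark as the 12 atm / 180 psi estimate; a taller storey height pushes it up accordingly.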

jardino
Posts: 129
Joined: Wed Aug 08, 2012 9:03 am
Location: Aberdeenshire, Scotland

Re: What happened to British Airways computers?

Sat Jun 10, 2017 7:45 am

Not strictly a fail-over story, but here goes:

I was once designing a multi-computer system for the Port of Rotterdam in the early 80s. The system was supposed to run 24/7 (it monitored shipping in the Port).

At the time, we had to take the central database down for an hour every week. At first the client grumbled, but then said we could do it around 1 o'clock on Sunday mornings, when nothing moved in the port - the sailors all being onshore in the bars and other places of ill repute.

Later, I was assigned to design a similar high-uptime system for a Police force in the north of England. Having the same database problem, I suggested the same time of the week for taking the database down. The senior Police officer who was our client just stared at me for a minute then said, "You must be <expletive deleted> daft. That's our busiest time of the week!" (Sailors coming out of bars, etc.)

Alan.
IT Background: Honeywell H2000 ... CA Naked Mini ... Sinclair QL ... WinTel ... Linux ... Raspberry Pi.

Heater
Posts: 13924
Joined: Tue Jul 17, 2012 3:02 pm

Re: What happened to British Airways computers?

Sat Jun 10, 2017 8:00 am

Keeping a system running reliably in the face of any kind of failure of its parts, be it software crashes, power outages, communication failures etc, is a non-trivial problem.

See the papers on Byzantine fault tolerance:
https://en.wikipedia.org/wiki/Byzantine_fault_tolerance

Leslie Lamport showed that in order for a system to reliably survive one error in its parts, the system needs 4 independent nodes.

In general, being able to tolerate n faults at the same time requires 3n + 1 nodes.

https://www.microsoft.com/en-us/researc ... %2Fbyz.pdf
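
A small sketch of that arithmetic in Python, in case it helps. The function names are made up for illustration, and it assumes the classic setting where the 3n + 1 bound applies (arbitrary faults, unsigned messages):

```python
def min_nodes_for_byzantine_faults(n_faults: int) -> int:
    """Minimum node count needed to tolerate n_faults simultaneous Byzantine faults."""
    if n_faults < 0:
        raise ValueError("n_faults must be non-negative")
    return 3 * n_faults + 1


def max_byzantine_faults_tolerated(n_nodes: int) -> int:
    """Largest n such that 3n + 1 <= n_nodes (0 if there are too few nodes)."""
    return max((n_nodes - 1) // 3, 0)


for faults in range(1, 4):
    print(faults, "fault(s) ->", min_nodes_for_byzantine_faults(faults), "nodes")
```

So a triple-redundant system (3 nodes) tolerates zero faults of this kind under the bound, and you need 7 nodes before you can survive two at once.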

Not many people do that.

Heck, even the fly by wire system of the Boeing 777 does not meet that requirement.
Memory in C++ is a leaky abstraction .

Paul Webster
Posts: 812
Joined: Sat Jul 30, 2011 4:49 am
Location: London, UK

Re: What happened to British Airways computers?

Sat Jun 10, 2017 8:03 am

mikerr wrote:This (admittedly random anonymous comment) seems most likely to me:
BA has a DR site independent of the primary that suffered the power issue. But volume groups were not being mirrored correctly to the DR site. When they brought the DR site online, they were getting 3 or more destinations when scanning boarding passes. And since the integrity of the DR site was an issue, it could not be used.

Then the only option is to fix the primary DC, which would have involved installing new servers / routers / switches / etc, configuring them, restoring the data to the last known good state and then bringing it back online. Good luck to anyone trying to deploy new/replacement equipment en masse during the chaos of a disaster. And then restoring data!

Takes days, not hours... unlike whatever RTO/RPO they claimed to be able to meet.
https://hardware.slashdot.org/story/17/ ... er-problem
and a later anonymous statement says:
"The interesting part is that this part of the problem started happening on FRIDAY - around 18 hours BEFORE the total outage caused by the supposed power outage/surge."

sarahgad
Posts: 30
Joined: Fri Jan 20, 2017 12:07 pm

Re: What happened to British Airways computers?

Mon Jun 19, 2017 9:50 am

Heater wrote:Heck, even the fly by wire system of the Boeing 777 does not meet that requirement.

This comes as a surprise from someone big like Boeing.
