
What happened to British Airways computers?

Posted: Mon May 29, 2017 1:22 am
by ab1jx
The BBC app just says it was some kind of a global power failure, which is none too technical and doesn't make much sense. 1000 or so flights grounded over a couple days, and I gather it wasn't even caused by a Windows problem.

Now the BBC is saying a third day, but how could they be so crippled? http://www.bbc.com/news/uk-40081112

There used to be a Usenet newsgroup about the risks of adopting computers; this is the sort of stuff they covered. I don't have Usenet access any more.

Re: What happened to British Airways computers?

Posted: Mon May 29, 2017 8:20 am
by bensimmo
But three days out would be no time, and probably no money, compared to what they have gained in throughput from using computers.

Bank holiday weekend and probably one of the many grumpy striking style workers switched a plug off ;-)

Re: What happened to British Airways computers?

Posted: Mon May 29, 2017 10:10 am
by Ernst
Nothing special, you will see this more often in the future.
I have read the BBC article and I can fully understand why things like this happen: it is a new trend in the IT world to cut costs and offshore to "lower" cost countries while at the same time releasing experienced IT experts into retirement.

Re: What happened to British Airways computers?

Posted: Mon May 29, 2017 10:21 am
by B.Goode
What happened to British Airways computers?
We don't know. The people who might know are either not making it public, or maybe don't fully understand yet.

In the absence of informed reliable statements about the underlying cause anything else has to be regarded as conjecture (politely) or as "Fake News".

Re: What happened to British Airways computers?

Posted: Mon May 29, 2017 7:14 pm
by hippy
B.Goode wrote:In the absence of informed reliable statements about the underlying cause anything else has to be regarded as conjecture (politely) or as "Fake News".
Indeed. BA CEO Alex Cruz is saying there was a brief power surge, and a backup system which did not kick in at the time but was restored later. But there has been no explanation as to how that evolved into the outcome witnessed so far.

http://www.bbc.co.uk/news/uk-40083778

Re: What happened to British Airways computers?

Posted: Mon May 29, 2017 8:26 pm
by DavidS
A better question is what the customers were thinking. The numbers presented show an order of magnitude more affected customers in one day than should be served by all airlines worldwide combined in a single day (based on a rough estimate of a world population of 12,000,000,000). So what were people thinking to cause that many people to be taking commercial flights at that time?

Re: What happened to British Airways computers?

Posted: Mon May 29, 2017 9:00 pm
by Heater
12,000,000,000? Where did you get that from?

The world population is ridiculously huge but I think we are only up to seven and a half billion.

In the states it's a long weekend holiday so I guess that accounts for a lot of extra traffic.

Re: What happened to British Airways computers?

Posted: Mon May 29, 2017 10:06 pm
by drgeoff
Heater wrote: In the states it's a long weekend holiday so I guess that accounts for a lot of extra traffic.
And in the UK. The replacement for Whit Monday. https://en.wikipedia.org/wiki/Whitsun

Re: What happened to British Airways computers?

Posted: Mon May 29, 2017 11:47 pm
by mahjongg
British Airways, like most sensible airways, probably uses a mainframe computer. They certainly won't use a low end consumer OS like Windows for a mission critical system (I hope).
But even a mainframe can develop a glitch.

Re: What happened to British Airways computers?

Posted: Tue May 30, 2017 3:34 am
by peterlite
If it is a centralised system and breaks, recovery is not automatic and requires the recovery of everything. If their system ran on 500,000 Pi 3Bs, they would be recovering only one and all the other transactions/flights would be unaffected.

Recovery is a b*tch. This is where you find out all the changes made after the last test of the recovery procedure. Your IBM mainframe recovery procedure, "Press button B", assumes:
* You are using staff in London, not Pune, India.
* An IBM mainframe in London, not Lenovo servers in the Diaoyu islands.
* Z390, not DragonflyBSD, which is only half implemented as part of a conversion from MirOS.
* Your backups will restore despite using a new encryption system that is not yet tested through decryption.
* The decryption passwords will be available when that IT guy is out of a coma from that thing with a bus.
* You can always revert to the tape backups made on the tape drives you dumped in the trash.
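
The common thread in the list above is that a recovery procedure is only as good as its last real test. A rough Python sketch of a restore drill that round-trips a backup and compares checksums. The function names are made up for illustration, and zlib compression is only a stand-in for whatever backup and encryption tooling is really in use:

```python
import hashlib
import zlib


def make_backup(data: bytes) -> bytes:
    """Hypothetical backup step; zlib stands in for the real backup/encryption tooling."""
    return zlib.compress(data)


def restore_backup(blob: bytes) -> bytes:
    """Hypothetical restore step; must be the exact inverse of make_backup."""
    return zlib.decompress(blob)


def restore_drill(data: bytes) -> bool:
    """Return True only if a restored backup matches the original byte for byte."""
    original_digest = hashlib.sha256(data).hexdigest()
    try:
        restored = restore_backup(make_backup(data))
    except zlib.error:
        return False  # a backup that cannot be read back is not a backup
    return hashlib.sha256(restored).hexdigest() == original_digest
```

Calling restore_drill(b"booking records") exercises the whole write-and-read-back path, which is the part that tends to rot between tests.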

Re: What happened to British Airways computers?

Posted: Tue May 30, 2017 4:29 am
by DougieLawson
mahjongg wrote:British Airways, like most sensible airways, probably uses a mainframe computer. They certainly won't use a low end consumer OS like Windows for a mission critical system (I hope).
But even a mainframe can develop a glitch.
They used to use a mainframe. It's not clear if they've moved off that to X86 blades running Windows Server.

Mainframes tend to have uninterruptible power supplies that get tested. They also tend to have disaster recovery hot standby systems that get tested.

Re: What happened to British Airways computers?

Posted: Tue May 30, 2017 4:39 am
by DougieLawson
peterlite wrote: * Z390, not DragonflyBSD, which is only half implemented as part of a conversion from MirOS.
The zSeries operating systems are z/OS, z/VM, z/VSE, zLinux or zTPF.

No such thing as z390. No such hardware either. Current top end zSeries mainframe is a z13.

There's no more IBM Blue paint either.
Image

Re: What happened to British Airways computers?

Posted: Tue May 30, 2017 7:24 am
by bensimmo
drgeoff wrote:
Heater wrote: In the states it's a long weekend holiday so I guess that accounts for a lot of extra traffic.
And in the UK. The replacement for Whit Monday. https://en.wikipedia.org/wiki/Whitsun
Half term too, start of it for many areas of the country. So lots of families heading off on holiday all at once.

Re: What happened to British Airways computers?

Posted: Tue May 30, 2017 10:11 am
by RaTTuS

Re: What happened to British Airways computers?

Posted: Mon Jun 05, 2017 10:00 pm
by ab1jx
Well, I was shocked. I've known about database replication for 15 years or so, so it was hard to understand why flights all over the world were being affected. A power surge? Worldwide? Over leased lines or something? I mean this is the era of Bitcoin and huge blockchains that there are many copies of. Their route map shows most flights going through Heathrow but I thought their computers would be more distributed.
[Attachment: ARM-BA.gif (55.5 KiB)]
Lately these seem to have surfaced but I gather there's still some mystery involved and investigations are underway.
http://www.bbc.com/news/technology-40118386
http://www.bbc.com/news/business-40159202

Re: What happened to British Airways computers?

Posted: Tue Jun 06, 2017 6:53 am
by peterlite
Maybe they used PoHTTP to power their network. Someone unplugged the TP-Link router in head office. They no longer had packets of power going out to other devices. System down...

Someone should have told them about the Pi Zero and how Zero stands for Zero electricity use. :ugeek:

Re: What happened to British Airways computers?

Posted: Tue Jun 06, 2017 7:07 am
by RaTTuS
https://www.theregister.co.uk/2017/06/0 ... _analysis/
has a bit more info -
however, er, people error was the main cause
switch it off then back on again ... not good this time

Re: What happened to British Airways computers?

Posted: Tue Jun 06, 2017 7:24 am
by Heater
Now they want to blame some guy for pulling the plug, and plugging it back incorrectly.

I don't buy it.

A human error like that is no different than some random hardware failure.

There is no way bringing down one part of your distributed system should cause total failure for days.

Oh, they did not have a distributed system.... well, that's not the guy's fault now is it.

People like Facebook yank power on their data centers at random all the time. Just to see that everything keeps humming nicely.
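
A toy Python sketch of that kind of drill: pick one site at random, pull it, and check that something is still left to serve. The site names and the function are made up for illustration, and this is not a description of how Facebook or anyone else actually does it:

```python
import random
from typing import Dict, Optional, Tuple


def power_pull_drill(
    sites: Dict[str, bool], rng: Optional[random.Random] = None
) -> Tuple[str, bool]:
    """Pull the plug on one random site and report whether any healthy site remains.

    sites maps a (hypothetical) site name to True if it is currently healthy.
    """
    rng = rng or random.Random()
    victim = rng.choice(sorted(sites))  # sorted so a seeded rng gives repeatable picks
    survivors = [name for name, up in sites.items() if up and name != victim]
    return victim, bool(survivors)
```

With two healthy sites the drill always leaves one standing; with a single site it never does, which is the whole point of running it before a contractor runs it for you.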

Reminds me of the time I was working on the team testing the fly-by-wire Primary Flight Computers of the Boeing 777. Before the first flight of the 777 the test pilot climbed into the plane, yanked out all the circuit breakers and then restarted all the systems. Half of them did not come up again. Well, he was not flying that machine anywhere til that issue was resolved.

Re: What happened to British Airways computers?

Posted: Tue Jun 06, 2017 7:40 am
by RaTTuS
yes, you cannot blame someone plugging it in wrong for your backup system not working ...

Re: What happened to British Airways computers?

Posted: Tue Jun 06, 2017 12:05 pm
by S0litaire
this is an apt comic..
Image

Re: What happened to British Airways computers?

Posted: Thu Jun 08, 2017 6:12 am
by DougieLawson
Heater wrote:Now they want to blame some guy for pulling the plug, and plugging it back incorrectly.
There's a somewhat rude word that starts with a "B", ends with a "t" and has "ullshi" in the middle to describe Willie Walsh's attempt to keep his C-level job and to avoid IAG/BA having to pay a €600 penalty to every passenger (into or out of the EU) that was affected by their critical failure.

Every major data centre has UPS, every major computing system has hot-standby (or your money saving efforts are cutting off noses to spite faces). The problem is more often the mass of "stuff" between the data centre and the worker typing stuff on their screen at the airport check-in or baggage drop that may have to re-establish its network connection to the data centre on failover.

If one lone contractor can cause a critical failure then the problem lies in their hardware planning, their data centre access, their "four-eyes" buddy checking and all of that stuff that should take place to ensure reliability and continuity. That again becomes Willie Walsh's problem if the IAG/BA processes are not fit for purpose.
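
For the "stuff in between" that has to re-establish its connection on failover, a rough Python sketch of a client that walks a list of endpoints until one answers. The endpoint names and the connect callable are placeholders made up for the example, not anything known about BA's check-in systems:

```python
from typing import Callable, Optional, Sequence, Tuple


def connect_with_failover(
    endpoints: Sequence[str],
    connect: Callable[[str], object],
    rounds: int = 3,
) -> Tuple[str, object]:
    """Try each endpoint in order, for several rounds, and return the first that connects."""
    last_error: Optional[Exception] = None
    for _ in range(rounds):
        for endpoint in endpoints:
            try:
                return endpoint, connect(endpoint)
            except ConnectionError as exc:
                last_error = exc  # remember the failure and move on to the next endpoint
    raise ConnectionError(f"no endpoint reachable after {rounds} rounds") from last_error
```

A real client would also want a delay between rounds; it is left out here to keep the sketch short.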

Re: What happened to British Airways computers?

Posted: Thu Jun 08, 2017 8:09 am
by BMS Doug
DougieLawson wrote:
Heater wrote:Now they want to blame some guy for pulling the plug, and plugging it back incorrectly.
There's a somewhat rude word that starts with a "B", ends with a "t" and has "ullshi" in the middle to describe Willie Walsh's attempt to keep his C-level job and to avoid IAG/BA having to pay a €600 penalty to every passenger (into or out of the EU) that was affected by their critical failure.

Every major data centre has UPS, every major computing system has hot-standby (or your money saving efforts are cutting off noses to spite faces). The problem is more often the mass of "stuff" between the data centre and the worker typing stuff on their screen at the airport check-in or baggage drop that may have to re-establish its network connection to the data centre on failover.

If one lone contractor can cause a critical failure then the problem lies in their hardware planning, their data centre access, their "four-eyes" buddy checking and all of that stuff that should take place to ensure reliability and continuity. That again becomes Willie Walsh's problem if the IAG/BA processes are not fit for purpose.
I completely agree. All critical systems are supposed to be designed to avoid a single point of failure, so I can only see two scenarios in which you would lose everything:
Inadequate design.
Deliberate sabotage.

Data centers that I have worked in have Power Distribution Units (PDU) powering the equipment in each Equipment room. (Each piece of equipment would only be fed by one PDU).
The PDU has two power feeds, mains and UPS.
In the event of mains failure the PDU switches seamlessly to UPS until the generator systems have kicked in and the mains circuit is back up.
Once the incoming power supply is restored the generator can be manually switched back to mains.

The Data center would have 2-3 UPS systems, two incoming mains power supplies, and redundant generators (usually 1-2 more generators than required for full building load).
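
The switching described above can be written down as a tiny Python sketch. It is a simplification under my own assumptions: real PDUs do this in hardware, the function name and the "dark" state are made up for the example, and the generator is modelled as simply feeding the mains circuit:

```python
def pdu_feed(mains_ok: bool, generator_running: bool, ups_charge: float) -> str:
    """Pick the feed for one PDU; returns 'mains', 'ups' or 'dark'."""
    if mains_ok or generator_running:
        return "mains"  # the generator feeds the mains circuit once it has started
    if ups_charge > 0:
        return "ups"  # the UPS bridges the gap until the generator is up
    return "dark"  # nothing left to feed the equipment


def simulate(timeline):
    """Run a list of (mains_ok, generator_running, ups_charge) steps through pdu_feed."""
    return [pdu_feed(*step) for step in timeline]
```

Feeding it a mains failure followed by a generator start gives mains, ups, mains. The "dark" case only appears if the UPS runs flat before the generator comes up.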

If the Equipment is correctly set up I would expect it to be spread out between multiple data centers, failing that it would be split between multiple equipment rooms within the same data center. I would not expect all of the equipment to be in a single equipment room, on a single PDU or on a single UPS system.

If all of the equipment (or enough of it to cause a single point failure) was in one room then the EPO (emergency power off, Big Red Button near the door) could take it all down in one hit. This fits the scenario described (contractor turned it off then quickly turned it back on again)

If all of the equipment was on a single UPS system then a power fluctuation while that UPS was in bypass mode (for servicing) would have a similar effect.
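
The check for those scenarios is simple enough to sketch in Python: list what each piece of equipment depends on and see whether anything is common to all of it. The equipment and component names here are invented for the example:

```python
from typing import Dict, List, Set


def single_points_of_failure(placement: Dict[str, Set[str]]) -> List[str]:
    """Return every component whose loss alone takes down all equipment.

    placement maps an equipment name to the set of components it depends on
    (room, PDU, UPS, ...). Losing any one dependency is assumed to take that
    equipment down.
    """
    if not placement:
        return []
    shared = set.intersection(*placement.values())
    return sorted(shared)
```

For example, {"db": {"room-1", "pdu-a", "ups-1"}, "app": {"room-1", "pdu-b", "ups-1"}} comes back as room-1 and ups-1: separate PDUs do not help if both boxes sit behind the same EPO button and the same UPS.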

Re: What happened to British Airways computers?

Posted: Thu Jun 08, 2017 8:48 am
by hippy
My suspicion is they had inadvertently created some kind of deadlock situation. Code which updates a database can be difficult to sort out and get running again if that code expects the initial database to exist when it doesn't.
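
A small Python sketch of the sort of thing I mean, purely as an illustration of the suspicion and not of anything known about BA's systems. The service names are invented; graphlib (Python 3.9+) reports when services are waiting on each other in a circle:

```python
from graphlib import CycleError, TopologicalSorter


def startup_order(depends_on):
    """Return a valid start order, or None if the services wait on each other in a cycle.

    depends_on maps each (hypothetical) service name to the services that must
    already be running before it can start.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError:
        return None  # nothing can start first: a cold-start deadlock
```

An updater that needs the database starts fine as long as the database needs nothing. If the database in turn needs the updater to have run once, there is no valid order and a cold start never gets going.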

Re: What happened to British Airways computers?

Posted: Thu Jun 08, 2017 6:26 pm
by DougieLawson
hippy wrote:My suspicion is they had inadvertently created some kind of deadlock situation. Code which updates a database can be difficult to sort out and get running again if that code expects the initial database to exist when it doesn't.
Not a chance of that. They're not disclosing the truth because it's going to be embarrassing for Willie Walsh (who wants to keep his bonus). It's bound to be due to incompetence and lack of planning (possibly off-shore) during recovery. Unless someone is willing to risk their job and disclose everything we'll never get round the mis-information and terminological inexactitudes that we've had so far.

Seeing the way part of an organisation I work for recovered from a significant power outage this week (albeit not in a data centre but in a very key installation) was amazing; there's an airline that could learn a lot from it. They had a robust plan, they had a call out list, they got the right folks to focus on solving their sections of the problem in an organised & structured way. The event was out of the blue, and the already built and tested recovery plan simply worked and worked well.

Re: What happened to British Airways computers?

Posted: Thu Jun 08, 2017 6:48 pm
by rpdom
I have seen some interesting power failures in datacentres - some of which the UPS and backup generators handled and some they didn't. Where I work now is much better in handling failures with multiple redundancy over several datacentres.