atmosteam
Posts: 68
Joined: Mon Jun 22, 2015 7:22 am

Redundancy

Tue Jun 23, 2015 4:32 pm

Hello,
I am designing a system which has a Raspberry Pi, but I want redundancy in case of failure. If possible, I want active redundancy, meaning that if one Raspberry Pi fails, the other one is turned on and runs the corresponding program by itself.
I'd like to know whether it is possible to implement something like this that works automatically.

I also want to know if it is possible to have the unused Raspberry Pi not powered off, but in a low-consumption idle mode, doing no work, and wake it up with an input signal.

Any suggestions?

kusti8
Posts: 3439
Joined: Sat Dec 21, 2013 5:29 pm
Location: USA

Re: Redundancy

Tue Jun 23, 2015 4:38 pm

The Raspberry Pi is low power by default; there is no ultra-low-power mode. Just write a script that checks whether the other Pi is on, by monitoring its power, pinging it or some other way, and have the backup Pi call your script if it can't get a response. If you use an A+ as the backup you can get even lower power consumption, but there is no low power mode (other than not doing anything), and it's not really worth switching a Pi on and off when it uses so little power anyway.
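A minimal sketch of that ping-and-takeover idea in Python (the primary's address, the retry threshold and the takeover script are all invented placeholders, adjust to your own setup):

Code:

#!/usr/bin/env python
# Backup Pi: ping the primary and take over if it stops answering.
import subprocess
import time

PRIMARY = "192.168.1.10"                         # placeholder address of the primary Pi
MISSED_LIMIT = 3                                 # failed pings tolerated before takeover
TAKEOVER_SCRIPT = "/home/pi/start_workload.sh"   # placeholder takeover script

def primary_alive():
    # One ICMP echo with a 2 second timeout; True if the primary answered.
    return subprocess.call(["ping", "-c", "1", "-W", "2", PRIMARY]) == 0

missed = 0
while True:
    missed = 0 if primary_alive() else missed + 1
    if missed >= MISSED_LIMIT:
        subprocess.call([TAKEOVER_SCRIPT])       # primary looks dead: take over
        break
    time.sleep(5)

Run something like this at boot on the backup Pi (a cron job or init script would do equally well).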
There are 10 types of people: those who understand binary and those who don't.

DougieLawson
Posts: 36578
Joined: Sun Jun 16, 2013 11:19 pm
Location: Basingstoke, UK
Contact: Website Twitter

Re: Redundancy

Tue Jun 23, 2015 4:46 pm

Raspberry Pis do not have any booted-but-low-power modes (unlike cell phones and tablets). They don't use much power when they're booted normally but sitting idle.

How do you plan to trigger your hot-standby? Are you looking at a hardware pin being held high until the primary dies, or at a software heartbeat arriving across a permanently active network, with a cut-over when the heartbeat goes missing (for long enough)? If you use a network heartbeat, what happens when the network dies but the primary is still active? How do you cater for that, and how do you resynchronise and re-establish primary and secondary when the network problem ends?

How long can the primary system be inactive before the secondary must be alive and well and taking over the workload?
How do you plan to synchronise things? Where are you putting the data? How are you going to replicate it from primary to secondary?

After a hot-standby takeover, how do you get back to a normal configuration?

What happens if the primary fails due to a software bug and the hot-standby fails during takeover due to the same bug? Don't assume you're immune to that, it affected the best of breed on December 12th 2014.
http://www.caa.co.uk/docs/2942/v3%200%2 ... 202014.pdf
Note: Having anything humorous in your signature is completely banned on this forum. Wear a tin-foil hat and you'll get a ban.

Any DMs sent on Twitter will be answered next month.

This is a doctor free zone.

W. H. Heydt
Posts: 11112
Joined: Fri Mar 09, 2012 7:36 pm
Location: Vallejo, CA (US)

Re: Redundancy

Tue Jun 23, 2015 5:11 pm

A couple of other points to ponder... To really take over, the secondary might need to have data that is current as of when the primary fails, in which case the secondary needs to be fully active and updating all the time. If you're using a database, you would want to sync the databases and keep the secondary up to date.

While I haven't automated the process to the extent you appear to want (I have live humans involved all the time), I do run a system that has a "hot backup" by running a master-slave configuration in MySQL across two SBCs. While this system doesn't use Pis, there is no reason why it couldn't. It does use Raspbian.
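As an illustration of keeping an eye on such a setup, a rough sketch (assuming the mysql-connector-python package; host, user and password are placeholders) that could run on the slave to confirm replication is still healthy:

Code:

# Check MySQL master-slave replication health on the slave node.
# Assumes the mysql-connector-python package; credentials are placeholders.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="monitor", password="secret")
cur = conn.cursor(dictionary=True)
cur.execute("SHOW SLAVE STATUS")
status = cur.fetchone()

if status is None:
    print("This node is not configured as a replication slave")
elif status["Slave_IO_Running"] == "Yes" and status["Slave_SQL_Running"] == "Yes":
    print("Replication OK, %s second(s) behind master" % status["Seconds_Behind_Master"])
else:
    print("Replication broken: %s" % status["Last_Error"])

cur.close()
conn.close()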

asandford
Posts: 1997
Joined: Mon Dec 31, 2012 12:54 pm
Location: Waterlooville

Re: Redundancy

Tue Jun 23, 2015 8:17 pm

atmosteam wrote: Hello,
I am designing a system which has a Raspberry Pi, but I want redundancy in case of failure. If possible, I want active redundancy, meaning that if one Raspberry Pi fails, the other one is turned on and runs the corresponding program by itself.
I'd like to know whether it is possible to implement something like this that works automatically.

I also want to know if it is possible to have the unused Raspberry Pi not powered off, but in a low-consumption idle mode, doing no work, and wake it up with an input signal.

Any suggestions?
High Availability (or clustering, in Windows parlance) is a non-starter for the Pi, as the cost of the supporting hardware infrastructure will not be worth it. For true HA (and I've worked on many), you will need at least a heartbeat network (preferably low-level), shared drives (probably over iSCSI if you can get it to work), and at least three nodes to provide a 'quorum'.

If the system you are designing is as critical as you describe, look to other hardware.

Heater
Posts: 13926
Joined: Tue Jul 17, 2012 3:02 pm

Re: Redundancy

Wed Jun 24, 2015 4:08 am

asandford,

I'm going to disagree with everything you have said there. Just for the sake of argument, no hard feelings :)

Fault tolerant systems (or multiply-redundant systems, as the world calls them) are quite possible to build with the Pi and could be quite cheap. The supporting hardware need only provide multiple Pis with independent power supplies and a means of interconnecting them all point to point.

Shared drives are not required and probably detrimental to the goal.

Now, it turns out that if you want to be sure the system will run correctly in the face of a single fault (failure of one Pi node or interconnect), then you need 4 such nodes, each connected point-to-point to the others. It might seem that 3 would be sufficient to determine correct results by some kind of voting or consensus algorithm, but Leslie Lamport demonstrated this is not so in his famous paper "The Byzantine Generals Problem" http://research.microsoft.com/en-us/um/ ... bs/byz.pdf. A fascinating read.

That paper determines that, in general, tolerating n failures in a multiply-redundant system requires 3n + 1 fully interconnected nodes.

A good place to start reading about these things is here: https://en.wikipedia.org/wiki/Byzantine ... Yeh2001-27. Strangely they say there that the Fly By Wire system (its Primary Flight Computers) of the Boeing 777 is Byzantine fault tolerant. Having worked for a year or more on the team testing that system I'm pretty sure it is not. There are only three PFC boxes working together!

An example of a fault tolerant database that can be implemented on a network of such Pis is the etcd database: https://github.com/coreos/etcd
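As a taster, talking to etcd's (v2-era) HTTP API from Python needs nothing more than the requests package; the key and value below are made up for illustration:

Code:

# Write and read a key in an etcd cluster via its v2 HTTP API.
# Assumes the requests package and etcd on the default client port.
import requests

BASE = "http://127.0.0.1:2379/v2/keys"

# Store a value; the receiving node forwards the write to the cluster leader.
requests.put(BASE + "/demo/message", data={"value": "hello from one Pi"})

# Read it back, potentially from any other node in the cluster.
resp = requests.get(BASE + "/demo/message")
print(resp.json()["node"]["value"])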

@atmosteam,

Without more detail of what you actually want your system to do and what level of reliability you are expecting, it's probably hard for people to make suggestions.
Memory in C++ is a leaky abstraction .

asandford
Posts: 1997
Joined: Mon Dec 31, 2012 12:54 pm
Location: Waterlooville

Re: Redundancy

Wed Jun 24, 2015 6:54 pm

Heater wrote:asandford,

I'm going to disagree with everything you have said there. Just for the sake of argument, no hard feeling :)
Fair enough, but everything I've said was taught at the week-long clustering course I took at Veritas in Reading a few (many) years ago, and was still true for the Windows cluster that I configured today.

Fault tolerant systems usually have multiple redundant components with no single point of failure, and aren't the same as HA systems (which the OP suggested by mentioning a warm standby).

HA, fail-over or clusters (as the Windows world calls them) are not tolerant of faults and will fail (over) to another server (usually idle, but not always). The service tends to stop as it does so, albeit briefly (I believe that VMware can move VMs between hardware with no downtime, though).

Shared disks (perhaps a more accurate term would be software-switchable) are an absolute must for both Unix and Windows systems to fail over (services, disk mount points and IP addresses go down on one system and come up on another), and a private heartbeat network is preferred to avoid 'split-brain'. Three or more systems are also desirable to achieve state consensus.

A tutorial for building your own virtual Windows cluster with a couple of Windows VMs, SQL Server and a FreeNAS VM to provide the shared disks is here; probably a bit out of date, but worth it.

The acid test is to pull the mains lead(s) from your production server...as SAF used to say "squeaky bum time".

ghellquist
Posts: 68
Joined: Thu Aug 02, 2012 8:47 am
Location: Stockholm Sweden

Re: Redundancy

Wed Jun 24, 2015 9:15 pm

Too little information from the original poster.
Of course you could make a redundant system with Pis. But not just any redundant system. It will take some extra hardware, possibly a lot of extra hardware.

Answers to a few of these questions might help in trying to give an answer. Each answer will influence the solution.
- what is the system actually supposed to do (apart from failover)?
- what errors can you detect?
- who or what is going to detect the error and initiate failover?
- what do you want to do when too many errors occur, "fail-safe"?
- how much disturbance in "service" can you accept during fail-over? Total loss of memory, no data loss accepted?
- how long a disturbance can you accept? Days, hours, minutes, seconds, milliseconds, microseconds?
- what performance degradation can you accept during normal situation? Response times?
- who is programming the system?
- who is going to make the hardware adaptations?
- could you select a computer language and runtime system with a bit of support (say Erlang) ?
- what are the consequences of a hard failure? People dying?

// Gunnar

Heater
Posts: 13926
Joined: Tue Jul 17, 2012 3:02 pm

Re: Redundancy

Wed Jun 24, 2015 9:52 pm

asandford,

Notably, people who run high availability systems nowadays know nothing of Veritas; think Google, Facebook, etc.

There is no way to do fault tolerant systems without multiple redundancy.

I can imagine that on detection of failure of a component the system stalls a bit whilst sorting out what happened and what to do about it. That is where we need to know the availability requirements.

Shared disks sound like a disaster waiting to happen. Once a node has gone faulty and written crap to the disk, it can then die, or be killed, and the next node can inherit the crap data. Brilliant!

Three nodes is not enough, as Leslie Lamport's Byzantine Generals paper shows.
asandford wrote: The acid test is to pull the mains lead(s) from your production server...
A funny story:

When we were involved in testing the Boeing 777 Fly By Wire system the story came to us that the test pilot was to be paid 1 million dollars, up front, before setting foot in the plane. Made sense to me as life insurance must be a problem in such a job.

Anyway, the first flight was prepared and our test pilot boarded the plane. The first thing he did was to pull all the circuit breakers to the many digital systems on board. And then put them back in place. Sure enough a lot of the systems did not reboot correctly.

Fine "I'm not flying this machine anywhere until you have fixed that" he said. Job done!
Memory in C++ is a leaky abstraction .

atmosteam
Posts: 68
Joined: Mon Jun 22, 2015 7:22 am

Re: Redundancy

Thu Jun 25, 2015 4:35 pm

well, sorry for not answering... :shock:

This redundancy is not a super redundancy that keeps the system operative 99.99999% of the time; it is very simple.
I read that someone talked about redundancy with 3n+1 boards, and it is true; I studied it and I agree. But I am not looking for that kind of redundancy. I mean, I thought about detecting simple failures. For example (this system can send SMS), if I try to send a message and it is not possible, it is easy to detect with:

Code:

try:
    send_sms(message)          # the normal operation, e.g. sending the SMS
except Exception:
    # the send failed, so assume the module is dead and switch over
    switch_to_backup_module()  # (both function names are just placeholders)
With this construct I can easily detect a failure: if the module which sends this data is not able to do it, it means it has failed.
Then I switch to the other module.

If there is a short, the Raspberry will stop working, and so will the GPIO... (I know it by experience...). I must also say that there will be an Arduino (also with redundancy), and it will detect whether the Raspberry is working or not via a simple HIGH signal. As with SPI or I2C, if the line is HIGH it means that the modules and the channel are OK; this will work on the same principle.

Here, this Arduino will be in charge of swapping the Raspberries: the Arduino will send a message to the other Raspberry to begin working, and the failed Raspberry will be isolated with analog switches so that no signals enter or exit.
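On the standby Raspberry's side, the waiting-for-the-Arduino part could be as simple as this sketch using the RPi.GPIO library (the pin number and the workload are placeholders):

Code:

# Standby Pi: sleep until the Arduino pulls the wake line HIGH, then start.
# Uses the RPi.GPIO library; pin number and workload are placeholders.
import RPi.GPIO as GPIO

WAKE_PIN = 17   # GPIO the Arduino drives HIGH when the primary has failed

GPIO.setmode(GPIO.BCM)
GPIO.setup(WAKE_PIN, GPIO.IN, pull_up_down=GPIO.PUD_DOWN)

print("Standby: waiting for the Arduino's takeover signal...")
GPIO.wait_for_edge(WAKE_PIN, GPIO.RISING)   # blocks until the line goes HIGH

print("Takeover signal received - starting the application")
# start_the_application()   # placeholder for the real workload
GPIO.cleanup()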

Well, that's all. I agree with you on lots of aspects (but I must also say that my boss decided to do it this way...) :?

W. H. Heydt
Posts: 11112
Joined: Fri Mar 09, 2012 7:36 pm
Location: Vallejo, CA (US)

Re: Redundancy

Fri Jun 26, 2015 4:55 am

atmosteam wrote: This redundancy is not a super redundancy that keeps the system operative 99.99999% of the time; it is very simple.
Your application can only be down for a maximum of about three seconds per year?
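(For reference, the arithmetic behind that figure, taking the quoted seven nines at face value:

(1 − 0.9999999) × 365.25 × 24 × 3600 s ≈ 3.2 s of downtime per year.)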

adlambert

Re: Redundancy

Fri Jun 26, 2015 6:45 am

Heater wrote:

Shared disks sound like a disaster waiting to happen. Once a node has gone faulty and written crap to the disk, it can then die, or be killed, and the next node can inherit the crap data. Brilliant!
It's a pretty standard way of working - WSFC, for example, is only designed to protect from node failure; Availability Groups do the shared-nothing approach. And even where you have shared nothing and replication to keep the disks in sync, you can corrupt your data and replicate the corruption. There are ways to mitigate that, for example by having a delay on the replication, and each extra system you put in place to cover yourself must be weighed against your actual business requirement to understand whether it's worthwhile. People frequently go in for overkill when in reality they can withstand a couple of hours' downtime while a well-rehearsed recovery and restore takes place from backups - actually quite an easy process if you have VM image backups and frequent backups such as transaction log backups, so you can recover to the point before the failure/error. In practice it's a rare thing to actually need, but it does happen, like fire insurance.

There are a lot of lash-up methods to do something close to HA; it's not always necessary to go with the supported products like the aforementioned VCS (which is a support NIGHTMARE). If it works for your application (for example static web servers), then you can simply use a script to detect errors and update DNS to redirect traffic to a standby host. Or spend a bit more and use a load balancer (but avoid NLB on Windows :) )

Here's your horror scenario:
Employee with too much access deletes important data at 9:30AM and either doesn't realise or is too scared to say anything.
Business continues to operate and millions of transactions go through the system, each transaction would have had its value influenced by the previously deleted data.
The problem is detected at 5PM (it's a Friday of course).

So, we can recover the database to the point of the failure and lose a day's work (can we ever get those transactions back?)
We can recover a copy and extract the deleted data and inject it into the problem system - but then all the subsequent transactions should have reflected the presence of this data and so a developer has to create and test a fix to put it all right.

Last year we spec'd a system that recorded user activity so it could be replayed (not the same as transaction/redo logs, but more flexible) to get round this. The more you try to protect against, the more expensive it all gets.

So that's why we have managers to think about risks and stuff. If only there were some good ones.

Oh dear I've been rambling. Should get on with my work today..... designing an unnecessary database cluster.

DougieLawson
Posts: 36578
Joined: Sun Jun 16, 2013 11:19 pm
Location: Basingstoke, UK
Contact: Website Twitter

Re: Redundancy

Fri Jun 26, 2015 9:19 am

adlambert wrote: Here's your horror scenario:
Employee with too much access deletes important data at 9:30AM and either doesn't realise or is too scared to say anything.
Business continues to operate and millions of transactions go through the system, each transaction would have had its value influenced by the previously deleted data.
The problem is detected at 5PM (it's a Friday of course).
I had one of those 25 years ago.

The mainframe hard disk crashed overnight. So the operations department recovered the database, but they missed recovering the indexes to the same point. The system comes up at 09:00 and dies (with index errors) at 11:30. In that short time £4,000,000,000 worth of transactions had been processed on a bad database. We had a jigsaw puzzle: how to get those valuable transactions back (we couldn't just forget them, and we couldn't reprocess them).

We ended up getting a copy of the database back to the point where operations had screwed it up, and comparing it record for record with the broken database. Every record that differed generated an update that needed to be re-applied to another copy of the good database. It took us until 00:30 to get all the pieces collected, re-inserted and checked out that it was all consistent.

Data synchronisation and data replication have moved on just a teensy little bit in twenty-five years, but an HA system wouldn't have protected us from that screw-up. We'd have ended up with two systems with logically damaged databases.

HA works well for hardware problems. It's next to useless for software errors and about as useful as an ashtray on a motorbike for human errors.
Note: Having anything humorous in your signature is completely banned on this forum. Wear a tin-foil hat and you'll get a ban.

Any DMs sent on Twitter will be answered next month.

This is a doctor free zone.

Heater
Posts: 13926
Joined: Tue Jul 17, 2012 3:02 pm

Re: Redundancy

Fri Jun 26, 2015 3:33 pm

adlambert & DougieLawson,

Your horror stories do indeed sound horrendous. However, what you are describing seems to be a consequence of human error rather than hardware failure.

Building systems that are resistant to human malfunction is a problem yet to be solved :)

Although, Dougie, that situation you describe starts with a hard drive failure, which of course should not be a single point of failure.

I get the idea that databases should be immutable. That is to say, whatever transactions went through in the past, correct or otherwise, stand forever as a record of what happened. If an error in some past transaction is discovered in the future, one should not go back and try to recreate/rewrite history from that past event. Rather, apply the correction as a new transaction happening now.

I believe this is how bookkeepers deal with such errors. They cannot go back and rewrite the books of time gone by.

It's also how source code management should go with git.
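A toy illustration of that append-only idea (the figures are invented): old entries are never edited; a reversal and a corrected entry are appended instead, so the record of what happened is preserved.

Code:

# Append-only "ledger": corrections are new entries, never edits of old ones.
# All figures here are invented for illustration.
ledger = [
    {"id": 1, "desc": "invoice 1001", "amount": 500},
    {"id": 2, "desc": "invoice 1002", "amount": 250},
]

# Later we discover entry 1 was wrong. We do not rewrite it; we append a
# reversal and then the corrected figure, so history stays intact.
ledger.append({"id": 3, "desc": "reversal of entry 1", "amount": -500})
ledger.append({"id": 4, "desc": "invoice 1001, corrected", "amount": 450})

print("Current balance:", sum(e["amount"] for e in ledger))   # 700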
Memory in C++ is a leaky abstraction .

asandford
Posts: 1997
Joined: Mon Dec 31, 2012 12:54 pm
Location: Waterlooville

Re: Redundancy

Sat Jun 27, 2015 11:48 pm

[quote="Heater"]asandford,Notably people who run high availability systems now a days know nothing of Veritas, think Google, Facebook etc.[quote]
Complete tripe, of course they know about VCS, windows clustering, etc
[quote="Heater"]There is no way to do fault tolerant systems without multiple redundancy.[quote]
We're not talking about fault tolerant systems, they cost millions
[quote="Heater"]I can imagine that on detection of failure of a component the system stalls a bit whilst sorting out what happened and what to do about it. That is where we need to know the availability requirements.[quote]
Not really, they have the ability to predict when components may fail, and generate alerts. If a power supply fails, there is no 'stall' as the other(s) are still working (same wth most hardware you care to mention)
[quote="Heater"]Shared disks sounds like a disaster waiting to happen. Once a node has gone faulty and written crap to the disk it can then die, or be killed, and the next node can inherit the crap data. Brilliant![quote]
That has nothing to do with HA systems, that's an application fault. To have consistency for system files, you have three or more copies so it's not a problem
[quote="Heater"]Three nodes is not enough, as Leslie Lamport's Byzantine Generals paper shows.[quote]
Please provide a link, but normally 3 nodes are sufficient to provide consensus (of course, the more the better).

That's gone a bit wrong!

asandford
Posts: 1997
Joined: Mon Dec 31, 2012 12:54 pm
Location: Waterlooville

Re: Redundancy

Sat Jun 27, 2015 11:59 pm

DougieLawson wrote: HA works well for hardware problems. It's next to useless for software errors and about as useful as an ashtray on a motorbike for human errors.
It also works well for application failures (crash, not garbage data).

Heater
Posts: 13926
Joined: Tue Jul 17, 2012 3:02 pm

Re: Redundancy

Sun Jun 28, 2015 12:51 am

asandford,

Yes that went very wrong. Not just in the formatting :) Let's see:

I'm sure Google, Facebook, and co. know about VCS, Windows Clustering and such. However I think you will find they don't run their businesses on Windows. Any evidence to the contrary?

We are indeed talking about fault tolerant systems. It's right there in the opening post.

Fault tolerant systems need not cost millions. It all depends on the functionality of the system, the performance and the level of reliability required. You could build a fault tolerant, multiply redundant LED flasher with four Arduinos! A network of four Raspberry Pis will let you build a fault tolerant data store; I have one running etcd. You can build large scale fault tolerant databases with cheap PCs and open source databases. You can do all that on systems distributed around the world for a few dollars a month using the services of Google, AWS, etc.

Yes, one can predict component failure and take pre-emptive action. You are crazy to rely on that, as it defies Murphy's Law!

Depending on your set-up there may well be a stall when a node goes down. Some database systems have a "master" and a bunch of "slaves"; perhaps all writes go through the master first, and if that goes down it may take a while for the cluster to elect a new master.

If you have multiple nodes and hence copies of your data, what is the point of sharing a disk between them? Please elaborate.

For sure corrupt data on disc is not solely an application fault. There is plenty of hardware between app and disk that can fail and corrupt stuff silently.

The famous paper on the Byzantine Generals problem is here: http://research.microsoft.com/en-us/um/ ... yz.pdf

Three nodes may well provide consensus most of the time. It does however have failure modes that can confuse it and cause it never to reach consensus.
Memory in C++ is a leaky abstraction .

adlambert

Re: Redundancy

Sun Jun 28, 2015 7:18 pm

Yes, it might be easy to get drawn into thinking that HA is subject to some specific rules in the way that you implement it. In fact there are many ways to skin a cat and the balance of risk against solution is where we make our judgement.

And it's also easy to try and simplify by dismissing human errors as something entirely different from hardware failures, when in fact if you are going to the trouble of protecting yourself against one type of error then you might as well consider how you might use some of that effort to protect against another.

And there are ways to defend against the error horror stories described, just not with your average system.

Those phrases High Availability, Disaster Recovery, Business Continuity are all connected, the human error can be a disaster, and the problem runs its tentacles well beyond the IT department. In fact IT does this stuff routinely now, and most of the problem is convincing those out in the core business that they need to get involved when they would rather just simplify the problem as purely an IT one.

As for VCS, most of my encounters have been with people having trouble with it, and the first thing we do is start planning how we are going to bin it.


In the end, I'm grateful for it all, because it's been paying my salary for more than 20 years.

asandford
Posts: 1997
Joined: Mon Dec 31, 2012 12:54 pm
Location: Waterlooville

Re: Redundancy

Mon Jun 29, 2015 8:22 pm

Heater wrote:asandford,

Yes that went very wrong. Not just in the formatting :) Let's see:

I'm sure Google, Facebook, and co. know about VCS, Windows Clustering and such. However I think you will find they don't run their businesses on Windows. Any evidence to the contrary?

We are indeed talking about fault tolerant systems. It's right there in the opening post.

Fault tolerant systems need not cost millions. It all depends on the functionality of the system, the performance and the level of reliability required. You could build a fault tolerant, multiply redundant LED flasher with four Arduinos! A network of four Raspberry Pis will let you build a fault tolerant data store; I have one running etcd. You can build large scale fault tolerant databases with cheap PCs and open source databases. You can do all that on systems distributed around the world for a few dollars a month using the services of Google, AWS, etc.

Yes, one can predict component failure and take pre-emptive action. You are crazy to rely on that, as it defies Murphy's Law!

Depending on your set-up there may well be a stall when a node goes down. Some database systems have a "master" and a bunch of "slaves"; perhaps all writes go through the master first, and if that goes down it may take a while for the cluster to elect a new master.

If you have multiple nodes and hence copies of your data, what is the point of sharing a disk between them? Please elaborate.

For sure corrupt data on disc is not solely an application fault. There is plenty of hardware between app and disk that can fail and corrupt stuff silently.

The famous paper on the Byzantine Generals problem is here: http://research.microsoft.com/en-us/um/ ... yz.pdf

Three nodes may well provide consensus most of the time. It does however have failure modes that can confuse it and cause it never to reach consensus.
Long day, CBA to split your post, so here we go:
1. I've only used VCS on Solaris (AIX has its own system). Google keep their cards very close to their chest, so your guess is as good as mine as to what they run.
2. No, we're not talking about FT systems; it's right there in the OP - "hot standby" - FT systems DON'T fail over, they DON'T fail. Full stop. End of. If an FT system fails over, it obviously isn't FT ('cos it's failed!)
3. I give up... (HA is not FT, and FT is not HA). If they were the same, then why have they got different names?
4. There are times that 3 nodes may not reach consensus, but you stated that you never need that many (and <3 can never reach consensus). I've built a 12-node HA system, and I'd be very surprised if that didn't ever reach consensus.
5. With shared drives, you have *one* copy of the data: the drive share, LUN, NAS path, iSCSI address, whatever; it is moved between active servers.
6. "Yes, one can predict component failure and take pre-emptive action. You are crazy to rely on that, as it defies Murphy's Law!" - I'm sure IBM would have loved to have known that, as they obviously didn't when they built it into all the various ?Series servers (x, i, p, take your choice) - you could probably have saved them millions with your insight.
7. There is no solution (apart from regular backups - my specialist subject) that will help against application-corrupted data, neither HA nor FT.
8. Your HBA or disk controller can 'silently' corrupt data? I've seen plenty of both types of failure, but they have always written to logs (you do look at logs?).
9. That PDF was written in 1982; things have moved on since then!

asandford
Posts: 1997
Joined: Mon Dec 31, 2012 12:54 pm
Location: Waterlooville

Re: Redundancy

Mon Jun 29, 2015 8:27 pm

adlambert wrote: In the end, I'm grateful for it all, because it's been paying my salary for more than 20 years.
Same here.

r4049zt
Posts: 113
Joined: Sat Jul 21, 2012 1:36 pm
Contact: Website

Re: Redundancy

Mon Jun 29, 2015 8:50 pm

Given that a Raspberry Pi B+ uses about 2 to 3 watts, you could have eight of them always on (yes, eight!) for less power than a <brand-name> rival, all running some sort of check-the-other LAN program, and tell the router that the first one which agrees with the majority is the server to get outside traffic. See the above post about the Byzantine Generals problem: four rPi servers can do best-of-three while one is broken; eight can cope with two failures. You'd probably want multiple independent power supplies, since a brownout becomes more of a risk than three dead rPi servers. Your multiple independent UPS becomes a critical part of the system, but at least it is small. Car batteries and 5 V regulated buck converters, anyone?
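A sketch of that "only serve if you agree with the majority" check in Python (peer addresses and the port are placeholders; each Pi would run something similar):

Code:

# Quorum check: count how many peer Pis this node can reach and only take
# the active role when it is in the majority partition.
import socket

PEERS = ["10.0.0.2", "10.0.0.3", "10.0.0.4"]   # placeholder addresses of the other Pis
PORT = 5000                                     # placeholder port their check service listens on

def reachable(host):
    try:
        s = socket.create_connection((host, PORT), timeout=2)
        s.close()
        return True
    except socket.error:
        return False

visible = 1 + sum(1 for peer in PEERS if reachable(peer))   # count ourselves too
cluster_size = 1 + len(PEERS)

if visible > cluster_size // 2:
    print("In the majority partition (%d of %d) - OK to serve" % (visible, cluster_size))
else:
    print("Minority partition - standing down to avoid split-brain")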

The most probable cause of end of life is update creep, so choose an operating system which promises it won't break and need major surgery every few years, nor need an ever-increasing churn of updates.

Heater
Posts: 13926
Joined: Tue Jul 17, 2012 3:02 pm

Re: Redundancy

Tue Jun 30, 2015 11:11 am

asandford,

Thanks for the interesting debate. I agree and disagree as follows:

Yes, Google is not going to be discussing the details of its "secret sauce". However, we have some clues to inform any guesswork as to what they use. Google is about the tenth biggest corporate contributor to the Linux kernel, after the likes of IBM and Intel. They are about the tenth biggest introducer of new developers to the kernel project this year. They are heavily involved in projects like the Clang/LLVM compiler and tools. We can get clues from the many technical presentations they have on YouTube. All this and more points to Linux based systems.

We are indeed talking about fault tolerant systems, it's right there in the opening post "I want redundancy in case of failure..." That clearly says I want the system, as a whole, to work even if a part of it fails, i.e. fault tolerant.

Of course FT and HA have different names. They are descriptive of different requirements of a system. FT is all about what happens when parts of a system fail: how many simultaneous internal faults can it tolerate, for example. We might even decide that the reliability of some parts is so high we won't even worry about failures there. HA is all about, well, availability. Is that service down for more than one hour a year? Or perhaps 10 ms per second, as in fly-by-wire systems.

Lamport shows that there are always possible failure modes where 3 nodes will fail to reach a decision in the face of failure of a single node or interconnect. I don't believe I ever said "...you never need that many [3]". Do point out where I did and I will fix the error.

You will have to elaborate on that shared drive thing. If the drive or its controller or interconnect fails, having it shared is pointless. If you have multiple redundant replica drives to make a failure tolerable, why share them?

You are crazy to rely on component failure prediction; the obvious emphasis there is on "rely". Obviously any component can fail before its predicted time, or the prediction itself can fail :). That does not mean there are not good reasons to have such prediction. Clearly the probability of failure of any component rises with age (see the "bathtub curve"), or there is a limit to write cycles, or whatever. Swapping out aged components may well make it statistically less likely that more faults happen than your system can handle.

I agree, if your application is buggy, dutifully doing what you accidentally programmed there, none of this helps. Hence the sometimes used technique of having multiple teams develop the same application functionality using different tools. An attempt to make the software itself a component that can fail and be tolerated.

Suggesting the Byzantine Generals paper is now wrong or irrelevant is like suggesting Newtonian mechanics is old and therefore irrelevant. The faults it describes are still with us, as surely as F = ma or Pythagoras' theorem. The little formula for the number of nodes and connections required in order to operate with a given number of faulty nodes is still true. Yes, indeed, the algorithms to handle this have been developed over the years.
Memory in C++ is a leaky abstraction .
