Heater wrote:asandford,
Yes that went very wrong. Not just in the formatting

Let's see:
I'm sure Google, Facebook, and co. know about VCS, Windows Clustering and such. However, I think you will find they don't run their businesses on Windows. Any evidence to the contrary?
We are indeed talking about fault tolerant systems. It's right there in the opening post.
Fault tolerant systems need not cost millions. It all depends on the functionality of the system, the performance and the level of reliability required. You could build a fault tolerant, multiply redundant LED flasher with four Arduinos! A network of four Raspberry Pis will let you build a fault tolerant data store; I have one running etcd. You can build large scale fault tolerant databases with cheap PCs and open source databases. You can do all that on systems distributed around the world for a few dollars a month using the services of Google, AWS etc.
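To put a number on how little hardware majority consensus needs (a sketch of the standard majority-quorum arithmetic used by Raft-based stores such as etcd, not a description of any particular setup):

```python
# Majority-quorum arithmetic for a crash-fault-tolerant cluster.

def quorum(n: int) -> int:
    """Smallest majority of n nodes."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Nodes that can crash while a majority survives."""
    return n - quorum(n)

for n in (3, 4, 5, 12):
    print(f"{n} nodes: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Note that a 4-node cluster tolerates only one failure, the same as a 3-node cluster, which is why odd cluster sizes are usually recommended.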
Yes, one can predict component failure and take pre-emptive action. You are crazy to rely on that, as it defies Murphy's Law!
Depending on your set up there may well be a stall when a node goes down. Some database systems have a "master" and a bunch of "slaves", perhaps all writes go through the master first, if that goes down it may take a while for the cluster to elect a new master.
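That stall window can be sketched with a toy model (hypothetical, not any real database's protocol): writes fail while no master exists, until the survivors elect a new one.

```python
# Toy master election: when the master dies, writes stall until the
# surviving node with the highest id is promoted (a bully-style rule;
# real systems like Raft add terms, timeouts and log consistency checks).

class Cluster:
    def __init__(self, node_ids):
        self.alive = set(node_ids)
        self.master = max(self.alive)

    def fail(self, node_id):
        self.alive.discard(node_id)
        if node_id == self.master:
            self.master = None          # writes stall from here...

    def elect(self):
        if self.master is None and self.alive:
            self.master = max(self.alive)  # ...until a new master is chosen

    def write(self, data):
        if self.master is None:
            raise RuntimeError("no master: write stalled")
        return f"written via node {self.master}"

c = Cluster([1, 2, 3])
c.fail(3)                   # master gone
try:
    c.write("x")
except RuntimeError as e:
    print(e)                # the stall window
c.elect()
print(c.write("x"))         # writes resume via node 2
```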
If you have multiple nodes and hence copies of your data, what is the point of sharing a disk between them? Please elaborate.
For sure corrupt data on disc is not solely an application fault. There is plenty of hardware between app and disk that can fail and corrupt stuff silently.
The famous paper on the Byzantine Generals problem is here: [url]http://research.microsoft.com/en-us/um/ ... yz.pdf[/url]
Three nodes may well provide consensus most of the time. It does however have failure modes that can confuse it and cause it never to reach consensus.
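The classic bound from that paper makes this concrete: tolerating f Byzantine (arbitrarily misbehaving) nodes requires n >= 3f + 1 nodes in total, so three nodes tolerate zero Byzantine faults, and a single lying node can block agreement. A quick sketch:

```python
# Byzantine fault bound from Lamport, Shostak & Pease:
# agreement requires n >= 3f + 1 nodes to survive f traitors.

def max_byzantine_faults(n: int) -> int:
    """Largest f satisfying n >= 3f + 1."""
    return (n - 1) // 3

for n in (3, 4, 7):
    print(f"{n} nodes tolerate {max_byzantine_faults(n)} Byzantine fault(s)")
# Three nodes tolerate none: one lying node is enough to prevent consensus.
```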
Long day, CBA to split your post, so here we go:
1. I've only used VCS on Solaris (AIX has its own system). Google keep their cards very close to their chest, so your guess is as good as mine as to what they run.
2. No, we're not talking about FT systems, it's right there in the OP - "hot standby" - FT systems DON'T fail over, they DON'T fail. Full stop. End of. If an FT system fails over, it obviously isn't FT ('cos it's failed!)
3. I give up ... (HA is not FT, and FT is not HA). If they were the same, why would they have different names?
4. There are times that 3 nodes may not reach consensus, but you stated that you never need that many (and fewer than 3 can never reach consensus). I've built a 12 node HA system, and I'd be very surprised if it ever failed to reach consensus.
5. With shared drives, you have *one* copy of the data: the drive share, LUN, NAS path, iSCSI address, whatever; it is moved between the active servers.
6. "Yes, one can predict component failure and take pre-emptive action. You are crazy to rely on that, as it defies Murphy's Law!" - I'm sure IBM would have loved to have known that, as they obviously didn't when they built it into all the various ?Series servers (x, i, p, take your choice) - you could probably have saved them millions with your insight.
7. There is no solution (apart from regular backups - my specialist subject) that will help against application-corrupted data, neither HA nor FT.
8. Can your HBA or disk controller 'silently' corrupt data? I've seen plenty of both types of failure, but they have always written to logs (you do look at logs?)
9. That PDF was written in 1982, things have moved on since then!