Greg Cope mailed us a few weeks ago with a pointer to this project, which has been monitoring DevOps at the Financial Times (FT) here in the UK.
It’s hard for everyone in the group to simultaneously maintain an overview of the health of the stack under normal circumstances. They use Nagios, a great piece of kit with one fatal flaw: Nagios emails everybody on the team every time a check changes state. Checks change state all the time, and the resulting flood of email has left the FT team in a state where absolutely none of them reads messages from Nagios, because they clog up everyone’s inboxes.
Silvano Dossan, who works on the team, says:
Our Nagios servers have been configured to check every important parameter, from basic disk and CPU checks to HTTP, application, database and jconsole via Jolokia. All we need is some way to communicate clearly when a check fails.
The team rejected shared office displays in the form of monitors (too much text, too hard to read from a distance). They also rejected a particularly horrible idea whereby a single team member would be allotted the task of staying alert and monitoring all Nagios’ mails for the week, feeding back news of any disasters to the rest of the group. Sounds horrific.
Silvano sat back to think about exactly what they needed and didn’t need from alerts.
None of the above satisfied our needs. Something is missing. When something fails I want an alarm bell, a siren, or a flashing light that is so bright my eyes explode. A warning system that is in everyone’s face. No escape. There should be no excuse for anyone to not know when something in the stack has broken. “What do you mean you didn’t know the site was down, there is a mongoose running around the office!”
Introducing SAWS! “Silvano’s Awesome Warning System”.
Well, I did spend my evenings and weekends making this, so forgive me for naming it after myself.
Rejecting the mongoose idea, Silvano bought a strip of something called BlinkyTape (having looked at their website, I’m off to buy one myself when I’m done writing this): a flexible strip of 60 RGB LEDs, with a microcontroller already embedded in the strip. Using a Raspberry Pi and a lot of glue and sticky tape, he produced a perfectly simple, unmissable display to demonstrate the health of the stack.
A good monitoring system should display the health status of the stack to as many people as possible in as simple a format as possible. The more people that know the health state of the stack, the better the chance of someone picking it up and resolving the issue quickly.
SAWS simply shows, by grouping LEDs, whether each Nagios server has an error. Green, orange, yellow, red and flashing red LEDs represent OK, Unknown, Warning, Critical and Critical for over 30 minutes, respectively. Blue LEDs swoosh back and forth like a Cylon to indicate the Python script is running and the data is up to date.
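To make the colour scheme concrete, here is a minimal Python sketch of how a state-to-colour mapping like the one described might be laid out across the 60 LEDs. It is not Silvano’s code: the names (`STATE_COLOURS`, `colour_for`, `build_frame`), the exact RGB values and the equal-sized group per server are all assumptions made for illustration, and it only builds the list of colours rather than talking to the strip.

```python
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]

LED_COUNT = 60                    # the strip has 60 RGB LEDs
FLASH_AFTER_SECONDS = 30 * 60     # Critical for over 30 minutes starts flashing
OFF: RGB = (0, 0, 0)

# Colours follow the scheme in the text; the RGB values are assumptions.
STATE_COLOURS: Dict[str, RGB] = {
    "OK": (0, 254, 0),            # green
    "UNKNOWN": (254, 100, 0),     # orange
    "WARNING": (254, 254, 0),     # yellow
    "CRITICAL": (254, 0, 0),      # red
}


def colour_for(state: str, seconds_in_state: float, blink_on: bool) -> RGB:
    """Return the colour for one server's LED group.

    A long-running Critical flashes: it is dark whenever blink_on is False.
    """
    long_critical = state == "CRITICAL" and seconds_in_state > FLASH_AFTER_SECONDS
    if long_critical and not blink_on:
        return OFF
    return STATE_COLOURS[state]


def build_frame(servers: List[Tuple[str, float]], blink_on: bool) -> List[RGB]:
    """Give each server an equal-sized group of LEDs (hypothetical layout)."""
    if not servers:
        return [OFF] * LED_COUNT
    group_size = LED_COUNT // len(servers)
    frame: List[RGB] = []
    for state, seconds_in_state in servers:
        frame.extend([colour_for(state, seconds_in_state, blink_on)] * group_size)
    # Pad any LEDs left over when 60 does not divide evenly.
    frame.extend([OFF] * (LED_COUNT - len(frame)))
    return frame
```

Calling `build_frame` repeatedly while toggling `blink_on` would produce the flashing effect for a server that has been Critical for more than half an hour; the blue activity sweep is left out of this sketch.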
It’s an ingenious solution, and it works. There can’t be a cleaner stack in the country now that SAWS is in place, and the team have been incredibly enthusiastic about the change. You can read more about it over at Engine Room, the FT’s tech blog, and Silvano has made all the code available on GitHub.