YoungJules
Posts: 14
Joined: Thu Jan 26, 2012 12:13 pm

Reliability in production

Wed Jan 22, 2014 8:26 am

I'm testing a few Raspberry Pis for 'production' environments. Example uses are remote control of heating systems (two already deployed) and access control (someone presents a token and the Pi decides whether to allow access or not).

The requirement I have is that the systems have to be 100% rock-solid reliable. This morning, for some unknown reason, the Pi I have here controlling my heating system just stopped responding. I couldn't reach the webpage where I control my heating, nor could I SSH into the box. When I came to look at the Pi, I could see that the blue light on the Edimax wifi dongle was continuously lit, instead of flashing as is normal. After killing the power and restarting, I could get into the box. From the syslog, I could see that the Pi had rebooted at 5am this morning. Further investigation of the syslog has not revealed any obvious reason why it rebooted, nor why it appeared to have 'locked up' after the reboot.

Perhaps significant is that I did update the firmware yesterday on my home-heating-controlling Pi, but after the update everything appeared to be fine. The second Pi I have controlling the heating at a client site has not yet been updated. There have been 'reliability' problems with that system too, but some of those have for sure been caused by the client being careless with the cables, placement of the Pi etc.

So I guess I have a couple of questions:

In general, how reliable are you all finding the RaspberryPi in 'high-availability', 'long-uptime' environments?
Specifically, any suggestions on how better to track down the problem that manifested itself this morning?

Thanks for any and all help!

Kind regards,
YoungJules

simonmcc
Posts: 181
Joined: Mon Aug 19, 2013 10:07 pm

Re: Reliability in production

Wed Jan 22, 2014 11:41 am

I also have two Pis deployed for central heating control, one in a community hall, one at home, and I have a garage door control Pi too.

They seem to be fine for long term reliability, but I've applied a bit of strategy to make it that way.

My heating controller and garage Pis both have scripts executing once per minute 24/7 to check if they have a connection to my router, and if not they cycle wlan0 until they do. This is required as sometimes the connection drops, and I get the same result as you.

Secondly, I never just do an update on any of the systems. Any software upgrade needs soak tested on a non-live test environment for a significant period of time (depending on how big the upgrade is).

Thirdly, I have an extra image of the production SD card in each of the installations, so that if it goes belly-up then I can swap the card and have a stable system.

You may find something in one of the /var/log/* files which might give you a clue. My home heating controller started randomly rebooting and then freezing, and I took the card out and did an upgrade on it to make it work again.

I am not 100% convinced that they are great for production use; IMHO they need a certain amount of babysitting in anything but the most clinical environments.
simonmcc.blogspot.com/search/label/pi

jamesh
Raspberry Pi Engineer & Forum Moderator
Posts: 26452
Joined: Sat Jul 30, 2011 7:41 pm

Re: Reliability in production

Wed Jan 22, 2014 11:52 am

I think most problems over very long term use will be down to the SD card wearing out. There are certainly things that can be done to mitigate that risk though - reduce the writes to the card, perhaps use a small USB HD for storage.

There are also the standard things that affect long term devices of any sort - power outages, radiated noise (lightning EMP etc), software bugs - that you need to protect against.

I have a heating controller - a proper commercial one, a few years old. Every now and then it crashes, leaving the heating going full blast. So even the commercial ones have their faults.
Principal Software Engineer at Raspberry Pi (Trading) Ltd.
Contrary to popular belief, humorous signatures are allowed.
I've been saying "Mucho" to my Spanish friend a lot more lately. It means a lot to him.

shuckle
Posts: 565
Joined: Sun Aug 26, 2012 11:49 am
Location: Finland

Re: Reliability in production

Wed Jan 22, 2014 12:29 pm

I have three production Pis monitoring weather, water, oil and electricity and showing camera streams. All of those have been rock solid.
I only have to reboot the camera Rpi (raspberry6) since omxplayer sometimes gets stuck after restarts.
So I think these are very suitable for production after you find the stable system, which is not always easy.
And I agree that if it is in production and you manage to get it stable, why would you want to upgrade any software? (Unless it is connected to the external internet, in which case you naturally need to install security updates.)

Code:

~$ ssh raspberry1 w
 14:21:31 up 362 days, 23:53,  0 users,  load average: 0,67, 0,38, 0,37
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
$ ssh raspberry4 w
 14:21:41 up 189 days,  1:39,  0 users,  load average: 0.19, 0.06, 0.06
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
$ ssh raspberry6 w
 14:21:46 up 21 days,  4:58,  0 users,  load average: 1.36, 1.35, 1.43
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
$ ssh raspberry1 uname -a
Linux raspberrypi 3.1.9+ #272 PREEMPT Tue Aug 7 22:51:44 BST 2012 armv6l GNU/Linux

YoungJules
Posts: 14
Joined: Thu Jan 26, 2012 12:13 pm

Re: Reliability in production

Wed Jan 22, 2014 2:13 pm

Thanks for the replies so far!
simonmcc wrote: Secondly, I never just do an update on any of the systems. Any software upgrade needs soak tested on a non-live test environment for a significant period of time (depending on how big the upgrade is).
I just want to clarify that the Pi I updated was kind of a test one... it's the one I have here at home connected to my heating system. So, if it goes wrong, I can easily get to it and if the worst happens, I can always switch on my heating manually (or put on an extra sweater)! The reason I updated this one is that I can test and see how it goes, then later update the Pi's I have onsite at remote locations.

Also, just to be clear, none of these Pi's are providing life-critical services... it's either remote-control for a heating/lighting system or an access-control system we're talking about for the time being.

The main thing that's causing me concern is that there doesn't seem to be any 'explanation' in the logs about why it suddenly locked up or why it rebooted just before that in the early hours of the morning. No other PCs or servers were affected.

I take on board the suggestion to use good SD cards, and will swap this one out for a known good one.

All other pointers/suggestions are still welcome!

Thanks and kind regards,
YoungJules

Ravenous
Posts: 1956
Joined: Fri Feb 24, 2012 1:01 pm
Location: UK

Re: Reliability in production

Wed Jan 22, 2014 2:24 pm

YoungJules wrote: All other pointers/suggestions are still welcome!
Suggestion 1: don't update the software on a system unless it needs updating. But of course yours is effectively the test system, so that's fine. You successfully detected a possible fault. (Assuming something connected with the update caused the failure.)

Suggestion 2: triple redundant system. But that's way beyond your requirement of course, so I'm just saying this "in theory". Someone interested in this could try three identical computers running different releases of the software, so a glitch like this would happen to one and its demise would be detected by the others.

By the way I don't suppose some sort of mains glitch could have caused the problem? (Something like that could trash the fancy triple system too, of course.)

YoungJules
Posts: 14
Joined: Thu Jan 26, 2012 12:13 pm

Re: Reliability in production

Sat Feb 22, 2014 10:43 pm

simonmcc, would you care to share your script? :)

I wrote/adapted one, but it is only looking, and doesn't do anything useful if it finds a problem...plus I'm not quite sure how useful some of the info is in tracking down the error... it doesn't seem to be getting hot, running out of ram, running too many processes...
Date/Time: Saturday 22 February 2014 at 23:40
Free RAM: 338 (374)
Nr. of processes: 83
Up time: 54 min
Nr. of connections: 1
Temperature in C: 47.6
IP-address: 10.26.9.250
CPU speed: arm_freq=0

Ping router: OK
wlan0 IEEE 802.11bgn ESSID:"JULESHOMELAN" Nickname:"<WIFI@REALTEK>"
Mode:Managed Frequency:2.412 GHz Access Point: 60:A4:4C:9F:5F:68
Bit Rate:150 Mb/s Sensitivity:0/0
Retry:off RTS thr:off Fragment thr:off
Power Management:off
Link Quality=95/100 Signal level=49/100 Noise level=0/100
Rx invalid nwid:0 Rx invalid crypt:0 Rx invalid frag:0
Tx excessive retries:0 Invalid misc:0 Missed beacon:0
When it goes wrong, having installed postfix, I now get a local mail on the pi:
From pi@Domotica Sat Feb 22 21:40:02 2014
Return-Path: <pi@Domotica>
X-Original-To: pi
Delivered-To: pi@Domotica
Received: by Domotica (Postfix, from userid 1000)
id 8750F24DBF; Sat, 22 Feb 2014 21:40:02 +0000 (UTC)
From: root@Domotica (Cron Daemon)
To: pi@Domotica
Subject: Cron <pi@Domotica> /home/pi/dev/python/system_info.py
Content-Type: text/plain; charset=UTF-8
X-Cron-Env: <SHELL=/bin/sh>
X-Cron-Env: <PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin>
X-Cron-Env: <HOME=/home/pi>
X-Cron-Env: <LOGNAME=pi>
Message-Id: <20140222214002.8750F24DBF@Domotica>
Date: Sat, 22 Feb 2014 21:40:02 +0000 (UTC)

connect: Network is unreachable
ssh: connect to host 10.26.9.200 port 22: Network is unreachable
lost connection
I guess I just need to add something that will force a reboot or reload of wlan0 when I get that error...
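
A minimal sketch of that missing recovery step, in Python 3 since the monitoring script is already Python. Everything named here (`router_reachable`, `choose_action`, `recover`, the `reboot_after` threshold) is made up for illustration, and the `ifup --force wlan0` command is an assumption about an ifupdown-based setup, so treat it as a starting point rather than a finished script:

```python
import subprocess

# Assumed commands for an ifupdown-based system; adjust the paths as needed.
RECOVERY_COMMANDS = {
    "restart_wlan": ["/sbin/ifup", "--force", "wlan0"],
    "reboot": ["/sbin/reboot"],
}


def router_reachable(router_ip, timeout_s=5):
    """Return True if one ping to router_ip gets a reply within timeout_s."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), router_ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def choose_action(reachable, consecutive_failures, reboot_after=5):
    """Pick 'none', 'restart_wlan' or 'reboot' for this run of the check."""
    if reachable:
        return "none"
    if consecutive_failures + 1 >= reboot_after:
        return "reboot"  # restarting wlan0 has not helped, so give up and reboot
    return "restart_wlan"


def recover(action):
    """Run the command for the chosen action (needs root); do nothing for 'none'."""
    command = RECOVERY_COMMANDS.get(action)
    if command is not None:
        subprocess.call(command)
```

From a cron job this would be called as something like `recover(choose_action(router_reachable(ROUTER_IP), failures))`, where `ROUTER_IP` is your router's address and `failures` is a count of consecutive failed checks that would have to be kept in a small file between runs, since each cron run starts fresh.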

Thanks again all, and kind regards,
YoungJules

simonmcc
Posts: 181
Joined: Mon Aug 19, 2013 10:07 pm

Re: Reliability in production

Mon Feb 24, 2014 10:24 am

YoungJules wrote: simonmcc, would you care to share your script? :)
Yes, I'll try to remember to post it later; I don't have access to it from here.

coolblue2000
Posts: 30
Joined: Tue Jan 08, 2013 9:48 am

Re: Reliability in production

Mon Feb 24, 2014 10:39 am

Is the wifi dongle plugged in to a powered hub? If not then could it have drawn too much power at some point and caused the system to crash?

mikerr
Posts: 2825
Joined: Thu Jan 12, 2012 12:46 pm
Location: UK
Contact: Website

Re: Reliability in production

Mon Feb 24, 2014 11:54 am

The Pi has a hardware watchdog that you can enable:

http://harizanov.com/2013/08/putting-ra ... g-to-work/

That'll save you from OS freezes, but not from a wifi adaptor dropping out.
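
For anyone wondering what "feeding" a watchdog involves, here is a rough Python 3 sketch of the generic Linux watchdog device interface. It is not taken from the linked article, and `pet_watchdog`, `healthy` and `max_pets` are hypothetical names. The assumptions are that the driver exposes `/dev/watchdog`, that any write resets the timer, and that writing `V` just before closing disarms it (the "magic close" feature, which a driver may have disabled):

```python
import time


def pet_watchdog(device, healthy, interval_s=10, max_pets=None):
    """Keep resetting the watchdog timer while healthy() returns True.

    If healthy() ever returns False, the loop stops writing and the device
    is closed WITHOUT the magic 'V', so the hardware timer stays armed and
    resets the board once its timeout expires.
    Returns the number of times the timer was reset.
    """
    pets = 0
    with open(device, "wb", buffering=0) as wd:
        while healthy():
            wd.write(b"\0")  # any write resets the countdown
            pets += 1
            if max_pets is not None and pets >= max_pets:
                wd.write(b"V")  # magic close: disarm on a deliberate exit
                break
            time.sleep(interval_s)
    return pets
```

A long-running service would call `pet_watchdog("/dev/watchdog", my_health_check)` as root, with `interval_s` comfortably shorter than the driver's timeout (check what your driver actually uses). If `my_health_check` also pinged the router, a dead wifi link would end in a reboot as well, which may or may not be what you want.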
Android app - Raspi Card Imager - download and image SD cards - No PC required !

simonmcc
Posts: 181
Joined: Mon Aug 19, 2013 10:07 pm

Re: Reliability in production

Tue Feb 25, 2014 10:24 am

My script to check is actually a perl script, simply because I like perl, and can knock something up quickly.

Here is the script: (/home/pi/network/wireless_check.pl)

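Code:

#!/usr/bin/perl
use strict;
use warnings;

# Check that wlan0 has an address and that the default gateway answers a ping.
# Older ifconfig prints "inet addr:1.2.3.4", newer versions print "inet 1.2.3.4".
my $status = `/sbin/ifconfig wlan0`;
my $down   = 0;

if ($status =~ /inet (?:addr:)?\d/) {
        # Find the default gateway in the routing table.
        my $routes = `/sbin/ip route list`;
        my ($gateway) = $routes =~ /^default via (\d+\.\d+\.\d+\.\d+)/m;

        # No default route, or no reply from the gateway: treat the link as down.
        my $ping = defined $gateway ? `ping -c 1 $gateway` : "no default gateway found\n";
        if ($ping !~ /bytes from/) {
                $down = 1;
                print "down\n";
                $status .= $ping;
        }
} else {
        # wlan0 has no IPv4 address at all.
        $down = 1;
}

if ($down) {
        # Log the failure, then force the interface back up.
        print scalar(localtime) . " connection is DOWN\n";
        print "status: " . $status;
        print `/sbin/ifup --force wlan0`;
}

It is then run every minute using cron (sudo crontab -e and add this line at the end):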

Code:

  * *  *   *   *     /home/pi/network/wireless_check.pl >> /var/log/wireless_check.log
This runs each minute, and will only write to the log if there is a problem with the connection. In my case this actually happens very rarely.

Hope this helps someone

[updated with input from DougieLawson]
Last edited by simonmcc on Tue Feb 25, 2014 1:17 pm, edited 1 time in total.

DougieLawson
Posts: 38885
Joined: Sun Jun 16, 2013 11:19 pm
Location: A small cave in deepest darkest Basingstoke, UK
Contact: Website Twitter

Re: Reliability in production

Tue Feb 25, 2014 11:01 am

You should change your perl script to determine the default router by reading /proc/net/route rather than having a hard coded value.

In a pure IPv6 network use /proc/net/ipv6_route
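
As a rough illustration of the /proc/net/route approach, here is a Python 3 sketch (rather than Perl, to match the original poster's monitoring script). The function name `default_gateway` is made up for this example. It assumes the usual table layout, where the second column is the destination, the third is the gateway and the fourth is the flags, all in hex, and it assumes a little-endian host such as the Pi:

```python
import socket
import struct

RTF_GATEWAY = 0x2  # route flag: this entry goes via a gateway


def default_gateway(route_table_text):
    """Return the IPv4 default gateway from /proc/net/route text, or None."""
    for line in route_table_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        _iface, destination, gateway, flags = fields[:4]
        # The default route has destination 0.0.0.0 and the gateway flag set.
        if destination == "00000000" and int(flags, 16) & RTF_GATEWAY:
            # The address is stored as little-endian hex on a little-endian host.
            return socket.inet_ntoa(struct.pack("<L", int(gateway, 16)))
    return None
```

On the Pi you would feed it the live table with `default_gateway(open("/proc/net/route").read())` and ping whatever it returns, treating `None` as "no default route, link is down".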
Note: Any requirement to use a crystal ball or mind reading will result in me ignoring your question.

Criticising any questions is banned on this forum.

Any DMs sent on Twitter will be answered next month.
All non-medical doctors are on my foes list.

DougieLawson
Posts: 38885
Joined: Sun Jun 16, 2013 11:19 pm
Location: A small cave in deepest darkest Basingstoke, UK
Contact: Website Twitter

Re: Reliability in production

Tue Feb 25, 2014 12:39 pm

I found it was easier to parse the /proc/net/route table using the ip route list command.

Code:

    my $status = `/sbin/ifconfig wlan0`;
    my $down   = 0;
    # Pull the default gateway out of the routing table.
    my $routes = `/sbin/ip route list`;
    my ($gw)   = $routes =~ /^default via (\d+\.\d+\.\d+\.\d+)/m;
    if($status =~ /inet addr:/){
            #check if we can see the default gateway
            my $ping = `ping -c 1 $gw`;
            # ... rest as in simonmcc's script

simonmcc
Posts: 181
Joined: Mon Aug 19, 2013 10:07 pm

Re: Reliability in production

Tue Feb 25, 2014 1:18 pm

DougieLawson wrote: I found it was easier to parse the /proc/net/route table using the ip route list command.

Thanks Dougie, I have updated my post with code similar to this, it's a good addition.

Raspberry Paul
Posts: 85
Joined: Mon Jun 10, 2013 3:40 pm
Contact: Website

Re: Reliability in production

Tue Feb 25, 2014 7:03 pm

In my experience the code can be developed to trap errors and prevent issues. It's taken me over a year to harden my home-built weather system. My most frequent problem is the wifi dropping.

A simple bash script for checking the Pi has a network connection.

Code:

#!/bin/bash

while true ; do
   if ifconfig wlan0 | grep -q "inet addr:" ; then
      echo "                                         n ok"
      sleep 60
   else
      echo "                                         Network down!"
      ifup --force wlan0
      echo "                                         Sleeping"
      sleep 20
   fi
done
The indentation of the output was to separate it from my AirPi information
http://www.raspberrypaul.co.uk

onenazlyhelmy
Posts: 1
Joined: Wed Sep 07, 2016 10:20 am
Location: ddddddddddd

Re: Reliability in production

Wed Sep 07, 2016 10:32 am

I've run a Java program 24/7 that triggers events every second and every minute, and it has been completely stable for almost 2 years without a shutdown.
For me the Raspberry Pi is extremely stable.

Kurt in Space
Posts: 5
Joined: Sun May 29, 2016 2:40 pm
Location: United States
Contact: Website

Re: Reliability in production

Wed Sep 07, 2016 2:01 pm

I agree with simonmcc. Before you put any computer into a production environment, you must sit back and think about anything that might cause the Pi to generate/respond to a fault. Especially the unplanned ones.

Is your power supply stable and reliable? Are any of the components in the power supply at risk for failure? I've had supplies that I thought were solid until I put an oscilloscope on them, then found a spike getting through to the output that was at 2 times the output voltage. This eventually caused the computer to randomly fail. A $0.45 tantalum cap on the power supply input fixed that.

Are there any unterminated pins on your Pi, i.e. on the GPIO bus? All unused pins should be set to output and set at a low state. Any pins left floating at an input state are just asking for trouble.

Are there any default routines running that are not part of your project? Some cron function not needed? If it's not needed, disable it. Better yet, delete it. The default Raspbian package has a lot of files that won't be needed for your project. Delete them.

In addition, set up routines in your software to log changes. It's easy to write a line to /var/log/messages any time something changes state, or if something unexpected happens. Great for troubleshooting.
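
As a small illustration of that kind of logging, here is a Python 3 sketch that sends a line to syslog only when a value actually changes. The names `log_if_changed` and `_last_seen` are made up for the example, and whether the line ends up in /var/log/messages or another file depends on how syslog is configured on your system:

```python
import syslog

_last_seen = {}  # most recent value logged for each name


def log_if_changed(name, value, emit=syslog.syslog):
    """Log 'name: old -> new' when value differs from the last one seen.

    Returns the logged message, or None if nothing changed.
    """
    previous = _last_seen.get(name)
    if name in _last_seen and previous == value:
        return None
    _last_seen[name] = value
    message = "%s: %r -> %r" % (name, previous, value)
    emit(message)  # by default this goes to the system log
    return message
```

Calling `log_if_changed("boiler_relay", "on")` from the control loop keeps the log down to real transitions, which also helps limit writes to the SD card.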

Kurt

B.Goode
Posts: 10197
Joined: Mon Sep 01, 2014 4:03 pm
Location: UK

Re: Reliability in production

Wed Sep 07, 2016 2:14 pm

Of course, good information is never out of date... but it is 55 months since the question was originally asked, and @YoungJules hasn't accessed the forums for nearly a year.

CarlRJ
Posts: 598
Joined: Thu Feb 20, 2014 4:00 am
Location: San Diego, California

Re: Reliability in production

Wed Sep 07, 2016 3:08 pm

B.Goode wrote: Of course, good information is never out of date... but it is 55 months since the question was originally asked, and @YoungJules hasn't accessed the forums for nearly a year.
I noticed that too. Thread necromancy is a dark art that should not be practiced lightly. Comments ages after the fact resurfacing old threads cause more confusion than they are usually worth.

The whole thread started out with a completely ridiculous statement anyway: "The requirement I have is that the systems have to be 100% rock-solid reliable." Yet the poster was looking to spend $35 on a computer made for the education market, instead of the millions of dollars needed to get anywhere near close to his core demand. The computers on interplanetary space probes are not 100% reliable. He didn't say "goal"; he made it a requirement.

It'd be nice if the forum software could be tweaked to examine a thread and, if there's a year-long gap between very recent posts and earlier ones, put a really giant red-lettered banner at the top and bottom of the thread pointing out that it is a very old thread that has been possibly unwisely resurfaced, and to please let it die with dignity (of course, that would have applied to what I am posting now, sigh). Much better to make an entirely new thread, quote the relevant part of a post from the old thread, and link to the old thread.

YoungJules
Posts: 14
Joined: Thu Jan 26, 2012 12:13 pm

Re: Reliability in production

Sat Apr 01, 2017 4:31 pm

Yes, it's true, I've been away for a while.

Much has happened in the meantime, including several new Pis, the Pi Compute Module(s), and the Zero(es). I still hold that the original premise of my post was not ridiculous; people were free to answer "if you need 100% reliability, then the Pi is not the solution".
These days, it should be even more possible to get pretty good reliability with a UPS and three Pi's in a cluster... and my system providing thermostatic control of a heating system has benefited from months/years of gradual improvement and bug-fixing. :shock:
Sorry for again waking up a zombie thread... perhaps as suggested they should get marked archived as much of the advice may no longer be relevant (as I found trying to figure out how to get my 1-wire thermometer working on a new Raspbian install).

I still really appreciate the forum and already feel sorry now for having been away for so long :-)
