Heater
Posts: 15949
Joined: Tue Jul 17, 2012 3:02 pm

Using the watchdog.

Tue Jun 20, 2017 2:09 am

I thought I could quickly set up the watchdog so that my node.js program could initialize it and then ping it. No such luck for me.

Firstly the instructions all over the net are a) mostly out of date and b) often huge, complex and incomprehensible.

Secondly, although there is a node.js module for working with the watchdog (node-pi-watchdog), it does not install on Jessie Lite with node version 8.1.2. That module seems not to have been maintained in three years.

So how can I get the watchdog working?

Preferably I don't want systemd involved. Unless it's a setup that can be explained in 20 lines.

Presumably I can just write some simple code to talk to the watchdog device and get it going. But what do I need to do in my code?
Memory in C++ is a leaky abstraction .

epoch1970
Posts: 5131
Joined: Thu May 05, 2016 9:33 am
Location: Paris, France

Re: Using the watchdog.

Tue Jun 20, 2017 9:19 pm

What I do normally is install the linux watchdog package (apt-get install watchdog). And my programs interact with it.
For direct access, I don't know. But reading the source for watchdog could be a start?
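For direct access, here is a minimal sketch in Python of how the Linux watchdog character device is generally driven: opening /dev/watchdog arms the timer, any write counts as a pat, and on drivers that support "magic close", writing the character V just before closing asks the driver to disarm. The function name pat_watchdog and all its parameters are made up for illustration, and this is not taken from the watchdog package. Only one process can hold the device open, so the open would presumably fail while the watchdog daemon is running. The same open/write/close sequence should carry over to node.js file calls.

```python
import os
import time


def pat_watchdog(device="/dev/watchdog", interval_s=5.0, beats=3,
                 still_healthy=lambda: True):
    """Arm the watchdog, pat it up to `beats` times, then disarm it.

    All names and defaults here are illustrative assumptions.
    Returns the number of pats actually written.
    """
    # Opening the device starts the countdown.
    fd = os.open(device, os.O_WRONLY)
    pats = 0
    try:
        for _ in range(beats):
            if not still_healthy():
                # Stop patting and close WITHOUT the magic character:
                # the timer stays armed and the hardware resets the board.
                return pats
            os.write(fd, b"\0")  # any write is a keepalive
            pats += 1
            time.sleep(interval_s)
        # Magic close: ask the driver to disarm before a clean exit.
        os.write(fd, b"V")
        return pats
    finally:
        os.close(fd)
```

A real application would loop for as long as it runs rather than a fixed number of beats, and would need to run as root. The 5-second default is only an assumption chosen to stay well inside the 15-second limit discussed further down the thread.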
"S'il n'y a pas de solution, c'est qu'il n'y a pas de problème." Les Shadoks, J. Rouxel

Heater

Re: Using the watchdog.

Thu Jun 22, 2017 11:55 pm

epoch1970,

OK, I'd be happy with interacting with the watchdog daemon, which in turn does its thing with the actual watchdog driver and hardware.

So far, I have the watchdog package installed and watchdogd is now running.

After adding the lines:


[Install]
WantedBy=multi-user.target

to /lib/systemd/system/watchdog.service and enabling the watchdog with the command:


$ sudo systemctl enable watchdog
I now also have a "watchdog" process running as well as "watchdogd."

I have dtparam=watchdog=on in /boot/config.txt.

However I see no bcm..whatever watchdog module loaded with lsmod. And I see no sign of any such module in /lib/modules which seems very odd.

How can I continue with getting this watchdog to work?

epoch1970

Re: Using the watchdog.

Fri Jun 23, 2017 12:22 am

This is on a current machine running watchdog. I just installed the watchdog package and didn't care about config.txt


admin@berck:~ $ ps xawu | grep dog
root        36  0.0  0.0      0     0 ?        S<   Jun06   0:00 [watchdogd]
root       242  0.0  0.0      0     0 ?        S    Jun06   2:12 [brcmf_wdog/mmc1]
root      8039  0.0  0.1   1888  1760 ?        SLs  Jun06   1:29 /usr/sbin/watchdog -v
root     10344  0.0  0.3   5456  3384 ?        Ss   Jun06  17:06 /bin/bash /usr/local/bin/dom2_watchdog/wdtest.sh
admin    32118  0.0  0.2   4280  2016 pts/0    S+   01:58   0:00 grep --color=auto dog

admin@berck:~ $ sudo systemctl stop watchdog 
admin@berck:~ $ sudo systemctl start watchdog 
admin@berck:~ $ grep watchdog /var/log/syslog
...
Jun 22 21:45:51 berck watchdog[8039]: still alive after 138240 interval(s)
Jun 23 02:04:06 berck systemd[1]: Stopping watchdog daemon...
Jun 23 02:04:06 berck watchdog[8039]: stopping daemon (5.14)
Jun 23 02:04:11 berck systemd[1]: watchdog.service: control process exited, code=exited status=1
Jun 23 02:04:11 berck systemd[1]: Stopped watchdog daemon.
Jun 23 02:04:11 berck systemd[1]: Unit watchdog.service entered failed state.
Jun 23 02:04:11 berck systemd[1]: Triggering OnFailure= dependencies of watchdog.service.
Jun 23 02:04:11 berck systemd[1]: Starting watchdog keepalive daemon...
Jun 23 02:04:12 berck wd_keepalive[32361]: starting watchdog keepalive daemon (5.14):
Jun 23 02:04:12 berck wd_keepalive[32361]:  int=10 alive=/dev/watchdog realtime=yes
Jun 23 02:04:12 berck wd_keepalive[32361]: watchdog now set to 15 seconds
Jun 23 02:04:12 berck wd_keepalive[32361]: hardware watchdog identity: Broadcom BCM2835 Watchdog timer
Jun 23 02:04:12 berck systemd[1]: Started watchdog keepalive daemon.
Jun 23 02:04:16 berck systemd[1]: Stopping watchdog keepalive daemon...
Jun 23 02:04:16 berck wd_keepalive[32361]: stopping watchdog keepalive daemon (5.14)
Jun 23 02:04:17 berck systemd[1]: Stopped watchdog keepalive daemon.
Jun 23 02:04:17 berck systemd[1]: Starting watchdog daemon...
Jun 23 02:04:17 berck watchdog[32410]: starting daemon (5.14):
Jun 23 02:04:17 berck watchdog[32410]: int=10s realtime=yes sync=no soft=no mla=24 mem=5
Jun 23 02:04:17 berck watchdog[32410]: ping: no machine to check
Jun 23 02:04:17 berck watchdog[32410]: file: /run/wdtest:180
Jun 23 02:04:17 berck watchdog[32410]: file: /run/wd-ping:1800
Jun 23 02:04:17 berck watchdog[32410]: file: /run/wd-ovpn:600
Jun 23 02:04:17 berck watchdog[32410]: file: /run/wd-ntp:3600
Jun 23 02:04:17 berck watchdog[32410]: file: /run/wd-wap:1200
Jun 23 02:04:17 berck watchdog[32410]: pidfile: no server process to check
Jun 23 02:04:17 berck watchdog[32410]: interface: no interface to check
Jun 23 02:04:17 berck watchdog[32410]: temperature: maximum = 75
Jun 23 02:04:17 berck watchdog[32410]: temperature: /sys/class/thermal/thermal_zone0/temp
Jun 23 02:04:17 berck watchdog[32410]: test=none(60) repair=/usr/local/bin/dom2_watchdog/wdrepair.sh(120) alive=/dev/watchdog heartbeat=none to=root no_act=no force=no
Jun 23 02:04:17 berck watchdog[32410]: watchdog now set to 15 seconds
Jun 23 02:04:17 berck watchdog[32410]: hardware watchdog identity: Broadcom BCM2835 Watchdog timer
Jun 23 02:04:17 berck systemd[1]: Started watchdog daemon.

admin@berck:~ $ cat /etc/default/watchdog 
# Start watchdog at boot time? 0 or 1
run_watchdog=1
# Start wd_keepalive after stopping watchdog? 0 or 1
run_wd_keepalive=1
# Load module before starting watchdog
watchdog_module="bcm2835_wdt"
# Specify additional watchdog options here (see manpage).
# -q is no-act. for debug
watchdog_options=" -v"

If you run lsmod you probably won't see bcm2835_wdt: it is built into the Raspbian kernel (so the module loading above is useless).


admin@berck:~ $ grep wdt /lib/modules/4.9.24-v7+/modules.builtin 
kernel/drivers/watchdog/bcm2835_wdt.ko

I don't think watchdog does much out of the box if you haven't configured /etc/default/watchdog and /etc/watchdog.conf.
The beef is in configuring /etc/watchdog.conf. I posted about this a while ago.
Summary: The dog on the Pi 3 is special in that it must be patted at least every 15 seconds, so the software watchdog program should run every 10 seconds, with RT priority, to be safe. That's fine in itself (no particular load), but IMHO you can't use a "test" program or the built-in ping tests, as they would run every 10 seconds.
But you can have a lazy program of yours touch a flag, and have watchdog frantically check every 10 seconds whether the flag has gone unchanged for the last 30 minutes. If so, it calls a "repair" program, and you need this one because the Pi has no RTC, so after a reboot watchdog can throw false alarms. Have your programs (status loop, repair) use uptime and you'll be able to detect false alarms.
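
A rough sketch of the application side of this scheme, in Python: the program periodically rewrites a flag file with the current uptime, which should also refresh the modification time that the watchdog's "file" test looks at. The function names and the path /run/myapp-alive are hypothetical; /proc/uptime is the usual Linux source for the uptime value.

```python
def read_uptime(proc_uptime="/proc/uptime"):
    """Return seconds since boot (first field of /proc/uptime)."""
    with open(proc_uptime) as f:
        return float(f.read().split()[0])


def write_heartbeat(flag_path, uptime_s):
    """Rewrite the flag file with the given uptime, in whole seconds.

    Rewriting the file updates its modification time, and the stored
    value lets a repair program tell a real stall from a clock jump.
    """
    with open(flag_path, "w") as f:
        f.write(f"{uptime_s:.0f}\n")


# Hypothetical use inside the application's main loop, only when the
# application considers itself healthy:
#     write_heartbeat("/run/myapp-alive", read_uptime())
```

Storing uptime rather than wall-clock time matters here because, as noted above, the Pi has no RTC and its system time can jump after boot.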

HTH

Heater

Re: Using the watchdog.

Fri Jun 23, 2017 1:01 am

Wow, thanks for all that epoch1970. Since I posted last I managed to work a lot of that out already.

Then I made a fatal mistake. I uncommented one of the ping test lines in /etc/watchdog.conf and rebooted.

Sure enough, I'm now stuck in a boot loop because that ping fails. My stupid fault, I was in too much of a hurry to see if the watchdog worked. Well, it does!

Assuming I get myself out of this boot loop, how can I pat the dog from my application?
Last edited by Heater on Fri Jun 23, 2017 11:14 pm, edited 1 time in total.

Heater

Re: Using the watchdog.

Fri Jun 23, 2017 1:10 am

Oh crap. This is horrible.

So I let the Pi run around in a boot loop for a few minutes as the watchdog keeps rebooting it.

Sure enough eventually it fails to reboot. Totally wedged.

After a power cycle it was back to the boot loop. Again after a few minutes, totally wedged.

How do we make anything reliable out of this?

Heater

Re: Using the watchdog.

Fri Jun 23, 2017 2:04 am

OK, with the help of Paragon's ext driver for Windows http://www.paragon-drivers.com/extfs-windows/ on my Win 10 machine I disabled the ping test from /etc/watchdog.conf and the Pi 3 boots up and stays booted up again. Hurrah!

epoch1970

Re: Using the watchdog.

Fri Jun 23, 2017 8:17 am

Heater wrote: Sure enough eventually it fails to reboot. Totally wedged.
I don't think I've ever seen it fail to reboot, under Raspbian.
Note that if you set "-q" for no-act, it does *not* run the "repair" script and does not reboot.

Yes, pinging at this frequency is sure to trigger a reboot... On regular desktop or server platforms the watchdog device can be set to a timeout of one or several minutes, so the watchdog program can drive the test loop directly and the "ping" directive makes sense. On the Pi, you really need a separate test daemon that runs at a reasonable frequency. Your link with the watchdog program could be flags in /run or /tmp.

This is part of watchdog.conf on the same machine as in my post yesterday evening:


# Our special keepalive files
# Our test daemon - Restart it if gone for 3 minutes
file = /run/wdtest
change = 180
# Pinging the LAN targets, eg 192.168.1.1
file = /run/wd-ping
# How often should the file change? Unit is seconds
# We don't want to reboot more than every 30 minutes
change = 1800
# Pinging the VPN tunnel endpoint. Repair does not reboot, only restart openvpn - After 10 mins out.
file = /run/wd-ovpn
change = 600
# Testing for NTPd: is it running? We try to resync clock and restart server - After 60 min out
file = /run/wd-ntp
change = 3600
# Testing for wifi AP: is it running? We try to restart server - After 20 mins out
file = /run/wd-wap
change = 1200

# Watched files throw false alarms when system time changes.
# Our repair binary takes care of that.
repair-binary		= /usr/local/bin/dom2_watchdog/wdrepair.sh
repair-timeout		= 120
#test-binary		= # We don't use test-binary as it would loop too fast.
test-timeout		= 60

watchdog-device	= /dev/watchdog
watchdog-timeout= 15
interval		= 10
logtick                = 2880
realtime		= yes
priority		= 1

The premise is that the watchdog program never fails to run and is never killed by the OS. Running it realtime is important (especially with a 10-second interval).
The machine starts watchdog and a separate daemon called "wdtest" at boot. Wdtest watches processes and regularly refreshes a few keepalive files; I have it running its test loop every minute. It has a keepalive file for itself (/run/wdtest), so that watchdog can call the repair program in case the test daemon was killed.

Let's imagine the ntpd process goes poof (not unlikely): wdtest stops doing "cat $UPTIME > /run/wd-ntp" regularly. The file ages. Once the file is 3600 seconds old, watchdog calls "/usr/local/bin/dom2_watchdog/wdrepair.sh 250 /run/wd-ntp". (250 is the watchdog error code for file out-of-date. On older systems you might receive code -6 instead.)
The wdrepair script gets the current uptime and checks whether the uptime stored in /run/wd-ntp is about 1 hour in the past:
- If less than that, it means watchdog was triggered by a system date change (by NTP probably). It touches the flag file to avoid being called again in 10 seconds for the same, false, reason by watchdog. Then it exits 0. With exit 0, watchdog is happy with the repair and will not reboot.
- If the uptime delta is indeed 1hr, it means the test loop decided many times against updating /run/wd-ntp because it couldn't find the process. It's dead and cold. So the repair script stops the ntp service (to please systemd), runs htpdate (to step the clock if needed), restarts the service, and then exits 0.

If the repair program chooses to exit 1 (or anything other than 0, in fact), watchdog reboots immediately. If the repair program fails to return within 120 seconds, watchdog reboots the machine.
In my repair script, the only condition that triggers reboot (it runs "sync" and exits 1) is if wdtest was unable to ping the LAN gateway for half an hour consecutively. Wdtest runs every minute and it runs the LAN ping test every time it loops. So we only reboot if pinging the gateway has failed 30 times in a row...
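
The decision described above can be sketched as follows, in Python rather than the shell of the actual wdrepair.sh, which is not shown in this thread. The function names, the slack_s margin and the return strings are all hypothetical; only the exit-code convention (0 means "repaired, carry on", anything else means "reboot") comes from the description above.

```python
def classify_alarm(stored_uptime_s, current_uptime_s, change_s, slack_s=5.0):
    """Decide whether a 'file out of date' alarm is real.

    stored_uptime_s   uptime the test loop last wrote into the flag file
    current_uptime_s  uptime at the moment the repair program runs
    change_s          the 'change' value configured for that file
    slack_s           assumed margin for rounding and scheduling jitter
    """
    age_s = current_uptime_s - stored_uptime_s
    if age_s + slack_s >= change_s:
        return "stale"        # the flag really was not refreshed in time
    return "false_alarm"      # system time moved; the flag is younger than it looks


def repair_exit_code(verdict, repaired):
    """Map the outcome to the repair program's exit code."""
    if verdict == "false_alarm":
        return 0              # after touching the flag, tell watchdog all is well
    return 0 if repaired else 1   # non-zero makes watchdog reboot
```

For the ntpd example, a flag holding uptime 100 checked at uptime 3700 with change = 3600 would be classed as stale, while a flag holding 3000 would be a false alarm.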
Last edited by epoch1970 on Fri Jun 23, 2017 10:49 am, edited 1 time in total.

Heater

Re: Using the watchdog.

Fri Jun 23, 2017 9:00 am

epoch1970,

Thank you. I'm a bit too tired to comprehend all that you have written there. I will try again in the morning.

But, good grief, can it be so hard?

If my program does not pat the dog in some time, I want the machine to be rebooted. Immediately. As surely as if I had shut the power and turned it back on again.

You know, a watch dog.

Heater

Re: Using the watchdog.

Fri Jun 23, 2017 11:36 pm

Me: "How do we make anything reliable out of this?"
As it turns out, it is almost certain that my reboot failures were down to the age-old crappy USB power supply problem. I now have it hooked to a supply that can deliver four amps and is wound up to 5.4 volts. At the other end of the USB cable it is only about 5.1 volts. It's been rebooting itself happily for about an hour now.

I have yet to comprehend all that you have written above, epoch1970. As far as I can tell there is no way to "pat the dog" when using the watchdog daemon. Rather the watchdog pats things and checks they are still operational. So here is my plan:

Create a test binary that the watchdog will call, whenever it does.

That test binary will send a UDP packet to my application, effectively a request for status.

The application will reply with a UDP packet containing its status. Basically OK or not, true or false, zero or one.

The test binary will return 0 or 1 to the watchdog depending on the returned status. Of course the test binary will return non-zero if there is no reply.

Does that sound like a reasonable approach?

I'd like to avoid using any flags in files for the watchdog.
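
A minimal sketch of the test binary in that plan, written in Python for illustration: it sends one UDP datagram to the application, waits briefly for a reply, and reports 0 or 1. The port number 41234 and the message strings STATUS? and OK are made-up placeholders, not part of any existing protocol, and the application would need a matching UDP listener.

```python
import socket


def query_status(host="127.0.0.1", port=41234, timeout_s=2.0):
    """Return 0 if the application answers b"OK", otherwise 1.

    Host, port and message format are assumptions for this sketch.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        try:
            sock.sendto(b"STATUS?", (host, port))
            reply, _addr = sock.recvfrom(64)
        except OSError:
            # A timeout or any other socket error counts as "not OK".
            return 1
    return 0 if reply.strip() == b"OK" else 1


# As a test program for the watchdog daemon, the script would end with:
#     import sys
#     sys.exit(query_status())
```

The timeout should stay well below the interval at which the watchdog daemon runs its checks, so that a silent application produces a prompt non-zero exit rather than a hung test.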

epoch1970

Re: Using the watchdog.

Sat Jun 24, 2017 10:21 am

So here is my plan:
Create a test binary that the watchdog will call, whenever it does
...
And whenever is "every 10 seconds or less if you want to be safe and have set the hardware watchdog to its max timeout value of 15 seconds."
So every 10 seconds, watchdog will fork your bit of shell script that queries your server, which should swiftly exit 0 or 1 according to what you want the watchdog program to do next.
It makes perfect sense, I've had setups working like this. Only in my case the hardware watchdog was set to 90 secs. and the software watchdog was calling the test program every 60 secs.

Your plan will probably work, and load might well be minuscule. But I still prefer, on the Pi, using the built-in "file" test in conjunction with a repair program.
Yes, the system time/uptime confusion creates potentially false alarms. But otherwise it is extremely simple and the "file" is in RAM, so it is reliable.

I'm glad you're on the right track!
