Heater wrote:Sure enough eventually it fails to reboot. Totally wedged.
I don't think I've ever seen it fail to reboot, under Raspbian.
Note that if you set "-q" for no-act, it does *not* run the "repair" script and does not reboot.
Yes, pinging at this frequency is sure to trigger a reboot... On regular desktop or server platforms the watchdog device can be set to a timeout of one or several minutes, so the watchdog program can drive the test loop directly and the "ping" directive makes sense. On the Pi, you really need a separate test daemon that runs at a reasonable frequency. Your link with the watchdog program could be flag files in /run or /tmp.
This is part of watchdog.conf in the same machine as my post yesterday evening:
# Our special keepalive files
# Our test daemon - Restart it if gone for 3 minutes
file = /run/wdtest
change = 180
# Pinging the LAN targets, eg 192.168.1.1
file = /run/wd-ping
# How often should the file change? Unit is seconds
# We don't want to reboot more than every 30 minutes
change = 1800
# Pinging the VPN tunnel endpoint. Repair does not reboot, only restart openvpn - After 10 mins out.
file = /run/wd-ovpn
change = 600
# Testing for NTPd: is it running? We try to resync clock and restart server - After 60 min out
file = /run/wd-ntp
change = 3600
# Testing for wifi AP: is it running? We try to restart server - After 20 mins out
file = /run/wd-wap
change = 1200
# Watched files throw false alarms when system time changes.
# Our repair binary takes care of that.
repair-binary = /usr/local/bin/dom2_watchdog/wdrepair.sh
repair-timeout = 120
#test-binary = # We don't use test-binary as it would loop too fast.
test-timeout = 60
watchdog-device = /dev/watchdog
watchdog-timeout = 15
interval = 10
logtick = 2880
realtime = yes
priority = 1
The premise is that the watchdog program never fails to run and is never killed by the OS. Running it realtime is important (esp. with a 10-second interval).
The machine starts watchdog and a separate daemon called "wdtest" at boot. Wdtest watches processes and regularly refreshes a few keepalive files. I have it running its test loop every minute. It has a keepalive file for itself (/run/wdtest), so that watchdog can call the repair program if the test daemon gets killed.
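To make the idea concrete, here is a rough sketch of what one pass of such a test loop could look like. This is not my actual wdtest: the function names, the KEEPALIVE_DIR variable and the use of pgrep to look for ntpd are all assumptions made up for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of one pass of a "wdtest"-style loop (not the real script).
# KEEPALIVE_DIR defaults to /run as in the post; override it when experimenting.
KEEPALIVE_DIR="${KEEPALIVE_DIR:-/run}"

# Current uptime in whole seconds, read from /proc/uptime (Linux only).
uptime_secs() {
    cut -d. -f1 /proc/uptime
}

# Store the current uptime in a keepalive file; writing also updates its mtime,
# which is what watchdog's "file"/"change" test looks at.
refresh_keepalive() {
    uptime_secs > "$KEEPALIVE_DIR/$1"
}

# Refresh a keepalive file only if the given check command succeeds.
# Usage: check_and_refresh <keepalive-name> <command> [args...]
check_and_refresh() {
    name="$1"; shift
    if "$@" >/dev/null 2>&1; then
        refresh_keepalive "$name"
    fi
}

# One pass: the daemon's own keepalive first, then one example service check.
run_once() {
    refresh_keepalive wdtest
    check_and_refresh wd-ntp pgrep -x ntpd
}

# A real daemon would repeat this once a minute, for example:
#   while true; do run_once; sleep 60; done
```

The point of the design is that a failed check does nothing at all: the file simply stops being refreshed, and watchdog notices on its own once the "change" interval has passed.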
Let's imagine the ntpd process goes poof (not unlikely): wdtest stops doing "cat $UPTIME > /run/wd-ntp" regularly. The file ages. Once the file is 3600 seconds old, watchdog calls "/usr/local/bin/dom2_watchdog/wdrepair.sh 250 /run/wd-ntp". (250 is the watchdog error code for file out-of-date. On older systems you might receive code -6 instead.)
The wdrepair script gets the current uptime and checks whether the uptime stored in /run/wd-ntp is about 1 hour in the past:
- If less than that, watchdog was triggered by a system date change (probably by NTP). The script touches the flag file so that watchdog does not call it again 10 seconds later for the same false reason, then exits 0. With exit 0, watchdog is happy with the repair and will not reboot.
- If the uptime delta is indeed about 1 hour, the test loop has repeatedly declined to update /run/wd-ntp because it couldn't find the process. It's dead and cold. So the repair script stops the ntp service (to please systemd), runs htpdate (to step the clock if needed), restarts the service, and then exits 0.
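The uptime comparison above can be sketched roughly as follows. The function name and its output strings are invented for this example, and a real script would probably allow a small tolerance around the limit, since "about 1 hour" is not exact.

```shell
# Hypothetical helper: decide whether a stale-file alarm is genuine.
# Prints "stale" if the stored uptime is at least <limit> seconds behind the
# current uptime, otherwise "false-alarm" (e.g. the system clock was stepped).
# Usage: classify_alarm <now-uptime> <stored-uptime> <limit-seconds>
classify_alarm() {
    delta=$(( $1 - $2 ))
    if [ "$delta" -ge "$3" ]; then
        echo stale
    else
        echo false-alarm
    fi
}

# Possible use inside a repair script, where $2 is the file watchdog passed:
#   now=$(cut -d. -f1 /proc/uptime)
#   stored=$(cut -d. -f1 "$2")
#   classify_alarm "$now" "$stored" 3600
```

Comparing uptimes rather than file timestamps is what makes this immune to clock changes: uptime keeps counting steadily even when the system date jumps.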
If the repair program chooses to exit 1 (or indeed anything other than 0), watchdog reboots immediately. If the repair program fails to return within 120 seconds, watchdog also reboots the machine.
In my repair script, the only condition that triggers reboot (it runs "sync" and exits 1) is if wdtest was unable to ping the LAN gateway for half an hour consecutively. Wdtest runs every minute and it runs the LAN ping test every time it loops. So we only reboot if pinging the gateway has failed 30 times in a row...
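Putting it together, the per-file policy could be laid out something like this. It is only a sketch based on the comments in the config above: the action names are made up, and treating an unknown file as "ignore" is an assumption of this example, not something my script necessarily does.

```shell
# Hypothetical mapping from the stale keepalive file to a repair action.
repair_action() {
    case "$1" in
        /run/wd-ping) echo reboot ;;           # LAN gateway unreachable for 30 min
        /run/wd-ovpn) echo restart-openvpn ;;  # VPN endpoint unreachable for 10 min
        /run/wd-ntp)  echo resync-ntp ;;       # ntpd gone for 60 min
        /run/wd-wap)  echo restart-ap ;;       # wifi AP gone for 20 min
        /run/wdtest)  echo restart-wdtest ;;   # test daemon gone for 3 min
        *)            echo ignore ;;           # assumption: unknown files are ignored
    esac
}

# Translate an action into the exit status watchdog acts on:
# non-zero makes watchdog reboot, 0 tells it the repair succeeded.
action_exit_code() {
    if [ "$1" = reboot ]; then echo 1; else echo 0; fi
}

# In a repair script called as "wdrepair.sh 250 /run/wd-ping":
#   action=$(repair_action "$2")
#   [ "$action" = reboot ] && sync
#   exit "$(action_exit_code "$action")"
```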