Inspired by another thread here I decided to experiment with the watchdog. Luckily not on my datacenter Pi yet.
I read the docs and installed the package watchdog.
It has 2 config files: /etc/default/watchdog and /etc/watchdog.conf
In /etc/default/watchdog I changed watchdog_module to "bcm2708_wdog".
In /etc/watchdog.conf I first enabled lots of checks, but when that did not work I started with only:
interface = eth0
max-load-1 = 24
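To rule out the load check as the trigger, the current 1-minute load average can be compared with the configured limit. This is only a rough sketch: the function names are made up for illustration, and it assumes the daemon treats a load strictly above max-load-1 as a failure, which I have not verified.

```python
import os


def load_exceeds(load1, max_load_1=24):
    """Return True if the 1-minute load is above the limit.

    Assumption: the watchdog daemon uses a strict "greater than"
    comparison against max-load-1. This mirrors the setting only
    for illustration.
    """
    return load1 > max_load_1


def current_load1():
    """Return the current 1-minute load average (Unix only)."""
    return os.getloadavg()[0]
```

Calling load_exceeds(current_load1()) on the Pi while it is idle should return False; if it ever returns True, the load limit rather than the network check would be the suspect.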
My problem: whenever I start the watchdog process, the system is either rebooted quickly or at most after a minute or two.
One time I even managed to get it into a reboot loop which could only be fixed by putting the flash card in a PC and destroying the configuration (renaming /etc/default/watchdog).
Something is going wrong but it is unclear to me what it is.
I tried strace of the process and the first thing I noticed is that the "interval" setting is not in seconds (as you would assume) but in half-seconds. I set it to 20 to get a 10-second check interval.
The default is 1, so a 0.5-second check interval. That is far too short, as I cannot guarantee there will be network traffic in every 0.5-second interval. However, 10 seconds should be OK. (I checked a tshark -p trace and there is regular ARP and broadcast traffic.)
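To test whether eth0 really sees traffic in every interval, the receive counter in /proc/net/dev can be sampled twice and compared. This is a minimal sketch that assumes the interface check boils down to "did the received byte counter change since the last check", which is my guess and not something the documentation states. The function names are hypothetical.

```python
import time


def rx_bytes(proc_net_dev_text, iface):
    """Extract the received-bytes counter for iface from /proc/net/dev text."""
    for line in proc_net_dev_text.splitlines():
        # The two header lines contain no colon; interface lines do.
        if ":" not in line:
            continue
        name, counters = line.split(":", 1)
        if name.strip() == iface:
            # First column after the interface name is received bytes.
            return int(counters.split()[0])
    raise KeyError(iface)


def saw_traffic(iface, seconds, path="/proc/net/dev"):
    """Return True if the receive counter of iface changed within `seconds`."""
    with open(path) as f:
        before = rx_bytes(f.read(), iface)
    time.sleep(seconds)
    with open(path) as f:
        after = rx_bytes(f.read(), iface)
    return after != before
```

Running saw_traffic("eth0", 10) repeatedly on the Pi shows whether any 10-second window passes without incoming data. If some do, a traffic-based check at that interval would be expected to fire.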
The documentation of the program is very lacking. It has the usual "obvious" comments for each setting, like:
interval = Set the interval between two writes to the watchdog device. The kernel drivers expects a write command every minute. Otherwise the system will be rebooted. Default value is 1 second. An interval of more than a minute can only be used with the -f command-line option.
It does not even mention the units of the interval... it suggests seconds, but as far as I can tell from strace that is not true: it is half-seconds.
interface = Set interface name for network mode. This option can be used more than once to check different interfaces.
Not a word about what is really happening in "network mode".
Before I get into trouble again, and worse, before I get problems with my colocated Pi (for which I would have to have them return the flash card to me, leaving it down for a week): does anyone have experience with this beast and know how to make it behave correctly?
It of course should reboot/reset the Pi only when things are really wrong, but it seems a bit too eager.