Rebooting radio in Santa Cruz mountains once a week via GitHub script?

Jeff Liebermann

On Wed, 15 Nov 2017 07:22:29 -0000 (UTC), Blake Snyder wrote:

On Tue, 14 Nov 2017 21:17:02 -0800, Jeff Liebermann wrote:

The problem is that I don't trust software solutions to fix a problem
caused by an operating system or application that has gone insane. The
reset has to be done by something that won't be trashed, hung, or
become too busy, which means an external device or independent
"heartbeat" timer.

Klugey approach?

It's spelled kludge. In the not so distant past, I helped maintain a
series of mountain top weather stations. Service calls were expensive
and best avoided. As an added bonus, this was in an environment full
of RF pollution.

Set a WatchDog Timer reboot.

Sure, if the watchdog timer is independent of what it's monitoring.
Long ago, the Kantronics KPC-2 TNC (packet radio modem) had a built-in
watchdog timer. Too bad it was all software and located in the same
chip it was monitoring. When the KPC-2 hung, the watchdog timer also
hung. In later models, they simply removed the watchdog timer.
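To make the independence point concrete, here is a minimal sketch (a simulation, not anything the KPC-2 or any real device runs). The class name ExternalWatchdog, the 5 second timeout, and the simulate() helper are all assumptions made up for illustration. The key property is that should_reset() is evaluated by something outside the monitored device, so it keeps working after the device stops sending heartbeats.

```python
class ExternalWatchdog:
    """Models a timer that lives outside the device it monitors (hypothetical)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_kick = 0.0

    def kick(self, now: float) -> None:
        # Called by the monitored device while it is still healthy.
        self.last_kick = now

    def should_reset(self, now: float) -> bool:
        # Evaluated by the external timer, never by the monitored device.
        return (now - self.last_kick) > self.timeout_s


def simulate(hang_at: float, run_until: float, timeout_s: float = 5.0):
    """Return the first time the watchdog demands a reset, or None."""
    dog = ExternalWatchdog(timeout_s)
    t = 0.0
    while t <= run_until:
        if t < hang_at:
            dog.kick(t)  # device is alive and sends a heartbeat
        if dog.should_reset(t):
            return t
        t += 1.0
    return None
```

If the kick and the check both lived inside the hung device, the loop body would simply stop running and nothing would ever call should_reset(), which is the failure described above.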

Roll forward a few years, and I'm maintaining servers in a big server
farm. Remote reboot via ethernet was problematic. It was quite
common to arrive at the ISP and find a message declaring that the OS
refuses to reboot until some obscure process agrees to die gracefully.
The customer got tired of paying me to reboot his servers, and I got
tired of driving 50 miles to flip a switch. So, I installed a paging
receiver and decoder to initiate a reboot. That was quite a challenge
as server farms are full of RF interference.

However, even that didn't quite work. It seems that most servers have
a "feature" called WOL (Wake on LAN) that allows me to remotely power
on the server. In order to do that, it needs to have the power left
on to the LAN card(s) even when the server is turned off. (Note: WOL
is mostly used for desktops, but at the time was also appearing in
servers.) Sometimes, the ethernet card would hang. If I rebooted the
machine, the LAN card would remain hung. If I flipped the power
on/off switch on the server, it still would remain hung. Of course,
with no connectivity, I couldn't do a remote reboot in software.
Compaq later introduced a server management card that provided a
secondary management channel, but it was too expensive. The only good
solution was to pull the plug on the server.

For server farms, I eventually went to SNMP managed remote power
switches. I still have a bunch of APC AP9211 switches in service.
https://www.google.com/search?q=apc+ap9211&tbm=isch
Primary control is via ethernet, but some had a secondary control
channel via the serial port.

I've tried other schemes and solutions. Some worked, but all had a
surprise hidden somewhere.

Specify an IP that won't respond to pings, set up the WatchDog timer to
ping it every 24*60*60 seconds, with a fail count of 7. (or suitable
numbers that the GUI will accept).
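The arithmetic behind that suggestion, as a rough sketch: pinging a dead address once a day with a fail count of 7 means the watchdog trips after about a week. The constant and function names below are made up for illustration, and the exact timing depends on whether a given radio's firmware sends its first ping immediately or after one interval, which is an assumption here.

```python
PING_INTERVAL_S = 24 * 60 * 60  # one ping per day, as suggested above
FAIL_COUNT = 7                  # consecutive failures before the reboot


def seconds_until_reboot(interval_s: int, fail_count: int) -> int:
    # Every ping to the dead address fails, so the reboot fires once
    # fail_count intervals have elapsed.
    return interval_s * fail_count


total = seconds_until_reboot(PING_INTERVAL_S, FAIL_COUNT)
days = total / 86400
print(total, "seconds =", days, "days")
```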

Won't work and may I humbly suggest that you think about this a bit
more. The problem is that in any ISO layer cake device, it is
possible to have the lower layers working, while the upper layers are
hung or stuck in a spin loop, making the lower layers too busy to
respond. I've seen machines that respond quite nicely to ICMP pings
where the main function (email or web server) is totally hung. For
these things, you need to test the higher level services and not rely
on the lower levels.
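As a sketch of what testing the higher level service might look like for a web server, the check below makes a real HTTP request instead of an ICMP ping. The function name http_service_ok, the 5 second default timeout, and the "any status below 400 is healthy" rule are assumptions for illustration; a mail server would need an equivalent SMTP check. The opener parameter exists only so the function can be exercised without a network.

```python
import urllib.error
import urllib.request


def http_service_ok(url: str, timeout_s: float = 5.0,
                    opener=urllib.request.urlopen) -> bool:
    """Return True only if the web server actually answers a request."""
    try:
        with opener(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Refused, timed out, error status, or malformed URL:
        # all are treated as "service down".
        return False
```

A machine that answers pings but has a hung web server would pass the ping test and fail this one, since the request either times out or comes back with an error.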

In this situation, there's one big advantage that MIGHT make such a
simple ping work with a wireless link. All wireless connectivity is
done at layer 2 (MAC layer). The IP layer is only involved in
managing the device. If one pings by IP address, it's a fairly good
assumption that the underlying layers are working. Some services such
as SNMP, SMTP for email fault notification, and the usual internal web
server might be hung, but with layer 2 still working, the wireless
bridge would likely still be doing its job.

Now, back to the original problem. Heartbeat timers and timed reboots
are a kludge. They're needed because the manufacturer of the wireless
bridge radios didn't do a decent job of keeping his hardware up and
running 24x7. The failure might be in software, susceptibility to
power glitches, susceptibility to DoS attacks, crappy components, or
environmental issues (overheating). If the wireless bridge is so
unreliable that it has to be rebooted once per week, I suggest you
look into the cause of this unreliability, and not apply a band-aid.

--
Jeff Liebermann
150 Felker St #D http://www.LearnByDestroying.com
Santa Cruz CA 95060 http://802.11junk.com
Skype: JeffLiebermann AE6KS 831-336-2558