[AR] failure recovery (was Re: Re: Flight Controller Features)

  • From: Henry Spencer <hspencer@xxxxxxxxxxxxx>
  • To: Arocket List <arocket@xxxxxxxxxxxxx>
  • Date: Sun, 17 Jan 2016 23:20:10 -0500 (EST)

On Sat, 16 Jan 2016, Norman Yarvin wrote:

> ...if you really needed a piece of state, that might not be enough; you might have to save two copies of it, each with a checksum, and refresh them alternately.

Which is actually a rather trivial increment in complexity over having just one save area, i.e. you might as well do it if you're saving at all, unless the save memory is a very limited resource.
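
To make the "two copies, each with a checksum, refreshed alternately" idea concrete, here is a minimal sketch in Python (a flight controller would more likely do this in C against real nonvolatile memory). The names TwoSlotStore, pack_slot and unpack_slot are hypothetical, and the sequence number used to tell which copy is newer is an assumption of this sketch, not something spelled out in the thread.

```python
import struct
import zlib


def pack_slot(seq, payload):
    """Serialize one save slot: sequence number, payload, then CRC32 of both."""
    body = struct.pack("<I", seq) + payload
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_slot(raw):
    """Return (seq, payload) if the slot checks out, else None."""
    if raw is None or len(raw) < 8:
        return None  # never written, or write was cut short
    body = raw[:-4]
    (crc,) = struct.unpack("<I", raw[-4:])
    if zlib.crc32(body) != crc:
        return None  # corrupted or half-written
    (seq,) = struct.unpack("<I", body[:4])
    return seq, body[4:]


class TwoSlotStore:
    """Two save areas, refreshed alternately (hypothetical illustration)."""

    def __init__(self):
        # Stand-in for two regions of nonvolatile memory.
        self.slots = [None, None]

    def _newest(self):
        """Index and contents of the newest valid slot, or (None, None)."""
        best_index, best = None, None
        for index, raw in enumerate(self.slots):
            decoded = unpack_slot(raw)
            if decoded is not None and (best is None or decoded[0] > best[0]):
                best_index, best = index, decoded
        return best_index, best

    def save(self, payload):
        # Always overwrite the slot that does NOT hold the newest good copy,
        # so a failure mid-write can only damage the older one.
        index, newest = self._newest()
        if newest is None:
            target, seq = 0, 0
        else:
            target, seq = 1 - index, newest[0] + 1  # wraparound ignored here
        self.slots[target] = pack_slot(seq, payload)

    def load(self):
        """Newest payload that passes its checksum, or None if neither does."""
        _, newest = self._newest()
        return None if newest is None else newest[1]
```

The extra cost over a single save area is the slot selection in save() and the comparison in load(), which is the "trivial increment" in question. If the newer copy fails its checksum, load() quietly falls back to the older one.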

> That level of paranoia, worrying about rare subcases of rare cases, is
> worth it when one is writing an operating system to be used by
> millions of people.  For amateur rockets, not so much: you're not
> going to have literally trillions of tries at hitting the ultra-rare
> behavior.

Assuming, that is, that you have guessed right about that behavior being "ultra-rare". Usually the problem is not that you drew the ace of spades, but that your estimate of the probability was wrong. Apollo program alarms were supposed to be "can't happen" behavior, *never* happening in flight, yet the probability that Apollo 11, as flown, would have them during descent was actually 100%.

The issue is not whether it makes sense to take precautions against some specific bit of supposedly-ultra-rare behavior, but rather, whether it makes sense to take generic precautions against surprises, rather than just having the software throw up its hands and give up when something weird and unexpected happens. Weird nonsense does happen. Surprisingly often, simple attempts to recover and carry on actually work adequately well for most cases, so they are worth a bit of thought and effort.

In fact, generic precautions are what you *want*, not only because the individual types of weird nonsense seem so unlikely, but also because attempts to automatically *diagnose* the exact problem are notorious for making mistakes. (The problem wasn't an expected one, or the symptoms weren't as expected, or the rarely-exercised diagnosis code itself had bugs.) It's not only easier, but also more robust, to just punt all the weird nonsense to generic "standardize state and try to resume" code.
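
As a rough illustration of what generic "standardize state and try to resume" code can look like, here is a minimal Python sketch. Everything in it is hypothetical: run_with_recovery, step, and fresh_state are made-up names, and the restart counter is added only so the behavior can be observed. The point is that the handler makes no attempt to work out *what* went wrong.

```python
def run_with_recovery(readings, step, fresh_state):
    """Run step() over readings; on ANY failure, reset state and carry on.

    step(state, reading) returns the next state and may raise anything.
    fresh_state() returns a known-good starting state.
    """
    state = fresh_state()
    restarts = 0
    for reading in readings:
        try:
            state = step(state, reading)
        except Exception:
            # No diagnosis: whatever happened, discard the possibly damaged
            # state, go back to a standard one, and resume with the next input.
            state = fresh_state()
            restarts += 1
    return state, restarts


def example_step(state, reading):
    """Toy control step that blows up on an unexpected zero reading."""
    return {"sum": state["sum"] + 1.0 / reading}


final_state, restart_count = run_with_recovery(
    [1, 2, 0, 4], example_step, lambda: {"sum": 0.0}
)
```

In this toy run the zero reading causes one restart and the loop continues from a clean state. In a real system fresh_state() would presumably also pull in whatever saved state still passes its checks, rather than starting from nothing.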

How much effort this is worth is an engineering decision. But assuming that it's zero without even thinking about it is usually a mistake.

Henry
