[AR] failure recovery (was Re: Re: Flight Controller Features)
- From: Henry Spencer <hspencer@xxxxxxxxxxxxx>
- To: Arocket List <arocket@xxxxxxxxxxxxx>
- Date: Sun, 17 Jan 2016 23:20:10 -0500 (EST)
On Sat, 16 Jan 2016, Norman Yarvin wrote:
...if you really needed a piece of state, that might not be enough; you
might have to save two copies of it, each with a checksum, and refresh
them alternately.
Which is actually a rather trivial increment in complexity over having
just one save area, i.e. you might as well do it if you're saving at all,
unless the save memory is a very limited resource.
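For concreteness, here is a minimal sketch of the two-copy scheme described above. It is not anyone's flight code: the record layout (a magic byte, a sequence number, a length, and a CRC-32), the name DualSlotStore, and the use of two in-memory byte arrays to stand in for the save area are all assumptions made for illustration. The point it tries to show is that a save always overwrites the older or invalid copy, so an interrupted or corrupted write can never destroy the last good one.

```python
import struct
import zlib

MAGIC = 0xA5                     # hypothetical marker for "slot has been written"
PREFIX = struct.Struct("<BIH")   # magic, sequence number, payload length
CRC = struct.Struct("<I")        # CRC-32 computed over prefix + payload


class DualSlotStore:
    """Two save slots, refreshed alternately; each copy carries its own checksum."""

    def __init__(self, slot_size=64):
        self.slot_size = slot_size
        # Stand-in for the real save memory (assumption: two fixed-size areas).
        self.slots = [bytearray(slot_size), bytearray(slot_size)]

    def _read_slot(self, index):
        """Return (sequence, payload) if the slot checks out, else None."""
        raw = bytes(self.slots[index])
        magic, seq, length = PREFIX.unpack_from(raw, 0)
        end = PREFIX.size + length
        if magic != MAGIC or end + CRC.size > len(raw):
            return None
        (stored_crc,) = CRC.unpack_from(raw, end)
        if zlib.crc32(raw[:end]) != stored_crc:
            return None
        return seq, raw[PREFIX.size:end]

    def load(self):
        """Return the newest valid (sequence, payload), or None if neither is valid."""
        valid = [c for c in (self._read_slot(0), self._read_slot(1)) if c is not None]
        return max(valid, key=lambda c: c[0]) if valid else None

    def save(self, payload):
        """Write payload into whichever slot does NOT hold the newest valid copy."""
        if PREFIX.size + len(payload) + CRC.size > self.slot_size:
            raise ValueError("payload too large for slot")
        copies = [self._read_slot(0), self._read_slot(1)]
        newest = None
        for i, copy in enumerate(copies):
            if copy is not None and (newest is None or copy[0] > copies[newest][0]):
                newest = i
        target = 0 if newest is None else 1 - newest
        # Sequence wraparound is ignored in this sketch.
        seq = 0 if newest is None else copies[newest][0] + 1
        body = PREFIX.pack(MAGIC, seq, len(payload)) + payload
        self.slots[target][:len(body) + CRC.size] = body + CRC.pack(zlib.crc32(body))
```

After two saves, corrupting the newer slot makes load() fall back to the older copy, which is the whole benefit over a single save area. On real hardware the write to the target slot would also need to be ordered so that the checksum lands last; that detail is left out here.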
That level of paranoia, worrying about rare subcases of rare cases, is
worth it when one is writing an operating system to be used by
millions of people. For amateur rockets, not so much: you're not
going to have literally trillions of tries at hitting the ultra-rare
behavior.
Assuming, that is, that you have guessed right about that behavior being
"ultra-rare". Usually the problem is not that you drew the ace of spades,
but that your estimate of the probability was wrong. Apollo program
alarms were supposed to be "can't happen" behavior, *never* happening in
flight, yet the probability that Apollo 11, as flown, would have them
during descent was actually 100%.
The issue is not whether it makes sense to take precautions against some
specific bit of supposedly-ultra-rare behavior, but rather, whether it
makes sense to take generic precautions against surprises, rather than
just having the software throw up its hands and give up when something
weird and unexpected happens. Weird nonsense does happen. Surprisingly
often, simple attempts to recover and carry on actually work adequately
well for most cases, so they are worth a bit of thought and effort.
In fact, generic precautions are what you *want*, not only because the
individual types of weird nonsense seem so unlikely, but also because
attempts to automatically *diagnose* the exact problem are notorious for
making mistakes. (The problem wasn't an expected one, or the symptoms
weren't as expected, or the rarely-exercised diagnosis code itself had
bugs.) It's not only easier, but also more robust, to just punt all the
weird nonsense to generic "standardize state and try to resume" code.
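As a rough illustration of what generic "standardize state and try to resume" code might look like, here is a small sketch. Everything in it is hypothetical: the State fields, the sanity range in standardize(), and the restart limit are invented for the example. What matters is the shape: the handler makes no attempt to work out what went wrong; it forces the state back into a known-sane form, drops the offending input, and carries on.

```python
from dataclasses import dataclass


@dataclass
class State:
    # Hypothetical flight state, for illustration only.
    altitude_m: float = 0.0
    max_altitude_m: float = 0.0


def sane(value):
    """Assumed plausibility check: a real number in a believable altitude range."""
    return isinstance(value, (int, float)) and 0.0 <= value <= 100_000.0


def standardize(state):
    """Keep fields that pass the sanity check; reset the rest to defaults."""
    alt = state.altitude_m if sane(state.altitude_m) else 0.0
    peak = state.max_altitude_m if sane(state.max_altitude_m) else alt
    return State(altitude_m=alt, max_altitude_m=max(alt, peak))


def step(state, reading):
    """One update. It mutates state in place, so a failure can leave it half-updated."""
    state.altitude_m = reading
    state.max_altitude_m = max(state.max_altitude_m, reading)
    return state


def run_with_recovery(step_fn, state, readings, max_restarts=3):
    """Run step_fn over readings; on any exception, standardize and resume."""
    restarts = 0
    for reading in readings:
        try:
            state = step_fn(state, reading)
        except Exception:
            # No diagnosis of the cause: just count it, clean up, and continue.
            restarts += 1
            if restarts > max_restarts:
                raise
            state = standardize(state)
    return state, restarts


final, restarts = run_with_recovery(step, State(), [10.0, 25.0, None, 30.0])
```

Here the None reading blows up halfway through step(), leaving altitude_m invalid; standardize() resets that field, keeps the still-valid peak, and the next reading is processed normally. The restart limit is one possible guard against recovering forever from a fault that is not transient.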
How much effort this is worth is an engineering decision. But assuming
that it's zero without even thinking about it is usually a mistake.
Henry