[AR] S/W failures, restart protection

From: Robert Watzlavick <rocket@xxxxxxxxxxxxxx>
To: "arocket@xxxxxxxxxxxxx" <arocket@xxxxxxxxxxxxx>
Date: Tue, 12 Jan 2016 20:46:08 -0600

To respect Monroe's flight controller goals, I'm starting a new thread if anybody is interested in carrying on the discussion about software failures. A few questions and observations based on today's discussion:

How often in real flight systems do computers fail due to H/W issues? I'm thinking power supplies, CPU/RAM, other silicon failures, connectors, etc. Desktop power supplies die all the time but I'm talking about flight hardware. In my experience, not very often. Well, connectors maybe sometimes. Just an observation, not trying to say it should or shouldn't be planned for. Also, in my experience, having a backup processor that runs the exact same S/W as the primary is completely useless because the backup usually fails for the same reason as the primary (unexpected data uncovers a bug). And the failover path itself often isn't sufficiently tested, so bugs sometimes crop up in the process of switching over to the backup.
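To make the identical-backup point concrete, here is a minimal sketch. Everything in it is hypothetical and for illustration only: `ratio_v1` stands in for flight code with a latent bug, `ratio_v2` for an independently written version, and `run_with_failover` for the switchover logic. None of it comes from a real flight system.

```python
def ratio_v1(reading, full_scale=4095):
    """Hypothetical routine with a latent bug: divides by zero at full scale."""
    return reading / (full_scale - reading)


def ratio_v2(reading, full_scale=4095):
    """Hypothetical, independently written version that guards full scale."""
    headroom = full_scale - reading
    if headroom <= 0:
        return float("inf")
    return reading / headroom


def run_with_failover(primary, backup, reading):
    """Try the primary; on any exception, fail over to the backup."""
    try:
        return "primary", primary(reading)
    except Exception:
        try:
            return "backup", backup(reading)
        except Exception:
            return "both failed", None


if __name__ == "__main__":
    # Same S/W on both channels: the unexpected input takes out both.
    print(run_with_failover(ratio_v1, ratio_v1, 4095))
    # Different S/W on the backup: the same input is survivable.
    print(run_with_failover(ratio_v1, ratio_v2, 4095))
```

With the same code on both channels, the one input that trips the primary trips the backup too, so the second processor only helps against H/W faults. The dissimilar backup is shown purely as a contrast, not as a recommendation.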

While general purpose OSs have come a long way, I don't necessarily agree that they're as good as an RTOS. Are there any flight control systems for real airplanes and rockets that don't use an RTOS? The "lots of eyeballs" thing is a fallacy - I've had enough Linux driver headaches to know that developers slip broken stuff into the kernel all the time. For safety critical use of an RTOS, you typically pay the vendor a bunch of $$$ to perform additional testing and certification. And then you lock it and the toolchain down for *years* and don't upgrade it unless you really have to. Some bugs will invariably be uncovered along the way but I'd rather have a stable system with known bugs than something with all new bugs. Don't get me wrong - I'm a big Linux fan but I wouldn't consider it for anything safety critical. I see mystery pauses all the time in non-RTOS systems. What are you going to do when you have a sequence that has to happen within a few ms and some process hogs all the CPU time for a couple of ticks? Having said that, is it good enough for an amateur rocket? Yes - definitely. I'm using an RTOS primarily because I thought it would be interesting to play with various architectures in a realistic environment.
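For anyone who wants to see the mystery pauses on their own machine, here is a rough sketch that runs a 1 ms periodic loop and records how late each wakeup is. The function name `measure_lateness`, the 1 ms period, and the cycle count are arbitrary choices for illustration. This only observes scheduling jitter on a general purpose OS, it says nothing about what an RTOS would do on the same hardware.

```python
import time


def measure_lateness(period_s=0.001, cycles=200):
    """Run a periodic loop and return how late each wakeup was, in seconds."""
    lateness = []
    next_deadline = time.monotonic() + period_s
    for _ in range(cycles):
        delay = next_deadline - time.monotonic()
        if delay > 0:
            time.sleep(delay)  # the OS may wake us later than requested
        lateness.append(time.monotonic() - next_deadline)
        next_deadline += period_s  # fixed schedule, so lateness does not drift
    return lateness


if __name__ == "__main__":
    late = measure_lateness()
    print(f"worst-case lateness: {max(late) * 1e3:.3f} ms")
    print(f"mean lateness:       {sum(late) / len(late) * 1e3:.3f} ms")
```

Running it while something else loads the CPU is the interesting case: the mean usually stays small, but the worst case is what matters for a sequence with a hard deadline.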

I wouldn't necessarily discard the idea of a VM. In certain cases, it could make integration of existing capability much easier. Just run it on another processor, or in a time and memory partition separate from the important stuff. There's lots of existing code (offboard file transfers, stack checking routines, logging, for example) that can run at lower priorities and typically doesn't have deadlines. Heck, you could even run a web server on your vehicle in a VM if you wanted to. Not everything needs to run real time.

-Bob

Follow-Ups:
- [AR] Re: S/W failures, restart protection
  - From: Edward Cree
- [AR] Re: S/W failures, restart protection
  - From: George Herbert
- [AR] Re: S/W failures, restart protection
  - From: Robert Watzlavick
