SpaceX’s recent F9R test failure got me wondering about fault-tolerant design. SpaceX said the loss of the rocket came down to a single-point failure of a sensor; on a Falcon 9 en route to orbit, that sensor would have been voted out.
I deal with similar architectures quite a bit in my day job. Some parameters (airspeed, for example) are very important to how things fly. If you are actually at 400 knots and the airplane thinks it is at 150 knots, you will get an unexpected response!
The idea of a voting scheme is pretty straightforward. A simple FRC example would be a very low-geared arm with three potentiometers. If potentiometer A says it is at 120 degrees, potentiometer B says it is at 120 degrees, and potentiometer C says it is at 360 degrees, potentiometer C is most likely wrong, so the robot votes it out and acts as if the arm is at 120 degrees. With a single bad potentiometer and no voting, your low-geared arm might rip the robot apart!
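The two-out-of-three vote above amounts to taking the median of the redundant readings. Here is a minimal sketch of the idea; the function name and the 5-degree outlier tolerance are illustrative, not from any actual robot code:

```python
def vote(a_deg, b_deg, c_deg, tolerance=5.0):
    """Return the median of three redundant angle readings,
    plus any readings that disagree with the median.

    The median automatically "votes out" a single wildly wrong
    sensor: given 120, 120, and 360 degrees, the voted value is
    120 and the 360-degree pot is flagged as an outlier.
    """
    readings = sorted([a_deg, b_deg, c_deg])
    median = readings[1]
    outliers = [r for r in (a_deg, b_deg, c_deg)
                if abs(r - median) > tolerance]
    return median, outliers
```

Note that this only tolerates one bad sensor; if two of the three fail the same way, the vote goes with the failures.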
Another common FRC example is to pair a potentiometer with limit switches. That way, if your closed-loop feedback can’t keep up, the limit switch will protect your structure.
We’ve done the potentiometer and limit switch before. What have others done? I bet there’s some pretty fancy fault tolerant magic out there…
If you take a look at how the Poofs determine which goal is hot, in both their CheesyVision and their on-robot sensor implementation they count votes from each of the sides. A goal is only hot if the vote difference favors that sensor or that hand by some minimum threshold (something like 10).
Whereas just asking once “Which side is hot?” returns a result with considerable noise, they are effectively asking many, many times and aggregating the results to extract the signal from the noise.
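That aggregation-with-threshold idea can be sketched in a few lines. This is my own illustration of the concept described above, not 254’s actual code; the ±1 sample encoding and the threshold of 10 are assumptions:

```python
def hot_goal(samples, threshold=10):
    """Aggregate many noisy left/right 'hot goal' observations.

    samples is a list of +1 ("left looks hot") and -1 ("right
    looks hot") readings. Only declare a side hot once the vote
    margin clears the threshold; otherwise admit uncertainty.
    """
    margin = sum(samples)
    if margin >= threshold:
        return "left"
    if margin <= -threshold:
        return "right"
    return "unknown"
```

The key property is that a handful of noisy misreads can’t flip the answer; it takes a sustained majority.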
It is common in industrial applications to have soft and hard limits. The soft limits are software/firmware inputs, but the hard limits cut power and implement safety protocols. Duplicate analog sensors, encoders, etc. are also common. In FIRST the rules kinda prohibit “hard” limits, and duplicating other sensors is kinda expensive.
This year we doubled up the limit sensors on our cam-based catapult. Not knowing when the cam was in the launch position was bad news!
The TechnoKats’ 2014 robot has three completely independent sensors on its giant flingapult to detect “end of travel”, plus a software-enforced time limit on how long it can be powered forward. There’s an optical sensor that gets a reflection from the carriage as it goes by, a physical limit switch adjacent to the padded hard stop, and a gear tooth sensor to count distance from the home position. Any two of them can fail, and the flinger arm will still stop before trying to break something.
On previous years’ robots, there have sometimes been single sensors that can cause either loss of control or self-destructive behavior if they get disconnected. We typically tried to detect such malfunctions and disable the function in software, but without redundant sensors that’s not easy.
You can get some fault detection even with a single sensor by doing range and rate sanity checking.
Examples:
You know a pot should be reading between 90 and 120 degrees, but it returns a reading of 10 or 200 degrees.
You are supplying power to a motorized subsystem and it should be moving, but the position (or speed) sensor says that it is not.
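Both checks above can run every loop iteration against a single sensor. Here is a minimal sketch; the range limits, the maximum slew rate, and the 20 ms sample period are illustrative placeholders:

```python
class SensorSanityChecker:
    """Single-sensor fault detection via range and rate checks.

    Flags a pot as suspect if the reading is outside the
    mechanically possible range, or if it jumps faster between
    samples than the mechanism can physically move.
    """
    def __init__(self, min_deg=90.0, max_deg=120.0,
                 max_rate_deg_per_s=200.0, dt=0.02):
        self.min_deg = min_deg
        self.max_deg = max_deg
        self.max_step = max_rate_deg_per_s * dt  # max change per sample
        self.last = None

    def check(self, reading_deg):
        """Return a list of fault strings (empty if reading looks OK)."""
        faults = []
        if not (self.min_deg <= reading_deg <= self.max_deg):
            faults.append("out of range")
        if self.last is not None and \
                abs(reading_deg - self.last) > self.max_step:
            faults.append("rate too high")
        self.last = reading_deg
        return faults
```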
Increasingly, “smart actuators” are being used in automobiles. These actuators have a tiny microcontroller that controls and monitors the actuator’s performance and reports trouble codes via the CAN bus.
Fault tolerance can be applied to purely mechanical systems as well. Take, for example, spring-biased pneumatic cylinders: double-acting cylinders with an internal spring that forces them into the designed extended/retracted configuration.
As a team that uses a large number of pneumatic components, we are constantly concerned with the question of what happens if we lose pressure mid-match.
Here is the basic philosophy of the above diagnostic.
If:
The PID has a reasonably large error input (error = setpoint - process_variable) (i.e. the appendage should be moving since the PID is commanding the motor)
AND
The speed of the appendage as measured by the sensor is very low (i.e. the appendage is NOT moving)
AND
Both of the above occur continuously for a certain period of time.
Then:
The sensor is broken or disconnected either electrically or mechanically, so shut off the motor (or just shut off PID control and revert to manual control).
We also do simple out-of-range checks that will shut down the motor.
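The if/and/then logic above maps almost directly to code. This is a sketch under assumed thresholds (error threshold, speed threshold, 0.5 s trip time, 20 ms loop period are all illustrative):

```python
class StallDetector:
    """Detect a broken/disconnected sensor or jammed mechanism.

    Trips when the PID error is large (the motor should be
    moving) AND the measured speed is near zero (it isn't),
    continuously for a set period of time. The caller then shuts
    off the motor or reverts to manual control.
    """
    def __init__(self, error_threshold=10.0, speed_threshold=0.5,
                 trip_time_s=0.5, dt=0.02):
        self.error_threshold = error_threshold
        self.speed_threshold = speed_threshold
        self.trip_count = int(trip_time_s / dt)  # consecutive samples
        self.count = 0

    def update(self, pid_error, measured_speed):
        """Call once per control loop; returns True once tripped."""
        if abs(pid_error) > self.error_threshold and \
                abs(measured_speed) < self.speed_threshold:
            self.count += 1
        else:
            self.count = 0  # any sign of motion resets the timer
        return self.count >= self.trip_count
```

Requiring the condition continuously (rather than cumulatively) keeps momentary stalls at direction changes from tripping the fault.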
Carefully choosing between “NO” (normally open) and “NC” (normally closed) switches can be an important decision. E.g., in 2013 we had NC switches to detect when we ran into the bridge; they helped the robot self-align by slowing or disabling forward motion on one side of the drivetrain. Well… if one switch got damaged or disconnected, the robot wouldn’t drive forward! We implemented a manual override because we could do that quickly in code (i.e. the sensors exerted no authority until the operator said they could), but a more robust solution would have been NO switches.
We do like our manual override options. If we see the robot start to do something it’s not supposed to do, we usually have a switch or a button to turn off any automated processes and turn complete control over to the operator. This is, of course, a double-edged sword: how far to push a potentially vulnerable robot is up to the driver’s judgment.
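A useful detail in an override like this is latching: once the driver kills automation, it stays off until deliberately re-armed, rather than snapping back when the button is released. A hypothetical sketch of that arbitration:

```python
class ManualOverride:
    """Latched manual override for an automated subsystem.

    Automation runs normally until the driver hits the override;
    from then on the manual command passes straight through until
    someone explicitly re-arms the automation.
    """
    def __init__(self):
        self.overridden = False

    def select(self, auto_cmd, manual_cmd, override_pressed):
        if override_pressed:
            self.overridden = True  # latch: stays off after release
        return manual_cmd if self.overridden else auto_cmd

    def rearm(self):
        self.overridden = False
```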
One simple, hard-learned lesson is to use timers to back up any automated action that depends on a sensor. For example: drive forward until the range sensor says you are at a specified range. A timer backup that stops the drive after a certain time can prevent field and robot damage.
In 2013 we used a speed sensor for our shooting wheels in autofire mode. One time in practice the retroreflective tape was on the wrong side of the wheel, so the optical sensor never registered that the wheels were up to speed. In autonomous, a 1.5-second timer would override the sensor.
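The timer-backup pattern looks roughly like this. All of the function names, the drive power, and the timing values are made up for illustration; real robot code would run inside the framework’s periodic loop rather than blocking like this:

```python
import time

def drive_to_range(get_range_m, set_drive, target_m=0.5,
                   timeout_s=1.5, poll_s=0.02):
    """Drive forward until the range sensor says we've arrived,
    but never for longer than timeout_s.

    The timeout is the backup: even if the sensor is broken,
    misaimed, or unplugged, the drive still stops.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if get_range_m() <= target_m:
            break  # sensor says we're there
        set_drive(0.3)
        time.sleep(poll_s)
    set_drive(0.0)  # always stop, whichever condition ended the loop
```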
118 in 2012 doubled up sensors on their robot. I’ve never investigated whether or not they continued to do that through 2013 and 2014. Maybe someone on that team could shed some light on the cost/benefit.