Detecting failed hardware in software

As a summertime project to keep myself busy, I’ve been writing a ‘Health Checker’, partly inspired by 2767’s (which gave a visual display of amperage testing on drive train motors).

While amperage testing and comparing may be something to work on in the future, I’m currently working on detecting if hardware is present.

My current way of detecting errors is checking if the ErrorCode returned by most TalonSRX API calls is ‘OK’ or not.

After the combination of doing some reading and what I like to call the ‘anti-tip-the-robot-over-encoder’ being accidentally unplugged and causing a scare at the build site, I’ve realized that this isn’t enough to determine if an encoder is present.

What are some ways teams are checking their hardware in software (minus getting up and looking to make sure they’re plugged in; that’s too easy)?

Thanks in advance

Forgot to add, our team’s README provides examples of how we are checking for errors as of now.

One of the first things I’d toss out is “Why”? What’s the end goal of doing checking? Defining this will help scope what things are reasonable to check versus not.

In general, any verification requires redundant pieces of information which can be used to verify a condition. Two pieces are required to detect an issue: if they indicate mismatching conditions, clearly one must be wrong, but you won’t know which. Three are required to “correct” for it (i.e. to know which piece of info was incorrect).
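A minimal sketch of the two-versus-three point, assuming three redundant readings of the same quantity and a fixed agreement tolerance. The class and method names are made up for illustration and are not from any FRC library.

```java
/** Sketch of redundancy checks; every name here is hypothetical. */
final class RedundancyCheck {
    /** Two readings: a disagreement can be detected, but not attributed. */
    static boolean disagree(double a, double b, double tolerance) {
        return Math.abs(a - b) > tolerance;
    }

    /**
     * Three readings: returns the index (0, 1, or 2) of the single reading
     * that disagrees with the other two, or -1 if there is no clear odd one out
     * (all agree, or the readings are scattered).
     */
    static int oddOneOut(double[] r, double tolerance) {
        boolean ab = !disagree(r[0], r[1], tolerance);
        boolean ac = !disagree(r[0], r[2], tolerance);
        boolean bc = !disagree(r[1], r[2], tolerance);
        if (bc && !ab && !ac) return 0; // readings 1 and 2 agree, 0 is the outlier
        if (ac && !ab && !bc) return 1; // readings 0 and 2 agree, 1 is the outlier
        if (ab && !ac && !bc) return 2; // readings 0 and 1 agree, 2 is the outlier
        return -1;
    }
}
```

With two sources only `disagree` is available; the third source is what lets `oddOneOut` name the faulty one.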

Any datalink (SPI, I2C, CAN) device should have either a status/ID register, or at least an expected communication sequence. If this cannot be read properly at startup, you can set a flag to indicate the device is faulted and unavailable.

Since motors are expected to draw current, you can compare the actual current against some expected value.

Same for encoders - if current is flowing to the mechanism, you can deduce some motion should be occurring. The better you know the mechanism, the tighter the constraint you can put on what motion is reasonable.
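As a rough sketch of that current-versus-motion cross-check: the thresholds are assumptions that would have to be tuned per mechanism, and the names are hypothetical rather than part of any vendor API.

```java
/** Sketch: flag an encoder when the motor is working but no motion is reported. */
final class StallCheck {
    /**
     * Returns true when the motor draws more than minCurrentAmps but the
     * encoder velocity magnitude stays at or below minVelocity. That points to
     * either a dead encoder or a jammed mechanism; this check alone cannot
     * tell the two apart.
     */
    static boolean encoderSuspect(double currentAmps, double encoderVelocity,
                                  double minCurrentAmps, double minVelocity) {
        boolean motorWorking = currentAmps > minCurrentAmps;
        boolean encoderMoving = Math.abs(encoderVelocity) > minVelocity;
        return motorWorking && !encoderMoving;
    }
}
```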

FYI, one other example from industry: lots of automotive speed sensors (encoders) use an electrical-current based protocol, rather than just voltage like FRC encoders.
~20mA = “High”
~4mA = “Low”
0mA = Unplugged (fault)
I would believe this sort of signaling is beyond standard FRC capabilities right now, unless you design your own breakout board?
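For illustration only, decoding that kind of current-based signal could look like the sketch below. The 2 mA and 12 mA cut-offs are assumptions (midpoints between the nominal 0, 4, and 20 mA levels), not values from any sensor datasheet.

```java
/** Sketch: classify a current-loop sensor reading; thresholds are assumed. */
final class CurrentLoopSignal {
    enum Level { HIGH, LOW, FAULT }

    static Level classify(double milliamps) {
        if (milliamps < 2.0) return Level.FAULT; // near 0 mA: wire open or unplugged
        if (milliamps < 12.0) return Level.LOW;  // near 4 mA
        return Level.HIGH;                       // near 20 mA
    }
}
```

The useful property is that "unplugged" is a distinct third state instead of looking identical to a logic low.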

But, once you detect a fault, what do you do? You can lock out certain functionality of the robot. You could just display a warning to operators, or even just to the pit crew. You could do nothing. You could use an alternate method of calculation. This is where it wraps back around to the first question: What’s the end goal?

FMEA may be of interest to you?

I don’t know about OP’s intentions, but this sounds like it would be a cool thing to run in Test mode prior to a match. You could start with your checks on the SRXs to ensure they’re there and functioning. Then run some basic motion checks by running each motor/subsystem. If there is an encoder on one of the mechanisms, you could look for its readings. If it doesn’t report something back, you either have a motor not moving, or an issue with the encoder.

For motors without encoders you could look at current draw and compare actual vs expected. In a controlled test environment, I would imagine these values would be fairly repeatable (but don’t quote me on that, I’m by no means an expert).

If you have pneumatics, you could test firing each one individually. And should any of these tests return an error or unexpected results, you could either pause, halt, or continue the test, and post a list of errors to SmartDashboard.

One of the first things I’d toss out is “Why”?

The purpose of designing this is for in-pit and pre-match identification of problems.

Our horror story behind having a fear of failing hardware was that we had an encoder fail that we used to limit drive train speed when raised to a certain height. Since that experience, I’ve been working on figuring out how we can test for failures, how we can make fewer mistakes, and how we can eliminate more noise from our system.

I believe by creating a tool that can be used easily and reliably by my pit crew to diagnose and fix problems before they create unwanted effects helps me reduce noise in the system.

My current question is if there are ways teams are detecting sensor presence without running a test program on the robot. I’d love a srx.isEncoderPresent() call, (and I’m guessing that’s slightly out of the realm of possibility), but I figured I’d ask more experienced teams to see what the trend for doing this is.

FMEA may be of interest to you?

My team ‘does’ an FMEA for our bot, but we don’t ever use it or add more failures after they’re discovered. An FMEA could be an effective tool for this.

First off, designing your mechanisms to not necessarily need sensors is good.
Example: Using an FP loop for velocity on your drivetrain will ensure that if the joystick is at rest, the robot will not move, even if an encoder is unplugged.
We use SRX Mag Encoders on our robot, which provide dual PWM and Quadrature outputs. This means that you can easily use the checkSensorHealth() method in the Talon API to see if the encoder is plugged in. For an SRX Mag Encoder, it just checks the presence of the PWM signal, which shows that the encoder is plugged in.

I’ve played around with this idea, but never got too deep into implementing it.

If you have an effective model of your system, you can compare real values with modeled values (ie the difference between your estimator and observer) and report if this crosses a specific threshold.

Something I tried to implement on last year’s robot after an elevator motor failure, was to check the current draw from pairs of grouped motors and report if one was significantly different than the other. This is pretty simple and could save a match, but my implementation never worked because I didn’t have time to debug it.

A couple years ago we implemented a check to make sure that an encoder was plugged in and working on our lift. As I recall, when it started up we checked to make sure it was reading within a specified range. If it was not, we defaulted to manual control of the lifter rather than letting it lift until a certain value was reached. This was done after a couple of practice incidents where the motors did not stop lifting until the limit switch at the hard stop. If manual mode was engaged it would lift for 2 seconds and then a warning was displayed. (I think we also made an LED on the robot blink red.)
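A sketch of that kind of startup range check, assuming the encoder should read somewhere inside a known window when the robot boots. The names and the window are hypothetical.

```java
/** Sketch: pick the lift control mode from a startup sanity check. */
final class LiftStartupCheck {
    enum ControlMode { CLOSED_LOOP, MANUAL }

    /**
     * Falls back to MANUAL when the startup reading is outside the expected
     * window. A NaN reading fails both comparisons, so it also lands in MANUAL.
     */
    static ControlMode chooseMode(double startupReading,
                                  double minExpected, double maxExpected) {
        boolean plausible = startupReading >= minExpected
                && startupReading <= maxExpected;
        return plausible ? ControlMode.CLOSED_LOOP : ControlMode.MANUAL;
    }
}
```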

We ran the pit check diagnostic on all motors by applying a set voltage and recording current and speed and comparing them to historical values. This normally uncovers bad motors, tight/loose belts, and binding mechanisms. In 2017 it found 775s that were about to fail and in 2018 found a suspicious elevator problem.
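The comparison step of a pit check like this might be sketched as follows. It assumes a stored baseline current and speed per motor and a single fractional tolerance; none of these names come from the team's actual code.

```java
/** Sketch: compare a pit-check measurement against a historical baseline. */
final class BaselineCheck {
    /** True when measured is within fractionalTolerance (0.15 = 15%) of baseline. */
    static boolean withinBaseline(double measured, double baseline,
                                  double fractionalTolerance) {
        return Math.abs(measured - baseline)
                <= Math.abs(baseline) * fractionalTolerance;
    }

    /** A motor passes only if both its current and its speed look normal. */
    static boolean motorHealthy(double amps, double speed,
                                double baselineAmps, double baselineSpeed,
                                double fractionalTolerance) {
        return withinBaseline(amps, baselineAmps, fractionalTolerance)
                && withinBaseline(speed, baselineSpeed, fractionalTolerance);
    }
}
```

High current with low speed at the same applied voltage would fail both halves, which is the pattern you would expect from binding or a dying motor.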

Maccopacco, I believe you’re already familiar with this process.

Due to the violent nature of a failed elevator encoder (elevator would attempt a sub orbit trajectory), we would check to make sure the elevator encoder reading was progressing as expected (every 100ms I think) during auto and tele and if we saw no encoder movement then disable the elevator. This could be applied to any critical mechanism that, if you lose the encoder, bad things happen.
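A rough sketch of that kind of progress watchdog, assuming it gets called every loop with the current time, the encoder position, and whether the mechanism is being commanded to move. The names are hypothetical, and the latch-until-reset behavior is a design choice of this sketch, not something described above.

```java
/** Sketch: trip when a commanded mechanism shows no encoder progress. */
final class EncoderWatchdog {
    private final double minDelta;      // smallest position change that counts as progress
    private final double windowSeconds; // how long to wait for progress (e.g. 0.1 s)
    private double lastPosition;
    private double lastProgressTime;
    private boolean tripped = false;

    EncoderWatchdog(double minDelta, double windowSeconds,
                    double startPosition, double startTime) {
        this.minDelta = minDelta;
        this.windowSeconds = windowSeconds;
        this.lastPosition = startPosition;
        this.lastProgressTime = startTime;
    }

    /** Call every loop. Returns true once the mechanism should be disabled. */
    boolean update(double now, double position, boolean commandedToMove) {
        boolean moved = Math.abs(position - lastPosition) >= minDelta;
        if (!commandedToMove || moved) {
            // Idle or making progress: restart the timing window from here.
            lastPosition = position;
            lastProgressTime = now;
        } else if (now - lastProgressTime >= windowSeconds) {
            tripped = true; // latched: stays true for the life of this object
        }
        return tripped;
    }
}
```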

I agree, a drive and programming team philosophy we have is to have absolute control over our bot and have things fail in the desired state.

We use positioning on our systems and we have safeties, but we always have the option for percent vbus control and overrides for all safeties.

Thank you! This is what I was looking for! Next time I’m at our build space I’ll test the responses of this.

I checked out the code on your team’s GitHub; could you point me to where your attempt is? I don’t know a ton of Python, but I can probably figure out enough to get an idea of where to start.

I like the concept of continually checking the system. It’s got me thinking about writing auto in closed loop and open loop for redundancy.

Here’s the link.

I’m not sure what I was thinking when I wrote this…

The idea was to calculate the percent difference, i.e. |master - slave| / master, and check if this was outside a certain bound, but I see about 3 bugs just looking at it now. I ended up disabling it because it was reporting false positives (imagine that) and never got a chance to properly test what amount of current deviation was “normal” or even if this was a reliable indicator of failure.
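Not the linked code, just a minimal sketch of the same percent-difference idea in Java. The minimum-current guard is an addition of this sketch: at low load the ratio divides by a value near zero, which is one plausible (unverified) source of false positives. All names and thresholds are hypothetical.

```java
/** Sketch: compare current draw between two motors driving the same gearbox. */
final class PairedCurrentCheck {
    /** |master - follower| / |master|; the follower is the "slave" in the formula. */
    static double percentDifference(double masterAmps, double followerAmps) {
        return Math.abs(masterAmps - followerAmps) / Math.abs(masterAmps);
    }

    /**
     * Reports a mismatch only when the master draws at least minAmps, so that
     * noise around zero current is not treated as a failure.
     */
    static boolean mismatch(double masterAmps, double followerAmps,
                            double minAmps, double maxFraction) {
        if (Math.abs(masterAmps) < minAmps) {
            return false; // too little load to judge
        }
        return percentDifference(masterAmps, followerAmps) > maxFraction;
    }
}
```

Note the asymmetry: if the master itself is the motor that died, its current stays under `minAmps` and the check never fires, so a symmetric version would guard on the larger of the two currents instead.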

For this system specifically we had it on a VP dual input, and the pinions came off at one point before loctiting the set screws. Ideally this would detect that, as our next-best test was supplying a small amount of voltage and having a freshman hold the backshaft (as a loaded motor wouldn’t rotate, but the unloaded one would.)

Detecting failed hardware in software is one thing, but I’m a much bigger fan of detecting failed software in hardware. Ex. (padded) hard limits, properly-sized circuit breakers, using Talon features like current/position/voltage limits.

The problem with software that second guesses hardware is that you’ve just increased the complexity of your software (and therefore the likelihood of a software failure).

My team has done this for a long time. Here’s something to check motors based on currents from several years ago:

This has only occasionally been useful. Most of the time people have just ignored the output of this check and some years the output of it hasn’t even gotten plumbed anywhere user visible.

That being said, there have been times when it has been helpful. For example, one time we had the robot wanting to turn a little bit to the left for no apparent reason and we were able to find quickly that there was a problem with one of the Anderson connectors to one of the motors. That being said, we probably would have found it not too much slower just looking at current readings.

asid61 and Mark Wasserman, your answers led me to what I was looking for.

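// The encoder counts as present if the Talon sees a PWM signal (non-zero rise-to-rise period)
mIO.encoderValid = elevator_1.getSensorCollection().getPulseWidthRiseToRiseUs() != 0;
// Force neutral output if closed-loop control is in use but the encoder is missing
neutral |= ((mState.usesClosedLoop || mWantedState.usesClosedLoop) && !mIO.encoderValid);

// Encoder not present or elevator too high: use the slowest drive speed
if (!mIO.encoderValid || pos > kElevator.TOP_ROTATION) {
    driveModifier = kElevator.SPEED_LIMIT_SLOWEST_SPEED;
// Encoder value good: scale drive speed down as the elevator rises
} else if (pos > kElevator.BOTTOM_ROTATION) {
    driveModifier = 1 - (pos / kElevator.TOP_ROTATION) + kElevator.SPEED_LIMIT_SLOWEST_SPEED;
// Encoder value below the limit: full speed
} else {
    driveModifier = 1;
}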

// reportAlert is a hypothetical stand-in for the Health Checker's alert call
if (!mIO.encoderValid) {
    reportAlert(AlertLevel.ERROR, "Encoder not detected");
}

This accomplishes what I set out to do: alert the pit crew if encoders are not present, keep closed-loop control from going haywire, and have our speed limiting fail in a safer state than flipping.

I expect to revisit and revive this thread when I get more into current tracking and testing on individual motors to determine other issues with the bot, but for now, thanks to all who have helped!