Finding CAN bus issues?

First, some background. When NEOs became legal, we switched some of our motors to NEOs and Spark MAXes to take advantage of the integrated encoders.

In 2019, it was mostly fine, but every once in a while our elevator, which was the only Spark MAX on the system, would behave a little jittery or would not respond. There was no obvious issue, to us anyway, but a CSA looked at our code, our robot, and the driver station, frowned, borrowed a screwdriver, played with our wires, and said that our wiring wasn’t so good. We tightened down the wiring, and it seemed to work.

Fast forward to this year. We’ve got Spark MAX/NEO combinations for all four drive wheels, plus our shooter. In two of our competition matches this year, at some point, we lost the left side drive train. We couldn’t move, unless you count spinning in a circle. The drive team checked the joysticks during the match (tank drive style, two-joystick control); both were responding normally. After the match, they inspected the robot. Everything mechanical looked normal. The software for this part of the drive is extra simple: read joystick, apply deadband, set motor speed. It works like a charm all the time. Nothing funky in the software that could produce some occasional error.
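
(For context, the drive code is essentially this shape. This is a minimal sketch assuming REVLib’s CANSparkMax and WPILib’s MathUtil.applyDeadband; the CAN IDs, deadband value, and names are placeholders, not our exact code.)

```java
import com.revrobotics.CANSparkMax;
import com.revrobotics.CANSparkMaxLowLevel.MotorType;
import edu.wpi.first.math.MathUtil;
import edu.wpi.first.wpilibj.Joystick;
import edu.wpi.first.wpilibj.TimedRobot;

public class Robot extends TimedRobot {
  // Placeholder CAN IDs and ports -- not our real assignments.
  private final CANSparkMax leftLeader = new CANSparkMax(1, MotorType.kBrushless);
  private final CANSparkMax rightLeader = new CANSparkMax(2, MotorType.kBrushless);
  private final Joystick leftStick = new Joystick(0);
  private final Joystick rightStick = new Joystick(1);

  @Override
  public void teleopPeriodic() {
    // Read joystick, apply deadband, set motor speed -- left and right in the
    // same block, so one side can't silently stop updating without the other.
    double left = MathUtil.applyDeadband(-leftStick.getY(), 0.05);
    double right = MathUtil.applyDeadband(-rightStick.getY(), 0.05);
    leftLeader.set(left);
    rightLeader.set(right);
  }
}
```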

Bring the robot back to the pits. Everything is normal. It works great.

So, we start asking, what happened? Drive team suspects software error. Of course. If you can’t see it, it must be a software error. Speaking as the software mentor, I’m confident it wasn’t a software error. No problem with chains. No problem with wheels or transmissions. Joystick was checked while still on the field. It was responding in the driver station, but the left side wasn’t moving.

CAN bus error, maybe?

I check the driver station. I’m not exactly a driver station expert. I check the event list: no exceptions or funky warnings. I look at CAN bus utilization in the log and see a fuzzy gray area centered around 30%, with no big spikes. I don’t know what I’m looking for, just “something odd”. However, I don’t see anything odd.

It’s not something crazy like bad configuration of CAN IDs. Those would cause problems every time. Software logic errors would show up a lot more frequently in simple code like this. It happened twice, out of 28 competition matches, 5 practice matches, and several hours of operation in our practice area. (Besides which, the code for moving the right side wheels is inside the same set of brackets as the left side wheels. Java code. If it wasn’t updating the left side, it wouldn’t update the right side.) No broken chains, and no sign of sticking wheels or anything funky mechanically.

My mind, though, goes back a few years to that CSA who saw something in our logs, frowned, and started fixing wires. What was it that she saw? I really have no idea myself. I’m not sure what information there was available. All I can see is utilization, and we’re fine. What do you suppose she saw? In general, is it common to have intermittent CAN errors, and if so, what would we see in the logs if it were happening?

And, is there any other non-mechanical way of causing two motors to stop responding? In one of the two matches where the phenomenon happened, we did have an abnormally high level of packet loss, but it seems odd that it would affect only one pair of motors, and no other system. Also, in the other match where it happened, we didn’t have high levels of packet loss.

Our pneumatic systems, and the other motors (two right side motors, plus shooter) were working fine.

Any suggestions would be welcome.

The CAN bus wiring should be a single daisy chain of devices from the roboRIO to the power distribution panel, which terminates the bus.

Could a physical interruption of the bus (yellow and green wires disconnecting) result in this failure? Are the two problem motors the “last two in the chain”, or are they somewhere in the middle?

The REV Hardware Client displays CAN errors when you highlight the CAN bus in the list on the left. This is from the perspective of the device you have connected via the USB cable.

I sometimes use a 'scope (Analog Discovery 2) to look for CAN bus issues, but this isn’t something many teams have on hand.

Either way, doing a wiggle/vibration test may help. CAN issues can be tough to pin down.

It does seem you’ve mostly ruled out mechanical/software, so digging in some here is a reasonable thing to do.

The issue you described sounds similar to a pinched CAN wire. The pressure at the pinch point can fluctuate, so it may short sometimes and not other times. The team I’m on experienced a similar issue where a wire pinched under a nut was the culprit.

We had these symptoms in ISR3, with the root cause being a tripped PDP breaker (adding current limits solved the problem).
Are you getting error messages about missing status frames? Are you configuring status frames?
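
For anyone reading along, that configuration looks roughly like this with REVLib. This is a sketch, not the original poster’s code; the CAN ID, current limit, and frame period are example values:

```java
import com.revrobotics.CANSparkMax;
import com.revrobotics.CANSparkMaxLowLevel.MotorType;
import com.revrobotics.CANSparkMaxLowLevel.PeriodicFrame;

public class DriveConfig {
  // Called once at robot init. Placeholder CAN ID and example values.
  public static CANSparkMax configureSpark(int canId) {
    CANSparkMax motor = new CANSparkMax(canId, MotorType.kBrushless);
    // Limit stall current to reduce the chance of a stalled NEO tripping a breaker.
    motor.setSmartCurrentLimit(40); // amps
    // Slow down a status frame this motor's data isn't used for, to cut bus load.
    motor.setPeriodicFramePeriod(PeriodicFrame.kStatus2, 50); // ms
    return motor;
  }
}
```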

Can you post the log for the particular match?

Many of -our- CAN bus woes could be traced to badly crimped locking Dupont connectors.
Another thing to consider: are you running 2019-era NEOs? The first generation had thinner encoder wires and (I think) wasn’t conformally coated. The wires can fatigue at the connectors, and kids can, um, crush the entire encoder cable with climb hardware… We are -still- running the NEOs from our first purchase… I’m thinking it’s time for a fresh start.

There is a v2 CAN bus cable for the Spark MAX that looks like a big improvement over the original. With the original, if the connector came out of the controller, you lost everything downstream on the bus. The newer cable bridges CAN H and CAN L.

A step change in the CAN utilization trace can flag a loss of communication with part of the bus.

CAN problems, especially the intermittent ones, are a PITA to solve. IMO, intermittent CAN problems usually come from either a bad component or bad connector somewhere along the CAN bus.

A good first step is to visually inspect all of the connectors and make sure nothing is loose. The Dupont connectors that come standard are not up to the task in my experience and will often start to wiggle loose over time. If you can replace them with something more substantial (soldered, lever nuts, Powerpole, etc.), that may help with reliability.

If you can’t find the problem visually, the only real way to find the culprit is to disconnect everything from the bus and re-connect one component at a time until something stops working. It’s difficult and time consuming, but eventually it usually works.

We had a similar issue this year. Suddenly forward was turning and turning was moving forward. Our driver was able to compensate and finish the match for a win! We went back to the pit, powered up, and everything was fine. We surmised that during a hard hit we had a power glitch that reset some of the Spark MAXes. That set the parameters back to defaults, and losing some inversion settings explained the unexpected behavior. We re-initialized the settings and sent a ‘burn to EEPROM’ command so the power-up defaults are our running defaults. The problem hasn’t recurred.

When I started several years back, the programmers used to send a ‘burn to EEPROM’ command at the start of every match. That’s overkill, but now I see why. I haven’t researched the EEPROM’s write endurance, but as an EE that’s my only concern with doing it that often.
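
In REVLib the call is burnFlash(); the configure-then-burn pattern looks roughly like this (a sketch with example inversion values, not our exact code):

```java
import com.revrobotics.CANSparkMax;

public class SparkSetup {
  // Run once (e.g. from a configuration routine), not every match.
  public static void configure(CANSparkMax leftSpark, CANSparkMax rightSpark) {
    leftSpark.restoreFactoryDefaults();  // start from a known state
    rightSpark.restoreFactoryDefaults();
    leftSpark.setInverted(false);        // example inversions -- set for your drivetrain
    rightSpark.setInverted(true);
    // Persist the settings so a brownout/reset comes back up with these values
    // instead of factory defaults.
    leftSpark.burnFlash();
    rightSpark.burnFlash();
  }
}
```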

We use WAGO 224-412 connectors for all our CAN wiring. They have a cousin connector that is rated for use in explosive environments where human lives are at stake. As long as you get the strip length correct (the guide is marked on the side of every connector), these are great. The only CAN wiring failures we’ve had since are at the Spark MAX connector.

High level… how do you use the scope to find CAN issues?

It is possible to set things up to decode CAN packets, but I just put the scope across CANH/CANL and adjust the timebase and voltage, letting it auto-trigger. The types of problems I usually see show up best this way. There’s normally enough traffic that the display is showing a CAN packet most of the time.

It’s a good idea to start with a known good or very minimal configuration, to get a baseline of what to expect. You want to see pretty square pulses, all just about the same height. There will be some ringing, some rounding of corners, etc., but the signal should be well above any noise. Reflections, shorts, opens, etc. will all cause the signal to vary. Get a solid connection somewhere on the CAN bus, then wiggle various things and look for changes in the signal. It’s even better if you can move the point you’re probing, or add nodes to the bus, while you observe, but I normally find that one good probe is enough to find problems, or to mostly rule out electrical problems with CAN.

The other technique is to use the REV Hardware Client or Phoenix Tuner to keep an eye on the bus (particularly the former), but these are not as direct for finding problems at this layer.

Here are a couple of articles on the topic:

https://www.tek.com/en/solutions/industry/automotive/can-bus-troubleshooting#2-wedge

https://www.edn.com/can-eye-diagram-mask-testing-for-automotive-applications/

At the end of the day, though, it’s really down to knowing what a good baseline looks like and noticing when things differ from it. If there’s an intermittent problem, you can often see the pattern jump as things are wiggled. That kind of in-the-moment feedback is very helpful. CAN tends to hide problems, trying to do its job even under adverse conditions, which can make it harder to find intermittent problems, or even to reason about what is going on.

Do you mean a 221-412? The 224 series does not have a 412 model that I can find.

Thank you. We have had issues with CAN reliability the last few seasons and really want to figure it out! If investing in a scope will help pinpoint issues, we’ll make that happen! We added a CANivore this year, which has been great. I have a love/hate relationship with the red CAN LED on it.

I’m not sure I understand the value of the oscilloscope for this application. Most (all?) CAN issues that teams face can be resolved with a few simple (though admittedly tedious and error-prone) debugging steps. This should be in the WPILib docs; if it’s not, then we should add more details over the summer.

The problem folks face when debugging CAN is that, because it’s a single bus, an issue can be anywhere on the bus. That means checking wiring that runs all over the robot, often in places that are hard to reach. It can also mean the issue is internal to any of the devices on the bus. Looking at the signal with an oscilloscope doesn’t change this difficulty.

Well, we did all the right troubleshooting steps and still couldn’t nail our issues… replaced wiring, motors, etc., brought stuff in and out of the CAN network… Looking for a better method than trial and error.

A 'scope can tell you the nature of the problem and can spot signal integrity issues that may cause less obvious issues, such as reducing the effective bandwidth. It can also be a big help when looking for intermittent problems, since it provides more immediate feedback. I agree that it isn’t needed in many cases, and that it’s not something every team should need to have. Also, improved docs can only help – but teams miss things that are already well covered, so this is not a panacea.

Totally understand. CAN issues are quite frustrating, our own team dealt with this quite a bit. I just don’t know that spending hundreds (or thousands) of dollars on an oscilloscope is going to make the situation any better.

e.g. with the scope you can ‘probably kinda see’ whether messages look correct or not, and if your scope can decode CAN, even better. But the driver station itself can tell you if there are errors on the bus, and it doesn’t require any fiddly bits. That’s what this section of the driver station is for:

[screenshot of the driver station’s CAN metrics panel]

e.g. if Bus Off is higher than 0, or Receive/Transmit keep increasing, you know there is a problem. This is the same information the scope will tell you, but way cheaper.
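
You can also read those same counters from robot code via WPILib’s RobotController.getCANStatus() and push them to a dashboard, so you can watch them climb during a match. A minimal sketch (the dashboard key names are arbitrary):

```java
import edu.wpi.first.hal.can.CANStatus;
import edu.wpi.first.wpilibj.RobotController;
import edu.wpi.first.wpilibj.smartdashboard.SmartDashboard;

public final class CanMetrics {
  // Call periodically (e.g. from robotPeriodic) to log the same CAN metrics
  // the driver station shows.
  public static void publish() {
    CANStatus status = RobotController.getCANStatus();
    SmartDashboard.putNumber("CAN/Utilization", status.percentBusUtilization);
    SmartDashboard.putNumber("CAN/BusOffCount", status.busOffCount);
    SmartDashboard.putNumber("CAN/TxFullCount", status.txFullCount);
    SmartDashboard.putNumber("CAN/ReceiveErrors", status.receiveErrorCount);
    SmartDashboard.putNumber("CAN/TransmitErrors", status.transmitErrorCount);
  }
}
```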

I mean, I guess, but it’s going to tell you the same information the protocol already self-reports, which again is way cheaper and less effort.

Honestly I don’t see a reason for any team to own an oscilloscope unless they are making their own active PCBs.

The Analog Discovery 2 I have is $279 with an education discount, and not all that much more without the discount. It allows teams to get more into the details of what is happening on the bus, if this is something they might benefit from – for example, a team with several students who mainly do the electrical work could use a scope to learn more about what is going on under the covers, even if they don’t need to troubleshoot in this way. And, there are cheaper options still.

I agree that most of the time, there’s little point in having this. But, there are times when it’s really quite helpful, in my experience.

It’s currently locked in our build room, but I should be able to get to it later this week. Thanks. I’ll post the log once I’m back in there and have access to the driver station PC.
