Massive Issues with CAN Networking at Week 0 Competition at Girls of Steel

Hello everyone, I’m a programming member from Team 117 who wanted to share our experience with SparkMax, NEO, and PDH issues we had yesterday.

The TX/RX Fault Errors

After various CAN errors appeared in our code, we opened REV Hardware Client to debug and test each Spark on the CAN bus. At first, the Sparks appeared fully functional (no faults, no firmware issues, and no issues with CAN data), but after plugging into each Spark over USB, we saw intermittent TX/RX faults across the CAN loop that could not be cleared and that appeared only on certain Sparks. The PDH also reported “CAN Warning” and “CAN Bus Off” faults for the CAN bus.

Testing Individual Spark Loops

The next step of the debugging process was to build a minimal CAN bus that started at the RIO, ran to a single Spark, and ended with a terminating resistor (not the PDH). At each stage, if no errors were present, we added another Spark to the CAN bus. This process let us find Sparks that had TX/RX faults on their own, and it also let us find bad wires between our Sparks. However, we also found that certain Sparks stopped having issues when renumbered (even though there were no duplicate IDs), and in one specific case, a Spark made others throw errors after it was plugged into a laptop through USB.

Conclusion

After hours of debugging with mentors from Girls of Steel (3504), we found these issues, and today we will try to solve them. If anyone has had a similar experience before we embark on another few hours of CAN debugging, we would greatly appreciate any feedback or words of wisdom.

Best,
Team 117

This is a thing. You need to be very careful with your CAN wiring and termination. You may also need to adjust your status frame rates. Watch your CAN utilization and errors in the Driver Station. If your CAN utilization is too high, you can have issues even if your wiring is great. Sub-optimal wiring can actually inflate your CAN utilization, because frames that hit errors get retransmitted.

Do a search here, there’s a lot that’s been written on this topic.

The wiring has been gone through and each run tested, and the motor controllers have been tested individually. What we cannot explain is why plugging a USB cord into one specific Spark causes errors in other ones, or why changing the ID number of a Spark changes whether it has errors. Four experienced mentors spent over 5 hours trying to explain the errors without success. CD was searched extensively for a similar problem with no luck.

Dumb question, but is the ID it’s being changed to when it has errors possibly overlapping with another ID already in use? If you had several mentors looking at it I’d assume this isn’t the issue, but it can’t hurt to double-check.

Errors following the CAN ID can be a symptom of too high utilization of your CAN bus. This is because the ID is part of the mechanism that handles prioritizing CAN frames. What does your CAN bus utilization look like?
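
To make the prioritization point concrete, here is a minimal Python sketch of bitwise CAN arbitration. The three IDs are hypothetical 29-bit values that differ only in their low bits, and the function name `arbitrate` is made up for illustration. It models only the arbitration rule, not a real bus.

```python
def arbitrate(ids, width=29):
    """Return the ID that wins arbitration among nodes that start sending together.

    ID bits go out most-significant first. A 0 bit (dominant) overrides a
    1 bit (recessive), and any node that sent a 1 but reads back a 0 backs off.
    """
    contenders = set(ids)
    for bit in range(width - 1, -1, -1):
        # The bus reads 0 if any remaining contender drives a 0 on this bit.
        bus_level = min((i >> bit) & 1 for i in contenders)
        contenders = {i for i in contenders if (i >> bit) & 1 == bus_level}
        if len(contenders) == 1:
            break
    return contenders.pop()


# Hypothetical frames: same upper bits, device numbers 17, 1, and 21.
BASE = 0x02051800
frames = [BASE | 17, BASE | 1, BASE | 21]
winner = arbitrate(frames)
print(hex(winner))
```

The winner is always the numerically lowest ID. On a lightly loaded bus that ordering barely matters, but as utilization climbs, frames with higher IDs wait longer, which is one plausible way errors could appear to follow an ID.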

As for the USB thing, my initial reaction is “don’t do that” if it is causing troubles. If your bus utilization is too high, you are going to see errors and it wouldn’t surprise me if the REV H/W client is causing extra traffic that could exacerbate things.

No

CAN bus utilization averages 55% at idle and jumps to 60% at most. No issues show up on the Driver Station.

I individually plugged into every Spark, noted the IDs, and found no duplicates.

While this is technically true, the device ID is the final 6 bits of the arbitration ID, meaning it is the last consideration for which message wins arbitration. Device type, manufacturer, and API class/index (manufacturer-specific identifiers) are all considered first.
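
For reference, here is a sketch of how the 29-bit arbitration ID breaks down, assuming the field widths from the FRC CAN addressing scheme (5-bit device type, 8-bit manufacturer, 6-bit API class, 4-bit API index, 6-bit device number). `FrcCanId`, `pack`, and `unpack` are hypothetical helper names, and the API class/index values are illustrative only.

```python
from typing import NamedTuple


class FrcCanId(NamedTuple):
    device_type: int    # bits 28..24
    manufacturer: int   # bits 23..16
    api_class: int      # bits 15..10
    api_index: int      # bits 9..6
    device_number: int  # bits 5..0, so valid values are 0..63


def pack(f: FrcCanId) -> int:
    """Combine the five fields into one 29-bit arbitration ID."""
    return ((f.device_type << 24) | (f.manufacturer << 16)
            | (f.api_class << 10) | (f.api_index << 6) | f.device_number)


def unpack(arb_id: int) -> FrcCanId:
    """Split a 29-bit arbitration ID back into its fields."""
    return FrcCanId((arb_id >> 24) & 0x1F, (arb_id >> 16) & 0xFF,
                    (arb_id >> 10) & 0x3F, (arb_id >> 6) & 0x0F,
                    arb_id & 0x3F)


# Assumed codes: device type 2 = motor controller, manufacturer 5 = REV.
spark_1 = pack(FrcCanId(2, 5, 6, 0, 1))
spark_17 = pack(FrcCanId(2, 5, 6, 0, 17))
# Same devices, but a frame with a lower (illustrative) API class from device 21.
other_api_21 = pack(FrcCanId(2, 5, 5, 0, 21))

print(hex(spark_1), hex(spark_17), hex(other_api_21))
```

Lower values win arbitration, so `other_api_21` beats `spark_1` despite the larger device number. The device number only breaks ties between frames whose type, manufacturer, and API fields all match, which is exactly the case when many identical controllers send the same kind of frame.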

Yes, but when you have N devices of the same type from the same manufacturer on the bus, it can (and does) come into play. It would be interesting to know which IDs are in use, and which tend to have the worst problems.

Not sure if “no issues with firmware” means everything is on the latest versions or not. But it’s worth double-checking that you have latest REV H/W Client and latest firmware files downloaded and on the devices (Power Distribution, Pneumatic, SPARK MAX/Flex). It sounds as though you have all REV devices on the CAN bus, but if not, that would be good to know. Also, be sure you are current on REVLib on the coding side.

55-60% should be OK, but it might be worth adjusting the status frame periods even so – it may well help, and shouldn’t have any downside if you pay attention to what you are doing here.
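
As a rough illustration of why status frame periods matter, the sketch below estimates how much of a 1 Mbit/s bus is consumed by periodic status frames alone. Everything here is an assumption for illustration: the frame size ignores stuff bits, the default periods should be checked against the REV documentation for your firmware, the slowed-down periods are arbitrary, and the device count of 10 is hypothetical.

```python
BITRATE = 1_000_000   # FRC CAN bus bit rate in bits per second
BITS_PER_FRAME = 131  # extended frame with 8 data bytes plus interframe space, no stuff bits

# Assumed default periods (ms) for status frames 0..6 on one motor controller.
DEFAULT_PERIODS_MS = [10, 20, 20, 50, 20, 200, 200]
# Arbitrary example of slowing down the frames a team might not read.
SLOWED_PERIODS_MS = [10, 20, 100, 500, 500, 500, 500]


def status_utilization(num_devices, periods_ms):
    """Fraction of bus bandwidth used by periodic status frames only."""
    frames_per_sec = num_devices * sum(1000 / p for p in periods_ms)
    return frames_per_sec * BITS_PER_FRAME / BITRATE


baseline = status_utilization(10, DEFAULT_PERIODS_MS)
slowed = status_utilization(10, SLOWED_PERIODS_MS)
print(f"status frames: {baseline:.1%} before, {slowed:.1%} after")
```

This counts only the frames the controllers send on their own schedule. Command frames from the roboRIO, other devices, and any retransmissions add to it, so the number shown in the Driver Station will be higher.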

Pictures of your wiring could be helpful – if the problem lies there, it will be hard to tell without the information these could provide. If you have access to an oscilloscope, that data could be very helpful.

What is the largest number you are trying to set the CAN IDs to? Setting it to large numbers has been a source of issues in the past.

The highest ID originally was 17, but we renumbered a Spark to 21 to test how different IDs affect the loop.

All of the firmware and the REV Hardware Client are up to date. We have a ToF sensor from Playing With Fusion, but we removed it temporarily from the CAN bus for debugging purposes. I can send pictures as soon as possible if it’s helpful.

2656 was also at the Pittsburgh Week Zero, right across the aisle from 117. We also had some wild CAN issues yesterday.

During the last hour we were there, Spark MAX 01 had stopped responding to code. It was getting power, but no signal, despite having been working less than an hour prior.

After that, we connected our PDH to the REV Hardware Client to see 4 sticky faults (Brownout, CAN warning, CAN Bus OFF, and Has Reset). The PDH is one of the newest-model units.

This is way out of normal for us. We haven’t had real CAN issues in years. We’re going to double-check all of our wires again, and try resetting the PDH.

I’ll be following this thread to see how 117 fares in their repairs, since at least part of our issues sound so similar.

CAN IDs for reference:

1-4 are drive motors
5-6 are climbers
7-8 are intake and pivot
9-10 are shooter motors

From what I’ve read, this seems like a bug on REV’s side. Never have I ever seen CAN behave this weirdly. Has anyone reached out to them to see if it was an issue on their side? I know in ’23 they had some CAN issues that were fixed by a firmware update.

I’ll be one of the CSAs (not sure how many are assigned tbh) at GPR week 1. If you’re still having issues, come find me and I’ll try to help troubleshoot. If you can’t find me, other CSAs may be helpful, or you can go to Pit Admin and ask for CSA “Fletch”.

Something on the REV stuff is possible, but I’ve generally found that things work well if the wiring is good and the bus utilization isn’t too high. I have not heard of widespread problems this year, but it’s that point in the season where it’s hard to make inferences just based on the volume of reports.

In general, I’ve seen this type of issue in years past though, so it’s probably worth keeping the information coming and seeing where things go. In the past at least, this has usually not been bugs in F/W or REVLib. There were late changes this year, so who knows?

It’s not a bad idea to ask REV, but I’d do that in parallel. They are going to need the same type of information, and may be able to see it here. If not, collecting everything here will probably be helpful. “Everything” is going to include:

  • All devices on the CAN bus;
  • CAN bus topology (bus, star, etc.);
  • What’s terminating the bus, and where;
  • IDs for devices, including which devices are having trouble;
  • Results of controlled experiments;
  • CAN-related diagnostics from driver’s station, and maybe from REV H/W Client (utilization, errors, etc.).

I know all the questions are annoying, so sorry for this.

The information was sent to REV yesterday morning. We’re awaiting their response.

My guess is still that CAN utilization is too high for the signal integrity of your CAN bus, but it can be hard to diagnose things remotely. In any case, please let us know where this goes.