Canbus utilization, CanCoders, “CAN frame not received/too-stale”

My team switched from normal mag encoders to CTRE CANCoders, which took us from 17 CAN devices to 21. Since the change, we are getting “CAN frame not received/too-stale” errors, our CAN utilization is 80-100%, and TX Full increments consistently. To rule out a physical electrical problem, I tested the CAN bus incrementally and only got “CAN frame not received/too-stale” once I reached about 90% of the whole CAN loop, where utilization was starting to hit 100% and TX Full started to increment. The errors got more frequent as I added the last few devices. I also inspected the CAN bus and saw no physical damage. Furthermore, the overall resistance of the CAN loop from roboRIO to PDH is ~6 ohm for CAN low and CAN high.

I was under the impression that the CAN bus should be able to support 32 devices. I am aware that with these CANCoders we are doing a lot more costly reads on the bus, but is the only solution here to decrease the number of CAN devices? Or is there a different underlying problem?

Thanks for any help.

I suspect you have a problem with your code. Start commenting out devices individually and see what is causing the bottleneck. We’ve had to do this a couple times and found stray PID sets and other items that create a large increase in the Can traffic. You can also check with default code that just declares the devices on the system and nothing else to verify it’s a code issue.

Hi Tom,
I’m the software guy. When we shorten the CAN bus (exclude a few motors and CANCoders), our utilization goes down and we don’t have TX errors. For some reason, however, we only seem to get error -3 “CAN frame not received/too-stale” for the CANCoders, which is strange considering that I’m getting and setting values for ~8 Falcons consistently every 20 milliseconds. I think it’s a code issue, but I’m not sure how I could reduce the number of values I get and set over CAN.
Thanks!

In our experience, you’re going to start seeing problems when you go above 85% utilization. But, just by adding devices to your bus and using a set once per loop you shouldn’t be seeing near that. We have more devices on the can bus than you - 23, and we reside around 70-75%.

I suspect you’re doing too much in code.

You need to go through your code and methodically eliminate which area is causing the problem. Then you can zero in on the root cause. Whether that’s by removing references in code, or by commenting sections out.

In our case (Labview) we would put disable structures around blocks of code until the utilization dropped, then go after whatever we disabled last.

So I first methodically commented out every place where I was setting or getting values from any device on the CAN bus. This did not change anything: the utilization just stayed at 70%-100% and the TX Full count still incremented.

I then deployed the TimedRobot template code, and the utilization went from oscillating between 70% and 100% to oscillating between 50% and 100%, but now there are no more TX errors.

I also tried only initializing all of my TalonFXs in the Robot.h class of the blank robot code project. That brought the utilization back up to 70%-100%, and the TX Full count started to increment again at around the same rate. I did not call any function on the WPI_TalonFX objects.

Unfortunately the language gap prevents me from going any further - I haven’t programmed a robot in anything but LabVIEW.

Perhaps other people using your language can comment on their Canbus utilization.

Have you read this page that discusses how often each device sends data and how to change that rate? Common Device API — Phoenix documentation

I wouldn’t be surprised to see more problems with CANCoders and other sensors instead of motor controllers given that the FRC CAN specification gives more priority to motor controllers.

CAN bus bandwidth is a fixed/limited resource. If utilization is high, expect to see errors of this type.

You can manage bus utilization in a couple of ways – everything on the bus is using some of this resource. Each device uses some to periodically send packets of information; most of these have settings to allow this usage to be adjusted. The roboRIO uses this resource; you can make changes in your code to influence how much.
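To put rough numbers on the "fixed resource" point, here is a back-of-the-envelope sketch in C++ (the language the original poster's team uses). It assumes the nominal 1 Mbit/s FRC CAN bit rate and roughly 131 bits on the wire per extended frame with 8 data bytes, ignoring bit stuffing, so real usage will be somewhat higher. The function names and the example frame periods are illustrative assumptions, not taken from any vendor library or measured from this robot.

```cpp
// Rough CAN utilization estimate. All names here are hypothetical helpers.

// Nominal FRC CAN bit rate (1 Mbit/s).
constexpr double kBusBitsPerSecond = 1.0e6;

// Approximate size of one extended (29-bit ID) frame carrying 8 data bytes,
// including interframe space but NOT counting stuff bits (an assumption that
// makes this estimate optimistic).
constexpr double kApproxBitsPerFrame = 131.0;

// Frames per second produced by `deviceCount` devices that each send one
// frame every `periodMs` milliseconds.
constexpr double framesPerSecond(int deviceCount, double periodMs) {
    return deviceCount * 1000.0 / periodMs;
}

// Percentage of the bus consumed by a given total frame rate.
constexpr double utilizationPercent(double totalFramesPerSecond) {
    return 100.0 * totalFramesPerSecond * kApproxBitsPerFrame / kBusBitsPerSecond;
}

// Illustrative only: 17 motor controllers, each sending one status frame
// every 10 ms and another every 20 ms (made-up periods for the example).
constexpr double kExampleUtilization =
    utilizationPercent(framesPerSecond(17, 10.0) + framesPerSecond(17, 20.0));
```

With those made-up numbers, the device status frames alone come to about a third of the bus before the roboRIO sends a single control frame. Doubling both periods halves that figure, which is why adjusting frame periods is usually the first knob to try.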

For most teams, it’s hard to see where the bus bandwidth is going. This is why it can help to disable devices in your S/W and/or take them off the bus – it lets you reason about how much bandwidth each device is using so you can focus on the ones that are using the most. You can also influence which devices do not get to send data (to an extent) by the CAN IDs you assign. These essentially prioritize some devices ahead of others. You can also generally do different things to help handle occasional errors of this type, in your code.

It might be helpful to list everything you have on your CAN bus and describe your code in more detail (language, what types of control you are doing, etc.).

thanks for the info,

We have 17 Falcon 500s, 4 CANCoders, a REV PDH (2021), a REV PCM (2021), and a roboRIO 2 using v4 software with C++, on an MK4 swerve with 8 Falcons.

Today we did some more testing, and even when uploading a blank file we were still hitting 100%. It oscillates between ~60% and 100%. Even after disconnecting the CANCoders (whose addition is when the issue became apparent), the rest of our system still oscillated up to 100%.

Some tests/ data we did (each % is the max value we saw):

1 falcon on can: ~7% utilization, however it was still oscillating and when we filmed the utilization on driver station in slow mo, we saw the % spike to 17% !

2 different swerve units w/o cancoder: ~32 max% (same oscillating behavior)

everything BUT drive base (excluding: 8 falcon 500s and 4 canCoders): 56 max% for just 9 motors! (same oscillating behavior)

Again, this was all with a blank code file.

Since CAN works when motors are powered off, I started pulling breakers to see if there was somehow EM interference somewhere, but there was no change, just a linear decrease in utilization as devices lost power. The usage still oscillated in a way that matched our tests above.

After seeing this troubling data we tried a robo rio 1 with older firmware and got a promising result where we saw 76% utilization for the entire robot and no crazy oscillation to 100%. Unfortunately the regulations require the newest firmware:

Rule update 8:
(https://firstfrc.blob.core.windows.net/frc2022/Manual/TeamUpdates/TeamUpdates-combined.pdf)

so we updated the roboRio 1 hoping it was somehow a roboRio2 issue. But after updating - the roboRio1 also got the oscillating, high can utilization…We made sure our REV firmware and ctre devices were up to date and still no change.

At this point, I feel like we’ve figured out that all the devices are contributing to this high utilization. They all show this spiky behavior of oscillating between a reasonable CAN usage and a high CAN usage, and it seems to be caused by some firmware issue rather than an electrical issue…

Would it be possible for you to explain more about "You can also influence which devices do not get to send data (to an extent) by the CAN IDs you assign."? I understand that the lower IDs might get priority, but is there a way to reduce the amount of data a device sends?

Another datapoint is how we can see CanBus packets through the REV hardware client. Is there a way to understand those to help troubleshoot/ diagnose?

Thanks, sorry for delayed response.

The oscillation is normal - I’m not sure what they changed since the last time I stared at CanBus (yeah covid!) but we’ve seen that 30% oscillation constantly as well. However, we have the same number of devices on the network as you and aren’t seeing the high utilization. Do you have ANY other canbus calls other than motor sets in your code? And you are certain that you are only setting those 21 etc devices once every 20 ms?

The noise in the CAN utilization in the DS is described in the known issues here: Known Issues — FIRST Robotics Competition documentation

I think there may be something else going on, but one thing you can try is to run the PID loops for swerve turning onboard the Falcons, homing the position off the CANCoder at the start of the PID loop. See our code from this year for an example. Also, it might be worthwhile for you to share your code so we can see if there’s anything unusual in there like what @Tom_Line mentioned.

At any point in time, the CAN bus is either being utilized, or it is not. So, the percent data is based on sampling and, depending on the sampling interval, is expected to be somewhat noisy (even without the issue linked above). You are getting CAN-related errors, and you have a lot of devices – utilization is clearly an issue here. You can use the data in the graph to track changes, by comparing this data as you perform experiments. Also keep track of CAN errors.

This might be a good use case for a CANivore, if you are in a position to throw hardware at the problem. This device is new, and I have zero experience with it, so I don’t want to make any promises here. This gives you a second CAN bus, and if you put the right devices on it, it supports something called CAN-FD that should effectively give you more bus bandwidth on the second bus.

Of course, be sure your CAN wiring is very solid with this number of devices (24, counting the roboRIO). The other H/W-side option might be to take some devices off of CAN, but it looks as though that would require you to use different motors (or steering angle sensors). I’m guessing that this isn’t an attractive option for you.

This brings it around to S/W and settings to help manage bus utilization. It makes sense to focus on the Falcons and CANcoders. This is a description of what is going on here on the device (non-roboRIO) side. You want to lower the rate (increase the period) for some of these periodic CAN frames being sent by the devices. In doing this, you will want to think about what each motor is doing on your robot, whether it is being followed or is a follower, and how it is being controlled (onboard PID, offboard PID, offboard voltage, etc.). I’d bet you can drop the frame rate on many of your motors. You might be able to cut the rate for the CANcoders in half, but I’d focus on motors initially. If a motor controller resets (severe brownout, etc.), you’ll want to also read over this.

The other thing you can do on the S/W side is to think about how often you are causing the roboRIO to send CAN frames to each individual motor controller. You can try things like only updating some motor controllers every Nth iteration of the main robot loop. Run experiments to measure the effect of S/W changes along these lines, then keep the ones that help and don’t hurt robot performance.
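A minimal sketch of the "every Nth iteration" idea, in C++. The helper names are hypothetical, and nothing here calls a vendor API. It only decides which devices get a new command on a given loop, staggering them by index so the updates are spread across loops instead of bunching up on one. Whether this actually reduces bus traffic depends on how the vendor library schedules its own control frames, so treat it as something to measure, not a guaranteed fix.

```cpp
// True when the device in slot `deviceIndex` should receive a new command on
// loop number `loopCount`, given that it is updated once every `divisor`
// loops. A divisor of 0 or 1 means "update every loop".
constexpr bool shouldUpdate(unsigned loopCount, unsigned deviceIndex,
                            unsigned divisor) {
    return divisor <= 1 || (loopCount % divisor) == (deviceIndex % divisor);
}

// Number of devices (out of `deviceCount`) that would be updated on a given
// loop. Handy for checking that the stagger spreads the work evenly.
constexpr unsigned updatesInLoop(unsigned loopCount, unsigned deviceCount,
                                 unsigned divisor) {
    unsigned updates = 0;
    for (unsigned i = 0; i < deviceCount; ++i) {
        if (shouldUpdate(loopCount, i, divisor)) {
            ++updates;
        }
    }
    return updates;
}
```

In a periodic function you would keep a loop counter and wrap each non-critical motor's set call in `if (shouldUpdate(loopCount, index, 4))`, leaving drive motors on a divisor of 1. With 8 devices and a divisor of 4, each loop touches 2 devices instead of 8.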

I don’t know what to say about the finding on roboRIO F/W. It could be that there’s something different in how the CAN bus utilization metric is computed. It could also be that this is making the roboRIO use way more CAN bandwidth, which seems like a stretch, though it is certainly possible. It might be worth trying to monitor the CAN bus utilization in other ways (for example, plug a USB cable into one of the REV devices and use the REV Hardware Client to watch this).

Good luck! Let us know how things go.

Thanks for the info,

All these tests were with a blank code file, i.e. no PIDs, no nothing. We will try your code, but it seems to be a non-code issue. I’ll talk to my software guy about sharing our code when I get a chance.

Oh wow. Alright, that shouldn’t make a difference then. Sorry I didn’t catch that!

Thanks for all of this useful info,

We will increase the period for some of the non-essential motors, and I’ll update on how that goes when I get to the shop.

On the S/W side: I still find it weird that with a blank file we are at 100% utilization. I understand that the motors are sending data over CAN, and hopefully increasing the period will fix this.

Using the rev client to see CAN, we see addresses at random sending new messages pretty frequently even with our blank code file. One of our mentors thinks that behavior is pretty normal but we aren’t sure. Do you know of any documentation that would help us decode these can messages in the rev hardware client?

Thanks

Yes, this behavior seems normal from your description.

CAN messages have a header (see here) and payload data. The payload data isn’t really publicly documented, generally speaking. The header is helpful, but the best way to get a feel for things is probably to watch a single CAN device at a time.
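For anyone staring at raw IDs in a bus viewer, here is a small C++ sketch of splitting a 29-bit identifier into its header fields. It assumes the field layout from the published FRC CAN device specification as I understand it (device type in the top 5 bits, then an 8-bit manufacturer code, a 6-bit API class, a 4-bit API index, and a 6-bit device number), and it assumes your tool shows the raw arbitration ID. The struct and function names are hypothetical.

```cpp
#include <cstdint>

// Header fields of an FRC-style 29-bit CAN identifier (assumed layout).
struct FrcCanId {
    std::uint8_t deviceType;    // bits 28..24, e.g. motor controller vs. sensor
    std::uint8_t manufacturer;  // bits 23..16, vendor code
    std::uint8_t apiClass;      // bits 15..10, vendor-defined message group
    std::uint8_t apiIndex;      // bits 9..6, vendor-defined message within group
    std::uint8_t deviceNumber;  // bits 5..0, the CAN ID you assign (0..63)
};

// Splits a raw 29-bit arbitration ID into its fields.
constexpr FrcCanId decodeFrcCanId(std::uint32_t rawId) {
    return FrcCanId{
        static_cast<std::uint8_t>((rawId >> 24) & 0x1F),
        static_cast<std::uint8_t>((rawId >> 16) & 0xFF),
        static_cast<std::uint8_t>((rawId >> 10) & 0x3F),
        static_cast<std::uint8_t>((rawId >> 6) & 0x0F),
        static_cast<std::uint8_t>(rawId & 0x3F)};
}
```

Grouping captured frames by `deviceNumber` and `apiClass`/`apiIndex` and counting how often each combination appears gives a rough picture of which devices and which periodic frames are eating the bus, even without knowing what the payload bytes mean.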

I am another programmer on the team. We set the frame periods on all talons to the max length (and continually check for resets in robot periodic so we can re-set our settings). This reduces the TX Full errors a small amount (maybe 50-100 instead of 200-300), but that is still quite high. Our utilization is as high as ever too (>85%). We also get “Watchdog not fed” errors, and other various CAN timeout errors, even with only one talon initialized. Notably when we deploy an empty code file the TX Full errors and Watchdog not fed errors go away, but the utilization is still quite high (perhaps because of payload data).

What is your roboRIO CPU utilization? High CAN bus utilization and high CPU utilization together create havoc: they produce these error messages, which in turn increase CPU load, and that creates a vicious circle.

In fact, I just went through tonight and combined parallel loops in code, lengthened status frames, and slowed loops. Now we’re back at 75-80% cpu utilization and no errors.

Our CPU utilization stays around 50% even with all talons enabled, so I don’t think that’s the problem. The only logic in our test code is the logic to set the frame periods to be long if the talons reset, so I don’t think there’s any issues there.