CAN bus freezes on code initialization

Hello,
I’m currently working on our team’s swerve software; however, our CAN bus is occasionally (~30% of the time?) freezing when the code initializes.

When this occurs, the driver station repeatedly lists CTR Error -3, CTR Error -200, and CAN frame stale for all 8 Falcons and the Pigeon. Additionally, the driver station reports a constant CAN utilization rather than a fluctuating value; this shows up in the logs as a horizontal plot. The value is not always 100%. The driver station does not report any faults. The Falcons and Pigeon all continue to blink orange as per usual. Phoenix Tuner shows all devices as grey.

I’m not entirely certain what changes prompted this issue, but it is far more common in my recent work than my older work. Previously I had ignored it as an erroneous occasional deploy error, but now it clearly isn’t.

Additionally, the dashboard typically reports a couple of loop overruns on initialization. I assume that is normal.

Here is the link to the repository (view the “feature-checking-module-velocity-tuning” branch… I apologize for the naming conventions throughout):

Also if you’d like to see the old code which produces errors less often, view the “Dev” branch.

Finally, the error also occasionally shows up as only 2x -200 errors for a seemingly random pair of configuration calls, and then the code runs as expected from then on. I’m inclined to believe that maybe I’m just using the CAN bus a crazy amount on initialization, and should limit that by putting motor initialization in threads and having periodic calls set variables which are then read by other functions, rather than calling for the motor’s sensor values and such multiple times per iteration (that is my assumption of what it does). I won’t have a chance to try anything until Monday though, so I figured I’d ask CD to see if anyone had answers.

Does the robot stay enabled when this happens? Anything unusual in the driver station (check all the various tabs)? What do you have to do in order to get things working again? Can you ssh into the roboRIO when this has just happened?

FWIW, a good way to initialize motors is to do this via a state machine, one config call per iteration of Periodic(). You can also check to see if the controller has rebooted, and restart the state machine if this occurs. From your description, you may be maxing out CAN bandwidth, triggering errors on some of the configuration calls. Once you start hitting error paths, odds of something unusual happening are higher.
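A minimal sketch of that state-machine pattern, with a hypothetical `ConfigStep` interface standing in for the actual Phoenix config calls (on a real robot, each step would wrap something like a `configAllSettings(...)` call and return whether its error code was OK):

```java
import java.util.List;

/** One configuration step; returns true on success (a stand-in for checking
 *  the ErrorCode returned by a Phoenix config call). */
interface ConfigStep {
    boolean apply();
}

/** Runs one config step per call to run(), so a long series of blocking
 *  config calls is spread across many periodic() iterations. */
class ConfigStateMachine {
    private final List<ConfigStep> steps;
    private int index = 0;

    ConfigStateMachine(List<ConfigStep> steps) {
        this.steps = steps;
    }

    /** Call once per periodic() iteration. Returns true once every step has succeeded. */
    boolean run() {
        if (index >= steps.size()) {
            return true; // all configuration already applied
        }
        if (steps.get(index).apply()) {
            index++; // advance only on success, so a failed step is retried next loop
        }
        return index >= steps.size();
    }

    /** Restart from the beginning, e.g. after detecting that a controller rebooted. */
    void reset() {
        index = 0;
    }
}
```

Calling `reset()` when you notice a device has rebooted (CTRE exposes a sticky “has reset” flag for this) re-runs the whole config sequence without blocking any single loop iteration for long.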

Using multiple threads can also increase the odds of something going wrong, unless you have good concurrency control – your code uses multiple threads, but it looks as though these are used only for a couple of specific things. So, you are probably OK here, although I only had a quick look at things. However, the fact that you were motivated to use threads in this way may itself be an indication of some type of problem.

Also, make sure you have the latest set of 2022 S/W updates all around – if this is 2023 beta, you should probably report your issue as part of the beta activity.

Can you share any Driver Station logs?

I agree that the use of threads has a smell. Can we be sure that everything we are calling in the threads is thread-safe? If not, that could cause problems that did not reproduce consistently.

I notice you’re passing in timeoutMs to various methods (e.g. WPI_Pigeon2.configAllSettings). In addition to making your code block, this also makes the methods return an error code. You could try checking that error code.
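As a sketch of what checking that return value might look like, here the `IntSupplier` is a stand-in for a real Phoenix call such as `configAllSettings(settings, timeoutMs)` that returns an error code (0 = OK); the retry count and logging are my own additions:

```java
import java.util.function.IntSupplier;

/** Retries a blocking config call a few times, logging any non-zero error code.
 *  The IntSupplier is a stand-in for a Phoenix call such as
 *  pigeon.configAllSettings(settings, timeoutMs), which returns an ErrorCode
 *  (0 meaning OK). */
final class ConfigUtil {
    static boolean configWithRetry(String deviceName, IntSupplier configCall, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            int code = configCall.getAsInt(); // blocks until the device acks or the timeout expires
            if (code == 0) {
                return true; // success
            }
            System.err.println(deviceName + " config failed with code " + code
                    + " (attempt " + attempt + "/" + maxAttempts + ")");
        }
        return false; // caller decides what to do next: retry later, alert the driver, etc.
    }
}
```

Even just logging the failures would tell you which specific config calls are losing their frames.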

I’m unsure if the robot can be enabled… I’ll check Monday. I didn’t actually try it. I would assume it would instantly disable, but I don’t actually know.

The driver station reports a constant CAN utilization rather than a fluctuating one. It doesn’t report any CAN errors or anything, though. Connections all appear ordinary.

To make things work again I just redeploy the code and often it works just fine.

I’m not familiar with SSH. I’ve viewed the link, but what would SSH give me access to?

I think it would make sense as a solution to spread out my calls across several loop iterations; the implementation wouldn’t be too difficult… Is this ordinarily an issue, though? Most people get away just fine with this many CAN devices, right?

The threads were originally my teammate’s code, used to initialize a NavX, which makes sense because you don’t want a NavX pausing your code for a couple seconds. We aren’t using a NavX anymore, so I shouldn’t need it. And I threw in the thread on the swerve module class because it seemed to fix some other initialization problem, which I’m now realizing I should probably figure out the roots of. I’ll get rid of the threads and see what that yields on Monday.

I have everything up to date, I believe. I’ll double check, but I know for sure I’m using whatever the latest 2022 WPILib was, the latest TalonFX/Pigeon firmware, and the latest CTRE library.

Thanks for the input, it’s definitely given me some directions to look in.

I’ll grab logs, assuming it’s not an easy fix Monday.

Gonna take threads out as explained above. I’m not entirely sure that it would have been fine, but the only thing I think it could have messed up was odometry because that is the only aspect of the threads that involved reading/writing anything to memory that may have been read/written to in the main thread. The rest was just sending configs (and I guess receiving acks from CAN).

In addition to removing threads, I’ll try catching errors… I’ve never written anything to catch/output errors before, but I figure it isn’t too confusing.

Thanks for the input!

Trying ssh is just a test to see how healthy the roboRIO is at the time, although you can poke around and look for things from the prompt, if you are familiar with Linux.

There are a couple of issues which can come up with CAN – one is bus utilization, which is a function of the number of devices and how much messaging you are doing to these. More config work can result in more messages, but the config work is often done serially, so that the number of messages per unit time doesn’t go up so much. But, a bunch of serially executed config calls can add up to blocking things for a lot of time. The CAN utilization graph is normally very noisy, so it is interesting that it goes flat. See what happens with no extra threads and let us know…

First deploy without threads: (screenshot)

Second deploy without threads: (screenshot)

log files:
https://drive.google.com/drive/folders/1BVrpuWKIe8CxzON2buZb7WrNHBuHCFDQ?usp=share_link

So it appears to be doing the thing where the CAN bus does not freeze, but two specific configuration calls fail. Not sure if the same calls consistently fail.

Phoenix Tuner: (screenshot)

It appears to be a seemingly random pair of configuration failures…
Also, it does allow me to enable the robot and drive in this failure case.

This actually seems like good progress to me – no more mysterious “CAN hang” symptoms, correct?

However, you are the only one who can really say if this is progress or not – and you may need to do a lot of testing to be confident.

If this is progress, the remaining issue would be to get rid of the errors, unless I’m missing something – again, you are in the best position to say here.

I’ll take a look at the log files you provided, in a little while. It would help to explain how and when you are doing the config work. For example, everything is done at start up, in constructors. Or, such-and-such is configured each time robotPeriodic() runs. Since you linked the code, we can figure this out – but it’s probably faster for you to say, if you know. If you didn’t write the original code, you may not… but it’s good to figure this out in any case.

Thanks. When sharing logfiles, it’s useful to also share the corresponding event file.

Looking at the screenshots, you seem to be experiencing loop overruns in Swerve_drive.periodic(). It’s hard to tell from inspection exactly where the time is going, but I’d guess the top suspects are gyro.getYaw(), SwerveModule.getDriveVelocity(), and SwerveModule.getTurningPosition().

As an experiment, you could try replacing those three calls with zero (one by one), and see if that makes the timing problems go away. That will give us a more specific problem to focus on.

You’re also seeing some spikes in LiveWindow.updateValues(), although that might be just a transient issue. You can disable that by calling LiveWindow.disableAllTelemetry(). (This will be the default in 2023.)

It’s progress of some sort, although I don’t know what is causing the issue I’m seeing now.

I have all of my initialization occurring in the constructors of my subsystems. That being said, my swerve module constructors are being called from my swerve subsystem before my swerve subsystem is fully constructed, and my swerve subsystem constructor is being called from my robot container before my robot container is constructed. I don’t know enough about Java to know what that might imply.

I’m then getting information from most devices through methods which are called periodically, often from multiple files each periodic loop iteration.
I’m not doing any periodic configuration.

@bovlb I’ll get the event file either this afternoon or tomorrow. I didn’t know what was stored in each file. Same deal with replacing those calls and disabling telemetry.

Thanks!

Just by way of background, there are basically three types of interaction with a CAN device in FRC:

  1. Retrieving information returned in the latest periodic status frame (there may be more than one type of such frame sent, these flow from the CAN device to the roboRIO and you can normally set the rate/period at which each type of these are sent);

  2. Specifying some value that will be sent from the roboRIO to the CAN device, in the next periodic frame that is sent in this direction (you can also sometimes specify this rate/period, which is independent of the rates/periods specified for frames sent from the CAN device);

  3. Request/response flows, in which some non-periodic frame is sent to a CAN device and then the code waits for a response to the specific request which was sent.

All of these frames use up CAN bus bandwidth. Frames may be dropped, if the bus is out of bandwidth at any particular point in time. However, the third type also blocks execution of the code while things wait – or until a timeout expires and the wait is given up. Electrical/wiring issues on the CAN bus can also result in frames being dropped; sometimes it is not obvious when this is happening (it may not be an all-or-nothing type of problem).

This timer expiry is normally reported as an error. In particular, there is no way to know if the device received the request or not, and if it did, what response it may have attempted to send. Even if there is no error, it takes time for the CAN frames to flow back and forth. If there is an error, it takes time for the timer to expire. During this time, execution is blocked. When this time accumulates past a threshold, you will normally get loop overrun errors. However, loop overrun only applies in the various “periodic” routines you supply as part of your code (this includes when routines provided as part of commands are being run).

Ideally, one does not do much of the request/response (case 3) activity, except as part of initialization. Or, if you do need to do a lot of this, try to manage things so that the chances of a loop overrun are minimal. If the routine you are using to interact with a CAN device has a provision for returning a possible error, it is a good bet that the routine is case 3. Configuration calls often fall into this category. So, you have to be careful about how you handle configuration.

That’s a really great summary. Since you bring it up, I note that I personally find it very difficult to tell the difference between (1) and (3) at the client end. If I ask for the velocity, will I get the last one received, or will I cause a request/response flow? If the vendor provides two ways to ask for the velocity, do they have the same behaviour? The documentation is seldom clear.

A lot of the time I assume I’m dealing with (1), but then I find that reading values from CAN devices causes loop overruns, so I start to suspect I’m dealing with (3).
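One crude way to tell the two apart at the client end is to time the call: a case-1 read of a cached status frame should take microseconds, while a case-3 request/response exchange (or a timeout) takes milliseconds. A minimal helper, with the threshold left to your judgment:

```java
/** Times a single call, to help distinguish a cached status-frame read
 *  (case 1, typically microseconds) from a blocking request/response
 *  exchange or timeout (case 3, typically milliseconds or more). */
final class CallTimer {
    static double elapsedMillis(Runnable call) {
        long start = System.nanoTime();
        call.run();
        return (System.nanoTime() - start) / 1.0e6; // convert ns to ms
    }
}
```

For example, logging `CallTimer.elapsedMillis(() -> motor.getSelectedSensorVelocity())` once at startup and once mid-match would show whether a getter that looks like case 1 is actually costing milliseconds.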

If the data is in a status frame (CTRE, REV), it’s more than likely case 1. Most/all configuration parameters for both REV and CTRE are case 3 – if it returns an error code, or has a timeout parameter, it’s probably case 3.

Right. I know velocity is in the status frame, so I assume it’s case 1. Then I find that calling getVelocity() can cause loop overruns, so what am I to think? With Rev’s getLastError() method (for example), any method can “return an error”, so that heuristic doesn’t narrow it down.

Here are the event files to match the log files from before:
https://drive.google.com/drive/folders/1BVrpuWKIe8CxzON2buZb7WrNHBuHCFDQ?usp=share_link

Some of the warnings should go away if you follow the above advice. It would be interesting to see how many of these go away at that point. You are also doing similar things in debugOdometryPeriodic(), so that might be another thing to comment out – just as a test, with nothing else commented out at the time.

It looks as though there could be multiple paths which wind up calling routines such as getStates(). Some of these calls could be expensive, taking enough time that they add up to a loop overrun when made too many times. I do not know if this is a problem – it may well be that the values being read are stored/cached in the TalonFX object, or are fast – but one pattern that is useful for ensuring things are read a minimal number of times (and that the same value is used throughout any given iteration of a “periodic” loop) is to first read everything you are going to use into member variables. Then you can use them all you like, without risk of making another possibly expensive call to read them. Finally, you do any “set” type operations, only once per “periodic” loop. This pattern is sometimes called “sense, think, act”.
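A minimal sketch of that “sense, think, act” caching pattern, with `DoubleSupplier`s standing in for the real sensor getters (e.g. `getSelectedSensorVelocity()`), which a real implementation would call directly:

```java
import java.util.function.DoubleSupplier;

/** "Sense, think, act": read each device value exactly once per loop into a
 *  member variable, then let the rest of the iteration use the cached copy.
 *  The DoubleSuppliers are stand-ins for real getters such as
 *  motor.getSelectedSensorVelocity(). */
class SwerveModuleReadings {
    private final DoubleSupplier driveVelocitySource;
    private final DoubleSupplier turningPositionSource;
    private double driveVelocity;
    private double turningPosition;

    SwerveModuleReadings(DoubleSupplier driveVelocitySource, DoubleSupplier turningPositionSource) {
        this.driveVelocitySource = driveVelocitySource;
        this.turningPositionSource = turningPositionSource;
    }

    /** "Sense": call once at the top of periodic(). */
    void refresh() {
        driveVelocity = driveVelocitySource.getAsDouble();
        turningPosition = turningPositionSource.getAsDouble();
    }

    /** "Think"/"act" code can read the cached values as often as it likes, for free. */
    double getDriveVelocity() { return driveVelocity; }
    double getTurningPosition() { return turningPosition; }
}
```

Besides bounding the per-loop cost, this guarantees every consumer in one iteration sees the same snapshot of the sensors.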

I concur with what @bovlb said before. I think you will have to try to find the part of your periodic loops which are taking too much time and causing loop overruns. It could be that it’s something you are doing which just adds up when you do it for eight drivetrain motors, so it might not be a single thing. Since it happens only on some of the loops, I’d say CAN bus utilization could be a factor here. So, I think if it were me, I might try to bring this down.

If you are doing expensive operations a lot, this can drive up bus utilization. But there are two things I’d recommend doing regardless:

For periodic status frame rates, you can search here for advice other teams have offered. If you do not have a follower and are doing control on the Falcons, you might be able to drop these down pretty far. But it might not take much to clear things up. The first step is understanding what these settings are doing, then you will have ideas to try. To start out, as a test, you can take the defaults and just double the period / halve the rate.
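As a sketch of that “double the period” experiment – the interface here is a hypothetical stand-in for the Phoenix 5 `setStatusFramePeriod` API, and the 10 ms / 20 ms defaults are from memory, so check them against the CTRE docs before relying on them:

```java
/** Stand-in for the slice of the Phoenix 5 API this needs; on a real robot the
 *  call would be motor.setStatusFramePeriod(StatusFrameEnhanced..., periodMs),
 *  which returns an error code (0 = OK). */
interface StatusFrameDevice {
    int setStatusFramePeriod(String frame, int periodMs);
}

final class FrameRateTuner {
    /** Doubles the (believed) default periods of the two chattiest status frames,
     *  halving their rate as a conservative first experiment. The 10 ms / 20 ms
     *  defaults are from memory; verify against the CTRE documentation. */
    static boolean slowStatusFrames(StatusFrameDevice motor) {
        boolean ok = motor.setStatusFramePeriod("Status_1_General", 20) == 0;
        ok &= motor.setStatusFramePeriod("Status_2_Feedback0", 40) == 0;
        return ok;
    }
}
```

Note that slowing the feedback frame also makes velocity/position readings staler, so don’t slow frames for motors you are closed-loop controlling from the roboRIO side.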

I put together some notes on setting frame status periods, but it’s a little under-tested.

… but the charts above don’t show that CAN utilization is high, which would make me hesitant to try that.

I agree – not as high as I’ve seen teams run without complaint. But, it’s a time-varying value, and there aren’t all that many errors. I pretty often see teams with many similar issues, who are not complaining of how their robot is performing (I’m a CSA, and see such messages pretty frequently).

On the other hand, I’ve seen robots that have a lot on the CAN bus and run with no such messages. In the case where I’m most familiar (the team I mentor), this is so – and we pay attention to this sort of thing.

If the robot is running fine at this point, the easiest option is to ignore these messages. But, I would probably try to get to the bottom of things while you have some time. It will make it easier to troubleshoot anything further that shows up, plus it could make the robot more resilient.

Very nice write-up you linked, BTW.
