Understanding the pros and cons of "Lazy" motor controller wrappers

Some teams have added wrappers around the CAN motor controllers (TalonSRX, TalonFX, SparkMax) that track the last values set and only update the underlying controller when a new value is provided. For example, the 2019 public release of 254's code includes LazyTalonSRX and LazySparkMax. The comments in these classes say this is done to reduce CAN bus / CPU overhead.
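The pattern is simple to sketch. Here's a rough, self-contained illustration of the idea (in Python rather than the actual Java, with a made-up MotorController stand-in, not 254's real classes):

```python
class MotorController:
    """Stand-in for a real CAN controller; set() would emit a CAN control frame."""
    def __init__(self):
        self.frames_sent = 0

    def set(self, mode, value):
        self.frames_sent += 1  # real hardware: marshal a control frame onto the bus


class LazyMotorController(MotorController):
    """Caches the last (mode, value) forwarded and skips identical repeats."""
    def __init__(self):
        super().__init__()
        self._last = None  # (mode, value) of the most recent forwarded call

    def set(self, mode, value):
        if self._last != (mode, value):
            self._last = (mode, value)
            super().set(mode, value)  # only hit the bus when the command changes


lazy = LazyMotorController()
for _ in range(100):          # e.g. a 100 Hz loop commanding a constant intake voltage
    lazy.set("percent_output", 0.5)
print(lazy.frames_sent)       # → 1: only the first call reached the "bus"
```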

The motivation makes sense to me. In fact, it seems like strictly an improvement over the default implementation, which makes me wonder why it isn't the default. Are there any drawbacks to doing this? Specifically, why don't the CTRE and REV libraries do this for you?

Because it increases complexity and is no longer necessary. Prior to 2020, FRC_NetCommDaemon was responsible for managing the CAN bus connection and marshaling traffic, but it did so over an extremely slow RPC connection.

In 2020, the user program was given sole access to the CAN bus. This got rid of the latency on set() calls. The vendordep starts up a separate thread that monitors the bus and talks to diagnostic tools like the SPARK MAX Client or Phoenix Tuner.

My impression was this wrapper was more motivated by its impact on the throughput than the latency. If a team has an intake or some other simple mechanism that runs at constant voltage, but is set with a frequency on the order of 100Hz, that seems like a lot of CAN packets for not a lot of new information. Is this not actually a big deal in reality? What kind of throughputs does the FRC CAN bus support?

Not anymore. The bigger concern is the amount of data the motor controllers send out as status frames. If you have lots of motor controllers, they can saturate the bus and cause packets containing the set() value or CAN encoder data to be dropped or zeroed out (due to how the vendordep handles cached values during bus preemption). You should always set the status frame periods to large values if you don't plan to use the data they contain (e.g., temperature info, encoder data in an open-loop mechanism, DIO data).

If there are 20 CAN devices on the bus, 1 Mbit/s (or 125,000 bytes/s) allows each device to send 6,250 bytes per second. Obviously, some of this bandwidth will go to overhead. However, looking at REV's status frames here:

https://www.revrobotics.com/content/sw/max/sw-docs/cpp/_c_a_n_spark_max_low_level_8h_source.html

You can see that, in total, the 3 status frames account for just 81 bytes. If you have 16 Spark MAXs with motors going, the minimum bandwidth used is 81 * 16 = 1296 bytes. This is clearly not even close to 125,000 bytes. However, I’ve seen plenty of teams run into CAN bandwidth limitations on 16 speed controllers.
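For what it's worth, the arithmetic works out like this (payload bytes only, ignoring per-frame overhead, which is why this looks so far under budget):

```python
# Raw payload math from the post above: no arbitration IDs, bit stuffing,
# or frame overhead included yet.

bus_bytes_per_sec = 1_000_000 // 8      # 1 Mbit/s → 125,000 bytes/s
print(bus_bytes_per_sec // 20)          # → 6250 bytes/s per device with 20 devices

status_payload = 81                     # total payload of the 3 SPARK MAX status frames
print(status_payload * 16)              # → 1296 bytes across 16 controllers
```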

Where is the bandwidth going?

(Yes I know there are other CAN devices like RIO, PCM, PDP, CANcoders, etc.)

I don’t know. My team was getting zeroed out encoder data before limiting status frame rates, and we weren’t afterward. It could be something related to what happens when several devices try to send messages at the same time, how they handle retries, and whether the cached values are zeroed out if a new status frame isn’t received within a set amount of time.

Maybe the DS isn’t measuring the bandwidth utilization completely?

Noting the 29-bit CAN arbitration ID scheme outlined here, we can roughly estimate that for every 10 bits of data, there are 19 extra addressing bits.

So 1296 bytes = 10368 bits, and 10368 × 29 / 10 ≈ 30067 bits ≈ 3758 bytes. So even with addressing overhead on 16 speed controllers, we're only at about 3% utilization of the allotted 1 megabit.

Buuut that's all supposing it pulses at 1 Hz. If it pulses at 10 Hz, that's 30%, and 100 Hz is 300%. And now that seems ridiculous, so I think I've confused myself somewhere.
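Redoing that arithmetic with the same 29-bits-per-10-data-bits assumption, and scaling by the status frame rate:

```python
# Same rough overhead model as above: 19 extra addressing bits per 10 data bits,
# i.e. a 2.9x blow-up, then multiplied by how often the status payload repeats.

payload_bits = 1296 * 8                       # 10368 bits of status payload per cycle
on_wire_bits = payload_bits * 29 // 10        # ≈ 30067 bits with addressing overhead

for hz in (1, 10, 100):
    utilization = on_wire_bits * hz / 1_000_000
    print(f"{hz:3d} Hz → {utilization:.0%}")  # 1 Hz ≈ 3%, 10 Hz ≈ 30%, 100 Hz ≈ 301%
```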

That also assumes everyone is allocated their own timeslice in which to send data, so there are no conflicts. In reality, they aren't coordinating, so you can get a variable number of conflicts every 20 ms (the default status frame rate, I think?) depending on when each device booted up and started sending data on the bus. Hopefully they're smart enough to track when they last hit a conflict, shift their send time toward a less congested timeslice, and then keep using that offset.

If they aren’t doing that though, you could end up with devices continually preempting each other as they push their resend times later and later. It’s kinda like the FMS Christmas Tree thing where Wi-Fi has a random exponential back-off for retries, but no one will ever get to send any data if everyone delays the same amount. At least with CAN, there’s a message priority that ensures someone makes forward progress so eventually everyone will get to send something in an uncontested timeslice. If they don’t record when that was though, the conflicts will just happen again during the next cycle.

Another, less extreme, example of this phenomenon is how my team used to have a separate thread per subsystem in 2019 that delayed until a set wakeup time 5.05 ms in the future. Since the scheduling was asynchronous and determined by the OS, we got periodic spikes in the measured scheduling period when all the threads asked to run at once (like beats when two frequencies line up on a least common multiple). We now schedule everything on the main robot thread using TimedRobot.addPeriodic() so each subsystem gets its own uncontested timeslice, and we run as real-time so other processes don't get in our way.
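The beat effect and the fix can be sketched like this (the periods and offsets here are made up for illustration):

```python
# Toy model: periodic tasks with the same period collide on every cycle unless
# each one gets its own offset (an "uncontested timeslice").

def wakeups(period_ms, offset_ms, horizon_ms):
    """Set of wakeup times for one periodic task over the horizon."""
    return {offset_ms + k * period_ms for k in range(horizon_ms // period_ms)}

aligned = [wakeups(20, 0, 100) for _ in range(3)]        # three tasks, no offsets
staggered = [wakeups(20, o, 100) for o in (0, 5, 10)]    # offsets spread them out

print(set.intersection(*aligned))     # all five wakeup times collide
print(set.intersection(*staggered))   # → set(): no shared wakeup times
```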

That level of cooperative multitasking isn’t doable for motor controllers though, so another time-domain deconflicting protocol would need to be devised. They could either be given a slice based on their device ID (only works if no other devices interfere, I think), or use the “recorded back-off” approach.

Why not anymore? Was the bandwidth also increased when the changes were made in 2020? I'm still struggling to understand the impact of the 2020 changes on throughput.

Is it a bigger concern because the controllers by default send out significantly more packets than the RIO does, causing them to take up significantly more of the bandwidth?

Before 2020, the set() call would take some large, nondeterministic amount of time to actually take effect, and the robot code would block until it went through. get() calls on solenoids were the same way, since those also went through the netcomm layer (the pneumatic control module is a CAN device).

Yes.

Have those been fixed as well?

I believe so, but they're also cached at the HAL level, IIRC.

You've missed a few bits. Including bit stuffing, frame spacing, and all the other bits in the frame, the worst-case frame transmission takes 160 bits, and the exact length varies per frame. (This also does not account for the clock re-synchronization done every frame.)

So assuming every frame has 8 bytes of data (smaller frames will lower the effective bit rate):

64 bits of data per frame / 160 total bits per frame = 40% of the bus used for data in the worst case, which at 1 Mbit/s is a 400 kbit/s data rate.

This is 6250 frames per second (1,000,000 / 160), or 62.5 frames per 10 milliseconds (funny units, I know, but it puts them on a similar order to the controller frame rates). This includes all frames going into or coming out of the RIO. So if you have 10 motor controllers and nothing else on the bus, you can fit 6 frames per 10 ms from/to each device before you reach 100% utilization. For scheduling reasons, though, you probably don't want to run the bus above 60-80% utilization.
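The same budget math as a quick sketch, assuming 160-bit worst-case frames each carrying 8 data bytes:

```python
# Frame-budget arithmetic: worst-case 160 bits on the wire per frame.

frame_bits = 160
data_bits = 8 * 8                       # 8 data bytes per frame
print(data_bits / frame_bits)           # → 0.4, i.e. 40% of the bus carries payload

frames_per_sec = 1_000_000 // frame_bits
print(frames_per_sec)                   # → 6250 frames/s, or 62.5 per 10 ms

devices = 10
print(int(frames_per_sec / 100 // devices))  # → 6 frames per device per 10 ms
```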

It's quite a bit different from the Wi-Fi problem. Since Wi-Fi can't detect collisions, the time during a collision just becomes wasted airtime. With CAN, during a 'collision' (just normal arbitration) one controller wins out over the other and its data is still sent, wasting zero time on the bus. This is why Wi-Fi has to implement random back-off and CAN doesn't. The CAN protocol actually provides all of the scheduling needed: if the messages on the bus are always the same, the same preemption will occur, and each message will be sent at the same time each cycle anyway.

In theory, if every device on the bus correctly implements CAN, this should already exist simply through the prioritization mechanism. If every device implements its internal CAN buffer as a priority queue, then, since bus arbitration operates on a priority basis, you can model the entire CAN bus and all of the devices on it as a single non-preemptive priority scheduler. If you know every single message that will be sent on the bus, including the message size, frame rate, processor clock jitter, etc., you can calculate the worst-case scheduling time for every message on the bus, and whether or not the set is even reliably schedulable in the first place. This analysis becomes a bit more complex with frames that are not sent periodically.
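As a toy model of that claim (the arbitration IDs below are made up): if every node's transmit queue is a priority queue and the bus always grants the lowest ID, the whole bus behaves like one big non-preemptive priority scheduler:

```python
import heapq

def run_bus(pending):
    """pending: list of (arbitration_id, label). Lower ID wins arbitration."""
    heapq.heapify(pending)                  # all nodes' queues + bus arbitration ≈ one heap
    order = []
    while pending:
        _, label = heapq.heappop(pending)   # non-preemptive: a frame finishes once started
        order.append(label)
    return order

# Three devices contend at the same instant; the lowest ID always transmits first,
# so someone makes forward progress every slot (unlike Wi-Fi's wasted collision time).
print(run_bus([(0x205, "spark-status"), (0x041, "rio-heartbeat"), (0x183, "talon-set")]))
# → ['rio-heartbeat', 'talon-set', 'spark-status']
```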

However, this analysis, and the information required to perform it, is not really available, so the point is likely moot. If a frame is unschedulable, this should show up to the user as a timeout error from the API (i.e., the last time the frame was received by the RIO was longer ago than some timeout threshold). The above also does not account for bus faults (usually due to things like intermittent connections, a missing terminating resistor, or maybe a star topology if egregious enough).

Regarding the original question, there are a few pros/cons to sending control frames (i.e., the data from set() calls) at a periodic rate instead of one-shot. The obvious downside is the added CAN bandwidth. One benefit is that repeated calls to set() don't each send out a CAN message; the value instead goes out at a predefined rate. Another possible benefit is that the desired state of the controller (aside from configuration parameters) is always kept up to date, which is a simple mechanism to recover if a controller resets due to something like a breaker tripping.
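A minimal sketch of that periodic control-frame scheme (no real threading or CAN here, just the caching idea):

```python
class PeriodicSender:
    """set() only updates a cached command; a fixed-rate sender puts it on the
    bus every period, so a rebooted controller re-learns its setpoint within
    one period."""
    def __init__(self):
        self.command = 0.0
        self.frames = []              # stand-in for frames actually sent on the bus

    def set(self, value):
        self.command = value          # no bus traffic here, just update the cache

    def tick(self):                   # called once per control-frame period (e.g. 10 ms)
        self.frames.append(self.command)


ctrl = PeriodicSender()
ctrl.set(0.5)
ctrl.set(0.5)                         # repeated set(): still exactly one frame per tick
for _ in range(3):
    ctrl.tick()
print(ctrl.frames)                    # → [0.5, 0.5, 0.5]
```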

Also, since the underlying CAN calls have changed a bit, both libraries may behave differently than they did in 2019. I know the REV API behavior is slightly different (see the change log for API v1.5.1).

Well, despite there being plenty of bandwidth remaining (in 3512's case, 40% IRL before doing status frame limiting), the fact remains that we got continuously dropped status frames unless we limited the status frame rates for things we didn't need. I don't know why.

I wonder if this is a problem with the target device being unable to update the status frame at the desired frequency, or a timeout that’s too low on its CAN calls in firmware. Very difficult to track that down, but at least one could check between different manufacturers to see if there’s a difference in dropped frames.
