Some teams have added wrappers around the CAN motor controllers (TalonSRX, TalonFX, SparkMax) that track the last values set and only update the underlying controller when new values are provided. For example, the 2019 public release of 254's code includes LazyTalonSRX and LazySparkMax. The comments in these classes mention this is done to reduce CAN bus / CPU overhead.
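Conceptually, those wrappers just cache the last commanded value and skip the underlying set() call when nothing has changed. A minimal sketch of the idea in isolation (the LazyController class and MotorOutput interface here are hypothetical stand-ins, not 254's actual code):

```java
// Minimal sketch of the "lazy" wrapper idea: cache the last commanded value and
// only forward it to the real controller when it changes.
@FunctionalInterface
interface MotorOutput {
  void set(double value); // stand-in for whatever set() call the vendor exposes
}

class LazyController {
  private final MotorOutput inner;
  private Double lastValue = null; // null until the first command has been sent

  LazyController(MotorOutput inner) {
    this.inner = inner;
  }

  /** Forwards the command only if it differs from the last one actually sent. */
  void set(double value) {
    if (lastValue == null || value != lastValue) {
      inner.set(value);
      lastValue = value;
    }
  }
}
```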
The motivation makes sense to me. In fact, it seems like strictly an improvement over the default implementation, which makes me wonder: why isn't it the default? Are there any drawbacks to doing this? Specifically, why don't the CTRE and REV libraries do this for you?
Because it increases complexity and is no longer necessary. Prior to 2020, FRC_NetCommDaemon was responsible for managing the CAN bus connection and marshaling traffic, but it did so over an extremely slow RPC connection.
In 2020, the user program was given sole access to the CAN bus. This got rid of the latency on set() calls. The vendordep starts up a separate thread that monitors the bus and talks to diagnostic tools like the SPARK MAX Client or Phoenix Tuner.
My impression was that this wrapper was more motivated by its impact on throughput than on latency. If a team has an intake or some other simple mechanism that runs at constant voltage, but is set with a frequency on the order of 100 Hz, that seems like a lot of CAN packets for not a lot of new information. Is this not actually a big deal in reality? What kind of throughput does the FRC CAN bus support?
Not anymore. The bigger concern is the amount of data the motor controllers send out as status frames. If you have lots of motor controllers, they can saturate the bus and cause packets containing the set() value or CAN encoder data to never make it or get zeroed out (due to how the vendordep handles cached values during bus preemption). You should always set the status frame periods to large values if you don't plan to use the data they contain (e.g., temperature info, encoder data if it's open-loop, DIO data).
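For reference, "setting the status frame periods to large values" looks something like the following with REV's older (pre-2024) API; the device ID and the 500 ms periods are arbitrary examples, and you should check the vendor docs for which frames your code actually reads before slowing them down:

```java
import com.revrobotics.CANSparkMax;
import com.revrobotics.CANSparkMaxLowLevel.MotorType;
import com.revrobotics.CANSparkMaxLowLevel.PeriodicFrame;

class IntakeConfig {
  // Sketch: slow down status frames this mechanism never reads.
  // Device ID 5 and the 500 ms periods are arbitrary examples, not recommendations.
  static CANSparkMax configureIntake() {
    CANSparkMax intakeMotor = new CANSparkMax(5, MotorType.kBrushless);

    // Status 1 carries velocity/temperature/voltage; status 2 carries position.
    // If the mechanism is open-loop and we never read those, send them rarely.
    intakeMotor.setPeriodicFramePeriod(PeriodicFrame.kStatus1, 500);
    intakeMotor.setPeriodicFramePeriod(PeriodicFrame.kStatus2, 500);
    return intakeMotor;
  }
}
```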
If there are 20 CAN devices on the bus, 1 Mbit/sec (or 125,000 bytes/sec) allows each device to send 6,250 bytes per second. Obviously, there will be some overhead that takes up some of this bandwidth. However, looking at REV's status frames here:
You can see that, in total, the 3 status frames account for just 81 bytes. If you have 16 Spark MAXs with motors going, the minimum bandwidth used is 81 * 16 = 1296 bytes. This is clearly not even close to 125,000 bytes. However, I've seen plenty of teams run into CAN bandwidth limitations on 16 speed controllers.
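A quick arithmetic check of those numbers (payload bytes only, ignoring CAN framing overhead, and assuming the status frames go out once per second as in the follow-up below):

```java
// Quick check of the payload numbers above (data bytes only, no CAN framing overhead).
class BandwidthEstimate {
  public static void main(String[] args) {
    final int busBytesPerSec = 1_000_000 / 8;  // 1 Mbit/s bus = 125,000 bytes/s
    final int devices = 16;                    // Spark MAXs in the example
    final int statusPayloadBytes = 81;         // the three status frames, per REV's table

    // Payload per second if each controller's status frames go out once per second.
    int payloadPerSec = devices * statusPayloadBytes;            // 1,296 bytes/s
    double utilization = 100.0 * payloadPerSec / busBytesPerSec; // ~1% of the raw bus

    System.out.printf("payload: %d bytes/s, ~%.1f%% of %d bytes/s%n",
        payloadPerSec, utilization, busBytesPerSec);
  }
}
```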
Where is the bandwidth going?
(Yes I know there are other CAN devices like RIO, PCM, PDP, CANcoders, etc.)
I don't know. My team was getting zeroed-out encoder data before limiting status frame rates, and we weren't afterward. It could be something related to what happens when several devices try to send messages at the same time, how they handle retries, and whether the cached values are zeroed out if a new status frame isn't received within a set amount of time.
Maybe the DS isn't measuring the bandwidth utilization completely?
So 1296 bytes = 10,368 bits, and 10,368 * 29 / 10 ≈ 30,067 bits ≈ 3,758 bytes. So even with addressing overhead on 16 speed controllers, we're only at about 3% utilization of the allotted 1 megabit.
But that's all supposing it pulses at 1 Hz. So if it pulses at 10 Hz, that's 30%, and 100 Hz is 300%. Now that seems ridiculous, so I think I've confused myself somewhere.
That also assumes everyone is allocated their own timeslice in which to send data, so there are no conflicts. In reality, they aren't coordinating, so you can get a variable number of conflicts every 20 ms (the default status frame rate, I think?) depending on when each device booted up and started sending data on the bus. Hopefully, they're smart enough to keep track of when they last got a conflict, shift their send time toward a less congested timeslice, and then keep using that timeslice offset.
If they aren't doing that though, you could end up with devices continually preempting each other as they push their resend times later and later. It's kinda like the FMS Christmas Tree thing where Wi-Fi has a random exponential back-off for retries, but no one will ever get to send any data if everyone delays the same amount. At least with CAN, there's a message priority that ensures someone makes forward progress, so eventually everyone will get to send something in an uncontested timeslice. If they don't record when that was though, the conflicts will just happen again during the next cycle.
Another, less extreme, example of this phenomenon is how my team used to have a separate thread per subsystem in 2019 that delayed until a set wakeup time 5.05 ms in the future. Since the scheduling was asynchronous and determined by the OS, we got periodic spikes in the measured scheduling period when all the threads asked to run at once (like beats when two frequencies line up on a common multiple). We now schedule everything on the main robot thread using TimedRobot.addPeriodic() so each subsystem gets its own uncontested timeslice, and we run as real-time so other processes don't get in our way.
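For anyone unfamiliar with that approach, a minimal sketch of staggering subsystems with TimedRobot.addPeriodic() looks something like this (the subsystem methods and the 5 ms period/offset values are made-up examples, not our actual code):

```java
import edu.wpi.first.wpilibj.TimedRobot;

public class Robot extends TimedRobot {
  @Override
  public void robotInit() {
    // Each callback gets its own offset within the period, so the wakeups
    // never land on top of each other (periods/offsets are example values).
    addPeriodic(this::updateDrivetrain, 0.005, 0.000);
    addPeriodic(this::updateElevator, 0.005, 0.001);
    addPeriodic(this::updateIntake, 0.005, 0.002);
  }

  private void updateDrivetrain() { /* drivetrain control loop */ }

  private void updateElevator() { /* elevator control loop */ }

  private void updateIntake() { /* intake control loop */ }
}
```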
That level of cooperative multitasking isn't doable for motor controllers though, so another time-domain deconflicting protocol would need to be devised. They could either be given a slice based on their device ID (only works if no other devices interfere, I think), or use the "recorded back-off" approach.
Why not anymore? Was the bandwidth also increased when the changes were made in 2020? I'm still struggling to understand the impact of 2020's changes on the throughput.
Is it a bigger concern because the controllers by default send out significantly more packets than the RIO does, causing them to take up significantly more of the bandwidth?
Before 2020, the set() call would take some large, nondeterministic amount of time to actually take effect, and the robot code would block until the calls went through. get() calls on solenoids were the same way, since those also went through the netcomm layer (the pneumatic control module is a CAN device).
You've missed a few bits. Including bit stuffing, frame spacing, and all the other bits in the frame, the worst-case frame transmission takes up 160 bits. The length does vary each frame. (This also does not account for the clock re-synchronization done every frame.)
So assuming every frame has 8 bytes of data (smaller frames will lower the effective bit rate):
64 bits of data per frame / 160 total bits per frame = 40% of the bus used for data in the worst case, which at 1 Mbit/s is a 400 kbit/s data rate.
This is 6,250 frames per second (1,000,000 / 160), or 62.5 frames per 10 milliseconds (funny units, I know, but it puts it on a similar order to the controller frame rates). This includes all frames going into or coming out of the RIO. So if you have 10 motor controllers and nothing else on the bus, you can fit 6 frames per 10 ms from/to each device before you reach 100% utilization. Though for scheduling reasons, you probably don't want to run the bus at more than 60-80% utilization.
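Plugging those numbers into a quick back-of-the-envelope calculation (the device count and per-device frame rate are just the example inputs from above):

```java
// Bus utilization estimate using the 160-bit worst-case frame length above.
class CanUtilization {
  public static void main(String[] args) {
    final double busBitsPerSec = 1_000_000;   // 1 Mbit/s FRC CAN bus
    final double bitsPerFrame = 160;          // worst case incl. stuffing and spacing

    final int devices = 10;                   // example: 10 motor controllers
    final double framesPerDevicePer10Ms = 6;  // example: 6 frames per device per 10 ms

    double framesPerSec = devices * framesPerDevicePer10Ms * 100;
    double utilization = 100.0 * framesPerSec * bitsPerFrame / busBitsPerSec;

    // 10 devices * 6 frames / 10 ms = 6000 frames/s -> ~96% of the bus
    System.out.printf("%.0f frames/s, ~%.0f%% utilization%n", framesPerSec, utilization);
  }
}
```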
It's quite a bit different from the Wi-Fi problem. Since Wi-Fi can't detect collisions, the time during a collision just becomes wasted space. With CAN, during a "collision" (just normal arbitration) one controller wins out over another and the data is still sent, wasting zero time on the bus. This is why Wi-Fi has to implement random backoff and CAN doesn't. The CAN protocol actually provides all of the scheduling needed. If the messages on the bus are always the same, then the same preemption will occur and the message will be sent at the same time each cycle anyway.
In theory, if every device on the bus correctly implements CAN, this should already exist simply by virtue of the prioritization mechanism. If every device implements its internal CAN buffer as a priority queue, then since the CAN bus arbitration mechanism operates on a priority basis, you can model the entire CAN bus and all of the devices on it as a single non-preemptive priority scheduler. If you know every single message that will be sent on the bus, including the message size, frame rate, processor clock jitter, etc., you can calculate the worst-case scheduling time for every message on the bus, and whether or not it is even reliably schedulable in the first place. This analysis becomes a bit more complex with frames that are not sent periodically.
However, this analysis, and the information required to perform it, is not really available, so the point is likely moot. If a frame is unschedulable, this should show up to the user as a timeout error from the API (i.e., the last time the frame was received by the RIO was longer ago than some timeout threshold). The above also does not account for bus faults (usually due to things like intermittent connections, a missing terminating resistor, or maybe a star topology if egregious enough).
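To make the worst-case analysis above concrete, here is a rough sketch loosely following the standard response-time analysis for CAN, treating the bus as a non-preemptive priority scheduler. All of the message names, periods, and priorities are invented, jitter and error frames are ignored, and deadlines are assumed equal to periods:

```java
import java.util.Arrays;
import java.util.Comparator;

// One periodic CAN message in the simplified model.
class CanMessage {
  final String name;
  final double txTime;  // C: worst-case transmission time in seconds (e.g. 160 bits / 1 Mbit/s)
  final double period;  // T: send period in seconds, also used as the deadline
  final int priority;   // lower number = higher priority (wins arbitration)

  CanMessage(String name, double txTime, double period, int priority) {
    this.name = name;
    this.txTime = txTime;
    this.period = period;
    this.priority = priority;
  }
}

class CanSchedulability {
  public static void main(String[] args) {
    double c = 160.0 / 1_000_000;  // worst-case frame time at 1 Mbit/s
    CanMessage[] msgs = {
        new CanMessage("rio control", c, 0.020, 1),
        new CanMessage("mc1 status", c, 0.010, 2),
        new CanMessage("mc2 status", c, 0.010, 3),
    };
    Arrays.sort(msgs, Comparator.comparingInt((CanMessage m) -> m.priority));

    for (int i = 0; i < msgs.length; i++) {
      // Blocking: one lower-priority frame may already be on the bus (non-preemptive).
      double blocking = 0;
      for (int j = i + 1; j < msgs.length; j++) {
        blocking = Math.max(blocking, msgs[j].txTime);
      }

      // Fixed-point iteration for the queuing delay caused by higher-priority traffic.
      double w = blocking;
      for (int iter = 0; iter < 1000; iter++) {
        double next = blocking;
        for (int j = 0; j < i; j++) {
          next += Math.ceil(w / msgs[j].period + 1e-9) * msgs[j].txTime;
        }
        if (next == w) {
          break;
        }
        w = next;
      }

      double response = w + msgs[i].txTime;
      System.out.printf("%s: worst-case response %.3f ms (%s)%n", msgs[i].name,
          response * 1e3, response <= msgs[i].period ? "schedulable" : "NOT schedulable");
    }
  }
}
```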
Regarding the original question, there are a few pros/cons to sending control frames (i.e., the data from set() calls) at a periodic rate instead of "one-shot". The obvious downside is the added CAN bandwidth. One benefit is that repeated calls to set() don't each send out a CAN message; instead, the command goes out at a predefined rate. Another possible benefit is that the desired state of the controller (aside from configuration parameters) is always kept up to date, which is a simple mechanism to prevent issues if a controller resets due to something like a breaker tripping.
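As an illustration of the periodic behavior described above (not how either vendor library actually implements it internally), user code could approximate it with something like this sketch using WPILib's Notifier; the class name and structure are hypothetical:

```java
import java.util.function.DoubleConsumer;

import edu.wpi.first.wpilibj.Notifier;

// Sketch: cache the last commanded output and re-send it at a fixed rate, so a
// controller that reset (e.g. after a breaker trip) picks the command back up on
// the next refresh. The DoubleConsumer stands in for the vendor's set() call.
class PeriodicCommandSender {
  private volatile double lastCommand = 0.0;
  private final Notifier notifier;

  PeriodicCommandSender(DoubleConsumer output, double periodSeconds) {
    notifier = new Notifier(() -> output.accept(lastCommand));
    notifier.startPeriodic(periodSeconds);
  }

  /** Callers just update the cached command; the Notifier re-sends it on a schedule. */
  void set(double value) {
    lastCommand = value;
  }
}
```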
Also, since the underlying CAN calls have changed a bit, both libraries may behave differently than they did in 2019. I know the REV API behavior is slightly different (see the changelog for API v1.5.1).
Well, despite there being plenty of bandwidth remaining (in 3512's case, 40% IRL before doing status frame limiting), the fact remains that we get continuously dropped status frames unless we limit the status frame rates for things we don't need. I don't know why.
I wonder if this is a problem with the target device being unable to update the status frame at the desired frequency, or a timeout that's too low on its CAN calls in firmware. That would be very difficult to track down, but at least one could check between different manufacturers to see if there's a difference in dropped frames.