Let's talk about CAN delays (and how to address them?)

In working through planning for the controls work that will need to be done during the Diff Swerve Beta2, The Controls & Software Boogaloo project, the problem of timing delays for CAN messages in the FRC control system has come to my attention.

So… this Austin Schuh guy at least seems to think it is a big deal (or can be).

What do others have to say on the matter?

For my own selfish reasons (relating to trying to control my favorite flavor of diff swerve), I am particularly interested in CAN issues that arise when trying to implement LQR control with Falcon 500s (in current control mode). Oh, and there are CANcoders in the sensor mix.

Assuming CAN delays (or rather non-deterministic delays) are indeed an impediment to implementing the best version of a well-behaved robot, what are our options? Do we just add a CPU to tame the CAN delay problem and leave the rest of the control code running on the RoboRIO? Do we throw up our hands and move the entire control code over to a co-processor (save for the required enable signal from the RoboRIO)? Is there another option behind Door #3?

Discuss.

Dr. Joe J.

2 Likes

If eliminating the delay turns out to be impossible, another option is reducing the controller gains mathematically so the delay at least doesn’t make the system unstable. How much they get reduced depends on how fast/far the system can move within the measurement delay duration.

Here’s an excerpt from my book that talks about how to do this rigorously.
lqr-latency-comp.pdf (330.2 KB)

The Python scripts that generated the plots are here if you want to play with them.

The paper discusses things in more detail, but a major downside of latency compensation is if your system accelerates too fast, the optimal feedback gains may be no feedback gains at all (i.e., feedforward only). This naturally has robustness issues.

WPILib implements this algorithm in the LinearQuadraticRegulator class for Java and C++. SysId uses it to compensate for encoder filtering delays in on-motor-controller feedback.
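For anyone who wants to poke at the idea without opening the PDF, here is a minimal sketch of the gain reduction for a scalar plant. It assumes the compensated gain has the form K·(A − BK)^(delay/dt), which is my understanding of the approach; the plant numbers (`a`, `b`) and gains are made up for illustration, and `latency_compensate` here is a hypothetical helper, not the WPILib method.

```python
def latency_compensate(k, a, b, dt, delay):
    """Scale a scalar feedback gain k to account for a pure delay.

    Assumed discrete-time plant: x[n+1] = a*x[n] + b*u[n], with u = -k*x.
    The closed-loop pole is (a - b*k). Raising it to delay/dt predicts how
    much the state decays during the delay, and the gain shrinks to match.
    """
    closed_loop = a - b * k
    if closed_loop < 0:
        raise ValueError("fractional power needs a non-negative closed-loop pole")
    return k * closed_loop ** (delay / dt)


# Illustrative numbers only: 5 ms loop, 20 ms of delay.
gentle = latency_compensate(k=4.0, a=0.95, b=0.1, dt=0.005, delay=0.020)
aggressive = latency_compensate(k=9.0, a=0.95, b=0.1, dt=0.005, delay=0.020)
print(f"gentle gain 4.0 -> {gentle:.4f}")
print(f"aggressive gain 9.0 -> {aggressive:.6f}")
```

The aggressive case shows the downside mentioned above: the faster the closed-loop system moves within the delay, the closer the compensated gain gets to zero, which leaves you with feedforward only.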

1 Like

A couple of things to note:

  • The CAN stack on the Rio has substantially changed since 2017. Prior to 2020, CAN bus calls went through an IPC layer before transmission, which added substantial delay and timing uncertainty; starting in 2020, CAN bus calls no longer go through an IPC layer.
  • There will always be some degree of timing uncertainty on CAN because the motor controllers have independent clocks and send status messages asynchronously onto the shared medium, so there is always the possibility of random collisions. It might be possible for the motor controller firmware to synchronize transmits based on periodic messages from the Rio (and avoid collisions e.g. with a settable offset time), but I don’t believe they do this at present.
2 Likes

Hold up, are you concerned about the latency of CAN control messages, or the jitter from one message to the next? Those are quite different problems.

What exactly does this mean? There’s only one CANbus (for controlling motors), so adding a coprocessor wouldn’t help collision-related issues.

I believe that jitter is the bigger problem, and that latency isn’t great but wouldn’t be too terrible to manage if it were consistent. (I have literally zero firsthand information on this; I am going more by what I have been able to understand from reading the discussions of others.)

Let me start by saying I am completely out of my depth (and my information may be out of date based on some messages above), but my understanding is that the CAN stack on the RoboRIO was making things worse than they could be and that there was a way (that was legal per FIRST rules) to have another CAN stack (on another CPU) do a better job of managing messages.

So… I could be completely wrong about this. That’s why I am trying to get this discussion going.

R57. ROBOTS must be controlled via one (1) programmable National Instruments roboRIO (P/N: am3000), with image version FRC_roboRIO_2020_v10 or later.

There are no rules that prohibit co-processors, provided commands originate from the roboRIO to enable and disable all power regulating devices. This includes motor controllers legally wired to the CAN-bus.

[Emphasis mine]
It seems that co-processors could manage the inputs (presumably direct DIO inputs) and send throttle setpoints to the motor controllers, as long as the enable and disable signals are from the RIO. I’m not sure how this can be a readily inspectable item unless there are a small set of approved “well behaving” co-processors. Another possibility might be that this co-processor would send the setpoints to the RIO over, say I²C. As another alternative, this co-processor could tell the motor controllers what to do by emulating a sensor, taking the sensor data and feedback completely off CAN.

As @Peter_Johnson says, software upgrades to the RIO’s CAN stack have significantly decreased the amount of intrinsic latency from “software call” to “bits on the wire”. However, CANbus is by nature nondeterministic: higher-priority messages will preempt lower-priority ones (priority is defined by the arbitration ID, which is divided up into several different fields in “FRC-CAN” (FRC CAN Device Specifications — FIRST Robotics Competition documentation)). The arbitration, and possible retransmit attempts, happens at a low level in a CAN controller (an IP block in the FPGA, in the case of the RIO), and typically isn’t very accessible from software.

I’ve contemplated the possibility of including some type of reduced timestamp in CAN frames, so that latency could be known and corrected for, but there are at least two problems with that:

  1. Any such scheme would have to be implemented in the motor controller firmware. It’s not a DIY solution.
  2. You only have 8 bytes per frame to play with, so you sacrifice quite a bit of payload throughput even with only 16-bit timestamps.
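A sketch of what such a reduced timestamp could look like, purely hypothetical: a 16-bit microsecond counter in the first two bytes, leaving 6 of the 8 bytes (75%) for payload. It assumes sender and receiver share a synchronized microsecond clock, which is itself a nontrivial requirement; `pack_frame` and `frame_age_us` are invented names, not part of any vendor protocol.

```python
import struct

TIMESTAMP_WRAP = 1 << 16  # a 16-bit microsecond counter wraps every 65.536 ms


def pack_frame(timestamp_us: int, payload: bytes) -> bytes:
    """Prefix up to 6 payload bytes with a little-endian 16-bit timestamp."""
    if len(payload) > 6:
        raise ValueError("only 6 bytes remain after a 16-bit timestamp")
    return struct.pack("<H", timestamp_us % TIMESTAMP_WRAP) + payload


def frame_age_us(frame: bytes, now_us: int) -> int:
    """Age of the frame in microseconds, correct across one counter wrap."""
    (sent_us,) = struct.unpack_from("<H", frame)
    return (now_us - sent_us) % TIMESTAMP_WRAP


frame = pack_frame(65_500, b"\x01\x02")
print(frame_age_us(frame, 65_600))  # receiver clock 100 us later
```

The modular subtraction only works if frames are never older than one wrap period (about 65 ms here), which is the trade against using more timestamp bits.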

Yea, there have been quite some improvements. It is on my list hopefully this beta test season to hook up a RIO to one of my test rigs with a nice RT kernel and a PCIe CAN card and see how well we can tune things up.

From a system point of view, I’d really like everything to happen at a repeatable, designed cadence. That fits the controls math best. The easiest way to do that is to set the commands up to be periodic and consistent. Something like:

  • send current command for motor 0.
  • send current command for motor 1.
  • send current command for motor 2.
  • (optional, would need protocol extension) send “flush” packet to have all controllers pick up the updates and queue data to be sent
  • get current amps for motor 0 (and potentially encoder reading or whatever else you need)
  • get current amps for motor 1 (and potentially encoder reading or whatever else you need)
  • get current amps for motor 2 (and potentially encoder reading or whatever else you need)
  • do snazzy controls calcs, let other CAN traffic happen, wait and start the cycle over.

That way the same thing happens every cycle and the timing can be consistent (within 1 CAN packet if something else is happening when this all kicks off). 1 CAN packet at 1 Mbps is ~100 µs, which is 2% error on a 5 ms loop, i.e., not bad at all! This adds 1 cycle of delay (5 ms), but that is well understood, model-able, and controllable. It does require the other bus traffic to be lower CAN priority, though, which the current CAN ID bit allocation has challenges with.
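The ~100 µs figure can be sanity-checked by counting bits. The sketch below assumes an extended (29-bit ID) classic CAN data frame with 8 data bytes and includes the 3-bit interframe space; it lands at 131 to 160 µs depending on bit stuffing, so the same order of magnitude and roughly 2.6% to 3.2% of a 5 ms loop.

```python
def extended_frame_bits(data_bytes: int = 8, worst_case_stuffing: bool = False) -> int:
    """Bits on the wire for one extended-ID classic CAN data frame."""
    # Bit-stuffed region: SOF, 11-bit base ID, SRR, IDE, 18-bit ID extension,
    # RTR, two reserved bits, 4-bit DLC, data, 15-bit CRC.
    stuffable = 1 + 11 + 1 + 1 + 18 + 1 + 2 + 4 + 8 * data_bytes + 15
    # Fixed-form tail: CRC delimiter, ACK slot, ACK delimiter, 7-bit EOF,
    # plus the 3-bit interframe space before the next frame can start.
    tail = 1 + 1 + 1 + 7 + 3
    # Worst case, a stuff bit follows the first 5 bits and every 4 bits after.
    stuff = (stuffable - 1) // 4 if worst_case_stuffing else 0
    return stuffable + tail + stuff


def frame_time_us(bits: int, bitrate: int = 1_000_000) -> float:
    return bits * 1e6 / bitrate


loop_us = 5000.0
for worst in (False, True):
    t = frame_time_us(extended_frame_bits(worst_case_stuffing=worst))
    print(f"worst_case_stuffing={worst}: {t:.0f} us, {100 * t / loop_us:.1f}% of the loop")
```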

As @calcmogul points out, you can always slow your controller down, but we should be able to do better :slight_smile:

Do you happen to know off the top of your head if there are any other threads involved with sending or receiving CAN data that would need prioritization to get to some sort of repeatable, periodic loop?

3 Likes

To be clear, arbitration is actually built into CAN itself, at the data link layer. The dominant and recessive bits implement arbitration quite nicely, so the highest-priority packet (lowest CAN ID) goes out first. When 2 packets are sent at the same time, the dominant bits (0s) keep winning while the IDs are being sent, and any time a node on the CAN bus detects that it lost while sending its ID (its 1 became a 0 on the bus), it stops sending and retries once the bus is idle again.
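A toy model of that bitwise arbitration, as a simplified sketch: it only walks the 29 ID bits and ignores the SRR/IDE/RTR bits that also take part on a real bus. The IDs in the example are arbitrary.

```python
def arbitrate(ids, id_bits: int = 29) -> int:
    """Return the ID that survives arbitration among simultaneous senders."""
    contenders = set(ids)
    for bit in range(id_bits - 1, -1, -1):  # IDs go out most significant bit first
        # The bus acts as a wired-AND: it reads dominant (0) if anyone sends 0.
        bus_level = min((node_id >> bit) & 1 for node_id in contenders)
        # Nodes that sent recessive (1) but read back dominant (0) drop out.
        contenders = {n for n in contenders if (n >> bit) & 1 == bus_level}
    (winner,) = contenders
    return winner


pending = [0x02040001, 0x02050001, 0x01000000]
print(hex(arbitrate(pending)))
```

The survivor is always the numerically lowest ID, which is why the whole identifier effectively acts as the priority.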

J1939 (and by extension ISO 11783) builds on this by reserving the first 3 bits of the CAN ID for a “priority”.

1 Like

That’s not relevant to FRC, though; the 29 bits of identifier are split up differently, and the whole thing is effectively the “priority”.

That was actually (partially) implemented in the Jaguars that everyone likes to complain about, via the “synchronous update” mechanism. See p. 20: http://carlosgj.org/Lorentz/7870.SW-RDK-BDC24-UG-7243.pdf

1 Like

I started this conversation to try to understand
(A) whether CAN delay issues are a significant problem in the path of implementing a high-performance control strategy using CAN, and
(B) if so, whether anything can be done about it (short of getting FIRST/REV/NI/CTRE involved).

With regard to A: the timing is not ideal, especially if you want/need 5 ms loop times.

With regard to B: I guess I am hearing that not much can be done.

Is that about the size of it?

2 Likes

I’ve been trying to think through this, and I suppose another solution would be to try and leave out the CAN bus from the system (as much as possible).

I don’t think you could do this with the Falcon/CANcoder setup, but I think it may be possible with a NEO/standard mag encoder setup. If you made a breakout board to read the signals from the NEO encoder before they go into the Spark Max, you wouldn’t have to rely on the CAN timing to get that information. Either feed that directly to the Rio, or make a nice little PCB with a microcontroller that feeds that data to the Rio any way you want (maybe use CAN with a timestamp, or a different digital bus).

Just a thought. This obviously is not as clean of a wiring/programming solution as all CAN, but it could give you better control.

FYI, that’s exactly what 971 does, and for this exact reason… their robots are all PWM-based, at least for anything that needs RT control over timing (I’ve not asked, but they might e.g. put an intake motor on CAN?). @AustinSchuh can talk more about their overall design approach.

5 Likes

I’ve heard rumors of that, but I wasn’t 100% sure.

That is pretty neat. It would be great to get their input on this.

Also, has anyone tried to read the encoder off of the NEO (without disrupting the signal going to the Spark Max)? I am making the assumption that it’s not too hard to do. You could probably figure it out with a multimeter. I couldn’t find any documentation on exactly how it is wired.

3512 hasn’t had measurable controls problems (delay, oscillation, chattering) in our reference/state/input/output plots using CAN motor controllers and RIO sensors for our drivetrain, turret, and flywheel. We use an RT thread with 0.1 ms of jitter and a 5 ms sample period. We avoid using encoder data from asynchronous CAN status frames (as opposed to blocking on status frame receive, which idk how to do) for feedback because compensating for variable measurement delay from 0 ms to 20 ms (the nominal status frame period) would be annoying. If I recall correctly, the CAN layer provides timestamps, but it’s not exposed through the HAL.

We use the NEO hall effect encoder on our climber, which is sent via a CAN status frame, for soft limits because we don’t care about determinism for that.

You could, but keep in mind it’ll be noisy. The Spark Max applies a large amount of unconfigurable filtering, so you’ll likely have to do the same thing if you just splice the wires and read them yourself.

3512 attempted on-motor-controller feedback with the NEO hall effect encoder on our flywheel, but the internal filtering added too much lag to hit tight recovery tolerances. It has a 64-tap moving average sampled every 1 ms, so that gives (64 - 1) / 2 = 31.5 ms of delay. Non-oscillatory gains had 1-2 s recovery times compared to 150 ms for our RIO encoder setup. Measurement delay matters less for systems that accelerate slower like drivetrains, but it’s still not great.
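The (64 - 1) / 2 figure is the standard delay of a moving average, measured in sample periods. A quick sketch that computes it and cross-checks it by running a unit-slope ramp through a 64-tap window (the helper names are made up for this example):

```python
def moving_average_delay_ms(taps: int, sample_period_ms: float) -> float:
    """Delay of an N-tap moving average: (N - 1) / 2 sample periods."""
    return (taps - 1) / 2 * sample_period_ms


def measured_lag_samples(taps: int, n: int = 500) -> float:
    """Filter a unit-slope ramp and report how far the output trails the input."""
    ramp = list(range(n))
    window = ramp[n - taps:]        # the last `taps` samples
    filtered = sum(window) / taps   # moving-average output at the final sample
    return ramp[-1] - filtered


print(moving_average_delay_ms(64, 1.0))  # 64 taps at 1 ms per sample
print(measured_lag_samples(64))          # same number, in samples
```

Both lines print 31.5, so at 1 ms per sample the filter alone accounts for several 5 ms control periods of lag.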

We never tried a separate quadrature encoder plugged into the Spark Max data port (which does have configurable filtering) because alternate encoder support wasn’t implemented at the time (January 2020).

4 Likes

First off, great input. I really appreciate that everyone is helping me work through this.

I am reasonably committed to using the Falcons, so it makes me sad if I can’t use the Falcon internal encoders (or, I assume, its current sensors) for feedback in a system that aspires to a 5 ms loop time.

CANcoders are in the same bucket then…

Not the best news I could have expected.

Trying to see what I can salvage. Suppose that CAN outputs are OK enough (based on 3512’s experience); then we could use the Falcons in current mode. If I rolled my own non-CAN encoders and current sensing, do you think we could build a reliable state-space controller with a 5 ms loop time?

Anecdotally, we got a twin Falcon setup working at much better than 1-2 s recovery with onboard controls only (and fairly low mechanical compression). So is the higher-CPR encoder critical in these types of applications?

The Falcon built-in encoders have much higher resolution than the NEO (2048 CPR for Falcon vs 42 CPR for NEO), and the filtering is configurable. Their default is also 64 taps based on Bring Up: Talon FX/SRX Sensors — Phoenix documentation. I think the filtering matters more than the CPR, although lower resolutions require more filtering if you don’t want a choppy signal due to quantization noise.
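To put rough numbers on the quantization point, here is a sketch of the smallest velocity step you can resolve if velocity is estimated by naively differencing encoder counts. That estimator is an assumption made for illustration; it is not a description of what the Spark Max or Talon FX firmware actually does.

```python
def velocity_step_rpm(cpr: int, sample_period_s: float, window: int = 1) -> float:
    """Velocity change in RPM represented by one count over `window` samples.

    A moving average of `window` one-sample differences equals one difference
    taken across `window` samples, so averaging shrinks the step by `window`.
    """
    return 60.0 / (cpr * sample_period_s * window)


for name, cpr in (("NEO, 42 CPR", 42), ("Falcon, 2048 CPR", 2048)):
    raw = velocity_step_rpm(cpr, 0.001)
    averaged = velocity_step_rpm(cpr, 0.001, window=64)
    print(f"{name}: {raw:.1f} RPM per count unfiltered, {averaged:.2f} RPM with 64 taps")
```

Under these assumptions the 42 CPR signal needs heavy averaging before it is usable at a 1 ms sample rate, while the 2048 CPR signal can get by with far fewer taps and therefore far less delay.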

1 Like

Peter nailed how we tackle this. I can provide a lot more detail.

The FPGA reports back when the PWM cycle ends. This is when the falling edge of all the PWM pulses go out. NI graciously adjusted the PWM cycle so that the falling edges are aligned rather than the rising edges. Each motor controller updates its output duty cycle on the falling edge, so this means that all the controllers actually update at the same time, and a fixed 5.05 ms after the previous output. We then sleep until 50 µs after that point in time, wake up, sample all our sensors, and do our controls calcs. We then write the outputs to the FPGA to be picked up when the PWM outputs start again.

t=0 ms falling edge of PWM cycle
t=0.05 ms schedule wakeup
t=0.15 ms sample sensors
t=1.0 ms start control loop calcs
t=1.5 ms controls calcs finish
t=1.7 ms FPGA registers written with new controls values
t=2.45 ms (assuming a 2.6 ms PWM pulse) FPGA acts on PWM register value
t=5.05 ms start over

There is a lot of tuning we do to our software to keep these times constant and make sure we always get scheduled promptly. I’m happy to go into more detail, but there is a fair amount more here. We are seeing ~150 µs of scheduling jitter.

If you look at the math behind the Z transform (which is the math that discrete time controls use), it essentially assumes that control loops happen at a fixed, well known, period. Rather than fight that math, we instead build software to try to meet those guarantees.
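A small sketch of why the fixed period matters, under stated assumptions: wakeups are computed as absolute deadlines from the PWM falling edge (so errors never accumulate), and the plant is a hypothetical first-order lag with a 50 ms time constant. The 5.05 ms period, 50 µs offset, and 150 µs jitter come from the numbers above; everything else is illustrative and not 971's actual code.

```python
import math


def wakeup_times(first_edge_s: float, period_s: float, offset_s: float, count: int):
    """Absolute wakeup deadlines, each a fixed offset after a PWM falling edge."""
    return [first_edge_s + k * period_s + offset_s for k in range(count)]


def discrete_pole(tau_s: float, dt_s: float) -> float:
    """Pole of the first-order lag x' = -x / tau after discretizing with step dt."""
    return math.exp(-dt_s / tau_s)


PERIOD_S = 0.00505   # PWM cycle length
OFFSET_S = 0.00005   # wake up 50 us after the falling edge
JITTER_S = 0.00015   # observed scheduling jitter
TAU_S = 0.05         # hypothetical plant time constant

print([round(t * 1e3, 3) for t in wakeup_times(0.0, PERIOD_S, OFFSET_S, 4)])  # in ms
nominal = discrete_pole(TAU_S, PERIOD_S)
late = discrete_pole(TAU_S, PERIOD_S + JITTER_S)
print(f"pole at nominal dt: {nominal:.5f}, at dt + jitter: {late:.5f}")
```

The controller is designed around the nominal pole, so every cycle that runs at a different dt is effectively controlling a slightly different plant than the one in the model; keeping the jitter small relative to the period keeps that mismatch small.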

I can go into more detail if you have questions.

11 Likes