Debugging intermittent runtime problems using code instrumentation

Having never had the opportunity to personally code an FRC robot, I don’t know all the specifics of what support the various languages and frameworks provide for debugging repeatable or intermittent realtime issues like the ones I see being discussed all the time on CD.

But there is a technique my team and I used to use back in the day when we were developing software for embedded realtime systems.

I present it here for discussion.

Put this code at the very beginning of TeleOp and every other periodic thread:

10 temp = time()
20 ptest += temp-tstart-period
30 if(ptest>period){err++; ptest=0}
40 tstart = temp
50 for(i=1;i<=burns;i++){x=1.0/(x+1.0)}

Explanation of line numbers:

Line 10 grabs a system timer. You want to pick one that has low overhead and sufficient resolution. I haven’t visited this issue since January of 2013 so I won’t make a specific recommendation here, but here is a thread worth reading:

Line 20 accumulates the thread start jitter in ptest. In an ideal world this would average out to zero.

Line 30 tests the thread start jitter accumulator ptest. If it exceeds one period (of the thread in question), then you’ve missed a cycle, so increment an error counter and reset the accumulator.

Line 40 updates tstart for the next iteration.

Line 50 is something we used to call (way back when I was doing this kind of stuff) a “burn counter”. When burns = 0 it does nothing. As you increase the value of burns, it starts consuming CPU time. This is a great way to get a rough idea of how much CPU margin you have: set all the burns to zero (one for each periodic task/thread), then test each thread one at a time by increasing the burns in that thread until either a) the err counter starts incrementing, or b) your code starts misbehaving. Repeat for each thread.

Other notes:

I didn’t show initialization or scope of the variables. That will differ with your language. Do whatever is necessary for proper initialization*, and to ensure that 1) each thread has its own temp, ptest, err, tstart, period, burns, and x variables, and 2) you can easily set the burns variable and monitor the err variable while the code is running. All variables must retain their values for the next time the thread runs.

temp, ptest, tstart, and period should all be the same type as returned by time().

“period” is the period of the thread.

“i” and “burns” are 32-bit unsigned integers.

x should be a float, and should be initialized to zero.

*by "initialization I mean one-time (at startup) initialization, not initialization each time a thread executes.
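For concreteness, here is a rough Java sketch of the same idea. The class and variable names are placeholders, and it assumes System.nanoTime() as the timer; adapt as needed for your own language and framework.

// Rough sketch only: make one instance of this per periodic task/thread.
// Assumes System.nanoTime() as the low-overhead timer (a placeholder choice).
public class LoopMonitor {
    private final long periodNanos;  // expected period of this thread, in nanoseconds
    private long tstart;             // timestamp of the previous iteration
    private long ptest = 0;          // accumulated start-time jitter
    private int err = 0;             // count of missed cycles
    private long burns = 0;          // burn counter: set nonzero only while testing CPU margin
    private double x = 0.0;          // scratch variable for the burn loop
    private boolean first = true;    // handles one-time initialization of tstart

    public LoopMonitor(long periodNanos) {
        this.periodNanos = periodNanos;
    }

    // Call this at the very top of the periodic function.
    public void check() {
        long temp = System.nanoTime();               // line 10: grab the timer
        if (first) {                                 // one-time init so the first delta is zero
            tstart = temp - periodNanos;
            first = false;
        }
        ptest += temp - tstart - periodNanos;        // line 20: accumulate start jitter
        if (ptest > periodNanos) {                   // line 30: missed a cycle
            err++;
            ptest = 0;
        }
        tstart = temp;                               // line 40: remember for next iteration
        for (long i = 1; i <= burns; i++) {          // line 50: deliberately burn CPU
            x = 1.0 / (x + 1.0);
        }
    }

    public int getErrorCount() { return err; }

    public void setBurns(long burns) { this.burns = burns; }
}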

Interesting - on 1678 we’ve just started to care about our code being realtime in the last couple years, so I can chime in about our technique for debugging these issues.

We program our robots with a message-queue-based architecture. Basically, any input or output is a message (we use protobuf for this) that’s passed through a queue. We log every message that gets sent, which includes a timestamp. If anyone suspects an RT issue, we plot the derivative of the timestamp vs. the number of the message. Since we run at 200Hz, we expect to see 5ms for every point. Right now, we’re holding exactly to that in our threads that need to be realtime, and ±1ms in the gyro thread, which uses SPI.
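The check itself amounts to something like this (a rough sketch with hypothetical names, assuming the logged timestamps are in seconds):

// Sketch: scan logged message timestamps and flag iterations whose period
// strays from the expected 5 ms. "timestamps" is a hypothetical array of the
// logged message times, in seconds, in send order.
public class PeriodCheck {
    public static void report(double[] timestamps, double expectedPeriodSec, double toleranceSec) {
        for (int i = 1; i < timestamps.length; i++) {
            double dt = timestamps[i] - timestamps[i - 1];  // the "derivative" being plotted
            if (Math.abs(dt - expectedPeriodSec) > toleranceSec) {
                System.out.printf("message %d: dt = %.3f ms%n", i, dt * 1000.0);
            }
        }
    }
}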

This approach is nice if you have a logging system set up, but for a team that doesn’t, Ether’s approach may be more useful. The only caution that I have about it is that if you’re using the *Periodic functions in WPILib, your code will inherently not be realtime, since the *Periodic functions run when packets come from the driver station. If you want your code to be RT, you have to make your own threads, and deal with passing around the joystick data.

Your idea of a burn counter is interesting - maybe we should consider logging the amount of time left at the end of our loops so we can check how much overhead we have…

TeleOp runs once every time a packet is received.

What is the asterisk for in front of the word Periodic?

It’s what’s called a wildcard character, used to describe sets of things by a few different tools (regular expressions, Linux’s glob(), and others). * means “put anything here”. For example, consider the set of variables:

this_variable
that_variable
how_about_any_variable
variables_are_my_friend

The expression “*_variable” matches the first three, but not the last one.

(conclusion is that *Periodic means “any periodic control function”)

As to your initial comment, yup, we still do that. I do Java, and we tend to use a programming structure along the following lines:
-The “main” program is event-driven - whenever there’s a new control packet, we run the main code. Things which make sense to update on an event-driven basis go here (e.g., a button which opens/closes the claw - it doesn’t make sense to update the claw state until you have a new driver control packet).

More controls-y stuff that needs precise timing, independent of when commands arrive from the driver station, gets spawned off into its own threads. Java has the functionality you mention built in, allowing each thread to say to the operating system “start me at this interval” and “I’m done, you can go do other things till I need to be run again”.
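For example, plain Java’s ScheduledExecutorService can do this (the class and method names below are just placeholders, not our actual code):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the "start me at this interval" pattern in plain Java.
// runControlLoop() is a placeholder for the team's actual control code.
public class ControlLoopRunner {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // The executor wakes the task every 5 ms and yields the CPU in between.
        scheduler.scheduleAtFixedRate(this::runControlLoop, 0, 5, TimeUnit.MILLISECONDS);
    }

    private void runControlLoop() {
        // read sensors, run controllers, write outputs
    }
}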

We do track the metrics you mention though - in our standard logging, we keep track of the wall-clock execution time and call frequency of critical loops, as well as processor load. This is most useful when the drivers insist something is going wacky with the robot: we can help determine whether there was a runtime anomaly in the code that caused control to freeze up.

Curiosity has the better of me:

Why was 200 Hz chosen? Is this for all your control threads?

Historically, I’ve seen 20ms (50Hz) being the more common choice, and I was always under the impression that this was because the PWM signals sent to motor controllers were 1-2ms pulses sent every 20ms (so you couldn’t actually change your control effort faster than every 20ms). However, if you’re using all CAN…

One of the tricks I’ve used to see CPU overhead in “realtime-ish” Unix systems is the ps command with a custom set of output fields. Running the ps command in a 1-second ‘watch’ loop, appending it to a file, and then taking the results (which are tab-separated) into Excel shows spikes in CPU. If a process took a delta of more than 1000ms of CPU time per line, it meant the process was taking multiple cores. One benefit to this method is that it doesn’t require extra code.

watch -n 1 "ps -p [process id] -o pid,time,etime >> /path/to/file.txt"

“time” (aka cputime) is total CPU time since the process was started. “etime” is total time elapsed since the process was started.

This method isn’t precise per-thread or per-loop, but it isn’t meant to be. Apologies if the syntax is wrong; it’s from memory.

I’m not sure what the RoboRIO supports though.

Yes, I know a little bit about wildcard characters.

Anyway, my concern was that students reading wesleyac’s post might get the wrong idea about periodic tasks running only when a packet arrives.

Yep, 200Hz on all of our control threads. Basically, the higher the frequency, the better control you will get, and it’s pretty easy to hit 200Hz (5ms cycles), even if you’re doing a lot of logic in the control threads.

50Hz is a common choice, and I think the historical reason for that has to do with it being easy to send over a PPM signal for radio-controlled applications. However, most things that can use PWM can accept anywhere from 40-200 Hz. The Talon SRX datasheet shows a minimum PWM input period of 2.9ms, so theoretically we could go up to about 344Hz. I’m not sure about the RoboRIO side of things, but IIRC the PWM signals are generated on the FPGA, so they should be pretty fast. The datasheet says that the max frequency of the PWM lines is 150kHz, but I don’t think that’s referring to the actual rate that you can update them in code. In any case, everything in the pipeline can definitely handle 200Hz.

We don’t use CAN for speed controllers (I could do a whole entire rant on why), but one thing that I haven’t looked into is how fast you can update. There will likely be more latency on CAN packets than PWM, and I would suspect that you would have a lower possible update rate, but I don’t have hard data on any of that, and it’s possible I’m wrong.

Is the core code behind 1678’s custom threads open-sourced? I wouldn’t mind taking a look. We run custom stuff as well, but the rate is based upon our actual loop processing time + a 5ms delay via the WPILib Timer.delay() method.

Please tell us why, without ranting.

but one thing that I haven’t looked into is how fast you can update. There will likely be more latency on CAN packets than PWM, and I would suspect that you would have a lower possible update rate, but I don’t have hard data on any of that, and it’s possible I’m wrong.

Minimum PWM period is 5ms. So that would be 200Hz. But teleop happens only every 20ms. So that’s the speed limit on reacting to DS commands.

CAN Talon has a user-adjustable control frame rate. Default is 10ms (twice as fast as teleop). See Section 20.7 of the software reference manual available at the manufacturer’s web site. CAN Talon built-in PIDF period is 1ms.

oops, understood now. Hopefully someone else can learn something useful from that :slight_smile:

We have our code release from kickoff here. Our 2017 code won’t be public until our competition season is over, but the basic architecture is the same.

We use a std::thread object, and we use some code that we stole from 971 for the loop timing. It’s in third_party/aos/common/util/phased_loop.cc. It takes into account the loop processing time, which is an advantage over using a static delay.
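The idea, roughly (this is just a conceptual sketch, not 971’s actual code): sleep until the next multiple of the period instead of sleeping a fixed amount, so the schedule doesn’t drift as the work in the loop varies.

// Conceptual sketch of a phased loop (names are placeholders; see
// third_party/aos/common/util/phased_loop.cc for the real thing).
public class PhasedLoop {
    private final long periodNanos;
    private long nextWakeup;

    public PhasedLoop(long periodNanos) {
        this.periodNanos = periodNanos;
        this.nextWakeup = System.nanoTime() + periodNanos;
    }

    // Call at the end of each iteration; blocks until the next period boundary,
    // however long the iteration's work took.
    public void sleepUntilNext() throws InterruptedException {
        long now = System.nanoTime();
        while (nextWakeup <= now) {       // if we overran, skip ahead to the next boundary
            nextWakeup += periodNanos;
        }
        long remaining = nextWakeup - now;
        Thread.sleep(remaining / 1_000_000L, (int) (remaining % 1_000_000L));
        nextWakeup += periodNanos;
    }
}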

CAN has failure modes that are much more difficult to debug (and scarier).

A wiring failure in a PWM cable will cause issues with that motor only, and can easily be found with continuity checking on a multimeter.

A wiring failure in a CAN cable can cause effects across the entire network, not all of which are obvious or easy to reason about. Additionally, for what is essentially a network protocol, CAN has almost no checking of the data that’s received. Writing to CAN (at least through WPILib) is non-realtime.

I’m aware that the frame rate is user adjustable. What I’m not sure about is the practical upper bound of that rate. Can you go down to 5ms, as you can with PWM? If so, how many devices can you update with a 5ms rate before you start to saturate the CAN bus?

Yeah, the built-in PIDF on the Talons looks really nice, and seems to me like the main advantage of switching to CAN.

What kind of data checking do you think is not in the CAN spec, but should be?

Can you go down to 5ms

Please describe a realistic FRC scenario that would require updating the commands to a CANTalon (running its built-in PIDF) at a 5ms update rate.

I don’t have very much CAN experience, but as an example, you can’t know if the packets that are being sent are being received by all of the CAN nodes. I also don’t like the arbitration method that’s used - any node sending a dominant bit can cause all other nodes to stop transmitting.

That being said, most of my concerns with CAN aren’t about CAN as a protocol, but about the failure modes that it has.

I have no doubt that CAN works well when properly implemented (clearly it works well in cars), but I don’t trust an FRC team to implement it properly. We’ve seen enough CAN issues that I definitely don’t trust our team to implement it properly.

When running internal controllers, updating at 200Hz is insane overkill. However, there are cases where we don’t want to run the built-in PIDF. For example, on our 2016 robot, we cancelled out the force of gravity on our arm with a voltage proportional to the cosine of the angle. There’s no way to get that in a Talon PIDF loop. For applications where we want to use an observer, we wouldn’t be able to use a Talon PIDF loop. As far as I can see, if you want to use a gyro + encoder setup for your drivetrain, even if you aren’t using an observer, you wouldn’t be able to use the Talon PIDF loop.
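In code it’s just a feedforward term added on top of the feedback output; a rough sketch (names are placeholders, kG is a tuned constant):

// Sketch of cosine gravity compensation for an arm (hypothetical names).
// kG is a tuned constant: roughly the voltage needed to hold the arm level.
public class ArmGravityCompensation {
    private final double kG;

    public ArmGravityCompensation(double kG) {
        this.kG = kG;
    }

    // measuredAngleRadians is the process-variable angle, with 0 = horizontal.
    public double calculate(double feedbackVolts, double measuredAngleRadians) {
        return feedbackVolts + kG * Math.cos(measuredAngleRadians);
    }
}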

Also, one thing that I’m not sure about is logging - we log all of the sensor values and outputs for all of our mechanisms. This helps a ton in terms of debugging. In order to get that for a PIDF loop, we’d need to decrease the rate that we log at (the default for encoders is 100Hz IIRC, which is not nearly as much as I would like). Is there a good way to log all inputs and outputs at a reasonable rate without saturating CAN?

The cosine of the setpoint angle or the process variable angle?

The process variable. Wouldn’t make much sense to do it any other way :stuck_out_tongue:

Can’t tell if the emoticon means you are joking.

It doesn’t - I don’t see much point in applying a correction voltage for an angle where the arm isn’t…

Would someone else care to comment on this?

The point of the corrective voltage is to help your arm stay where you want it under the influence of external forces. If your arm isn’t where you want it to be, there isn’t much point in helping it stay there.

If your setpoint is really far away, maybe cos(angle) is drastically different at the setpoint than at the current PV. In that case, a common strategy is to use a motion profile so that your corrective term smoothly travels from your current PV to your setpoint.