TimedRobot loop overruns and inconsistent loop timing (Java)

I have been working with Team 8 for the past few days, and we’ve noticed that our loop period is significantly overrunning the allotted 20ms. It’s currently on the order of 100-150ms (it varies greatly), so all the code is only running at about 7-10Hz, which is painfully slow and visibly degrades control responsiveness.

We spent some time tracing which part of the code is taking up all of this time. What surprised me is that there doesn’t seem to be a single holdup – everything just runs slowly.

We started by logging the time taken for each function call in teleopPeriodic (first image). It makes sense that reading sensors (update_state) and setting outputs (update_hardware) take the longest, but the timing of all the function calls seems an order of magnitude off.
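For context, the per-call timing was along these lines (a simplified sketch rather than our exact code; update_state/update_hardware are placeholders standing in for the real methods):

```java
import edu.wpi.first.wpilibj.TimedRobot;
import edu.wpi.first.wpilibj.Timer;
import edu.wpi.first.wpilibj.smartdashboard.SmartDashboard;

public class Robot extends TimedRobot {
  @Override
  public void teleopPeriodic() {
    double start = Timer.getFPGATimestamp();          // seconds, from the FPGA clock
    update_state();
    double afterState = Timer.getFPGATimestamp();
    update_hardware();
    double afterHardware = Timer.getFPGATimestamp();

    // Report each section's duration in milliseconds
    SmartDashboard.putNumber("update_state ms", (afterState - start) * 1000.0);
    SmartDashboard.putNumber("update_hardware ms", (afterHardware - afterState) * 1000.0);
  }

  private void update_state() { /* placeholder: read sensors */ }
  private void update_hardware() { /* placeholder: write outputs */ }
}
```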

We then looked at the timing within update_state, trying to pinpoint a single function call that was blocking (second image). What we found was that each part of the method seemed to just be taking a long time. The encoders are the Hall-effect sensors on the NEOs, the Pigeon is connected via a Talon SRX, and the ultrasonic is through the RIO.

Lastly, we tried emptying teleopPeriodic and measuring how accurate the loop period was (third image). Surprisingly, the loop period varied from 15ms to 21ms, eventually settling into an oscillation between 19ms and 21ms after roughly 100 cycles. (The number in the image is just System.nanoTime() / 1.0e6, i.e. the timestamp in milliseconds.)
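The loop-period measurement itself was just the delta between successive nanoTime() readings, roughly like this (a simplified sketch, not the exact code from the screenshot):

```java
import edu.wpi.first.wpilibj.TimedRobot;

public class Robot extends TimedRobot {
  private long lastLoopNanos;

  @Override
  public void teleopPeriodic() {
    // Otherwise empty; just measure the time between successive calls.
    long now = System.nanoTime();
    if (lastLoopNanos != 0) {
      double periodMs = (now - lastLoopNanos) / 1.0e6;  // nanoseconds -> milliseconds
      System.out.println("Loop period: " + periodMs + " ms");
    }
    lastLoopNanos = now;
  }
}
```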

I took a quick look at the CPU utilization with our normal code, which hovers around 70% and occasionally spikes to 85-90% for very brief periods of time. The CAN bus also sits very consistently around 60-70% utilization. All of our libraries and firmware are up to date as well.

We are thoroughly perplexed as to what could be causing the loop overruns and the inconsistent loop period, and whether the two observations are connected. I am under the impression that other teams run more computationally intensive tasks on their roboRIO without problems, which further confuses me.

Open to any suggestions of how to further diagnose / explore this, or comments on similar experiences.



The latest roboRIO image (v14) is supposed to fix this. That said, even with the latest image I still get loop overrun warnings, especially when running the robot in autonomous. If someone could fix that I would be mighty happy.

Yeah even updating to that we still run into these issues. I’m thinking about trying to toss everything into its own thread and eliminating the wpilib notifier from the equation.
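What I have in mind is roughly this (an untested sketch; the 20 ms period is arbitrary and update_state/update_hardware are placeholders for our real loop):

```java
import edu.wpi.first.wpilibj.TimedRobot;

public class Robot extends TimedRobot {
  @Override
  public void robotInit() {
    // Run our own 20 ms loop on a plain Java thread instead of relying on
    // the Notifier-driven TimedRobot periodic methods. Just a sketch.
    Thread loopThread = new Thread(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        long start = System.nanoTime();
        if (isEnabled()) {
          update_state();      // placeholder: read sensors / run state machine
          update_hardware();   // placeholder: write outputs
        }
        // Sleep out the rest of the 20 ms period (coarse, no drift correction)
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        try {
          Thread.sleep(Math.max(0, 20 - elapsedMs));
        } catch (InterruptedException e) {
          return;
        }
      }
    });
    loopThread.setDaemon(true);
    loopThread.start();
  }

  private void update_state() { /* placeholder */ }
  private void update_hardware() { /* placeholder */ }
}
```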

A band-aid, but yes, it could help eliminate the warning.

Another issue that was rumored to affect everyone was that some CAN communication in the REV Spark Max libraries was taking longer than it should have on some of the beta releases.

Are you able to post code? Another set of eyes may be able to do some analysis to see if there’s anything obvious slowing it down.

I’m having the same issue as well. At first I thought it was CTRE calls taking the longest, but I then noticed it varies a lot for us as well and CTRE calls may not be the problem.

We have our loop time at 55 ms and with that, the only overruns I get are when it’s initializing or when a period starts, which is acceptable.
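For reference, I believe the longer period can just be passed to the TimedRobot constructor in 2019 WPILib (worth double-checking against your version). A minimal sketch:

```java
import edu.wpi.first.wpilibj.TimedRobot;

public class Robot extends TimedRobot {
  public Robot() {
    // Run the periodic methods every 55 ms instead of the default 20 ms
    super(0.055);
  }
}
```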

We’re using v14 as well.


We saw some issues last night, and think it may have been related to flashing the radio with the bandwidth limit on. We’re going to probe that more tonight, but figured I’d share that breadcrumb in case it helps others.

Tried moving all of the code to another thread which had no effect. So I guess we’ve confirmed that the API calls and state machine are just running slowly.

Checked the DS logs, attached a screenshot. The voltage blip is a teleop enable with a quick drive forwards and backwards. Our CPU utilization is a bit high, but I don’t see any evidence of throttling (never hits 100%).

Yes! https://github.com/team8/FRC-2019-Public

If you notice anything suspicious that might be causing these problems, please don’t hesitate to point it out.

v14 fixed an intermittent watchdog timeout that occurred rarely and would disable the robot for 2-3 seconds. It did nothing to change loop timing.

We didn’t have any luck after updating to v14, same issues persisted. Wondering if WPILib 2019.4.1 will magically fix this. Going to try tomorrow.

That version doesn’t fix it. The changes in that version were really small anyway.

I notice you use solenoids. https://github.com/wpilibsuite/allwpilib/issues/1622 might be relevant if you’re setting/getting them often. I know of at least one instance where solenoid calls were taking much longer than I would have expected. Historically PDP inputs have had a similar issue, but it was supposed to be fixed this year.
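If that issue is in play, one cheap workaround is to only touch the solenoid when the commanded state actually changes. A rough sketch (the class, channel, and member names are made up):

```java
import edu.wpi.first.wpilibj.Solenoid;

public class Intake {
  private final Solenoid solenoid = new Solenoid(0);  // PCM channel 0 is a placeholder
  private boolean lastCommanded;
  private boolean hasCommanded;

  /** Only issues a Solenoid.set() when the commanded value actually changes. */
  public void setDeployed(boolean deployed) {
    if (!hasCommanded || deployed != lastCommanded) {
      solenoid.set(deployed);
      lastCommanded = deployed;
      hasCommanded = true;
    }
  }
}
```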


The radio didn’t help. We are still seeing an issue just like this, which we cannot figure out. We do use a lot of controllers on 0.01 s loops, and they are often called as much as 0.5 seconds apart. Tried 2019.4; tried newer CTRE libs. Nothing seems to help. We can mitigate it a bit by starting our control loops in robotInit, but that just reduces the delays; it does not eliminate them.

On this topic, other than manually inserting time logging, does anyone have a profiling tool recommendation?

I’d be interested too. I’ve not yet found a great solution that works on the RIO. There are a number of desktop profilers (JProfiler is one I’ve used) that work fine, but they don’t capture whether hardware interaction is what’s bogging you down.

Looking through it a bit now. One thing I noticed during disabled was a call to System.gc(), which may cause some variability in timing while disabled (could that explain the intermittent packet loss above? maybe?), but it shouldn’t impact behavior while enabled.
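If you want to keep the manual GC but reduce the timing noise while disabled, one option (just a sketch, and assuming the call currently lives in disabledPeriodic rather than disabledInit) is to run it once on entering disabled:

```java
@Override
public void disabledInit() {
  // Run the manual GC once when entering disabled, rather than calling
  // System.gc() from disabledPeriodic every loop (assumption about where it lives now).
  System.gc();
}
```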

Your CSV logger does appear to allocate multiple new Hashtables every loop as well, I think? We do similar things, so this isn’t the sole cause of the issue, but it might be worthwhile gutting that class temporarily to see if that cuts down on execution time.
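If it turns out to matter, the allocation can usually be hoisted out of the loop. A rough sketch of the idea (not your actual logger; class and method names are made up):

```java
import java.util.HashMap;
import java.util.Map;

public class CsvLogger {
  // Reuse one map and one StringBuilder instead of allocating new ones every
  // loop, to reduce garbage-collection pressure.
  private final Map<String, Double> values = new HashMap<>();
  private final StringBuilder line = new StringBuilder();

  public void put(String key, double value) {
    values.put(key, value);
  }

  public String buildLine() {
    line.setLength(0);  // clear the builder without reallocating
    for (Map.Entry<String, Double> entry : values.entrySet()) {
      line.append(entry.getKey()).append('=').append(entry.getValue()).append(',');
    }
    return line.toString();
  }
}
```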

Your startup variability with empty teleopPeriodic and later stability is in line with behavior we’ve seen in V12/V13. I haven’t checked with V14 yet.

Just to add to the thread, we are also repeatedly overrunning the loop with roboRIO image v14, the WPILib 2019.4 update, and the latest Phoenix, Spark Max, and NavX libraries.

We’ve been able to bring our loop time down by removing some of the periodic Spark Max config calls as well as eliminating all print statements. This was after a suggestion from @calcmogul that I/O that uses netcomm (print statements, CAN) might be introducing significant lag.
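For anyone else hitting this, the gist of the change was along these lines (a rough sketch rather than our actual code; the device ID, current limit, and one-second print throttle are placeholders):

```java
import com.revrobotics.CANSparkMax;
import com.revrobotics.CANSparkMaxLowLevel.MotorType;
import edu.wpi.first.wpilibj.TimedRobot;
import edu.wpi.first.wpilibj.Timer;

public class Robot extends TimedRobot {
  private CANSparkMax motor;
  private double lastPrintTime;

  @Override
  public void robotInit() {
    motor = new CANSparkMax(1, MotorType.kBrushless);  // device ID 1 is a placeholder
    // Configure once here instead of every loop; each config call is a
    // blocking CAN transaction.
    motor.setSmartCurrentLimit(40);
  }

  @Override
  public void teleopPeriodic() {
    motor.set(0.5);  // placeholder output

    // Throttle prints: stdout goes through netcomm and can add noticeable lag.
    double now = Timer.getFPGATimestamp();
    if (now - lastPrintTime > 1.0) {
      System.out.println("Motor output: " + motor.get());
      lastPrintTime = now;
    }
  }
}
```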

The loop time is still variable, generally staying around 20-40ms but occasionally spiking to 100+ms for a few cycles. Doing some more investigating to see what the spikes are about.

Can confirm that we are experiencing the same issue with Java. I suspect that prints, config calls, SmartDashboard calls, etc. can cause these issues, but I will update if I have more input.

We haven’t actually logged how long each loop takes, but one thing is for sure: we get at least two 20ms loop overrun messages every time we enable the robot. The first one appears within a few seconds of the robot initializing, and the second comes up randomly. The only time-consuming task I can think of is parsing a CSV file for autonomous, which gets added to an ArrayList (~300 trajectory points).

Before this, when we used WPI_TalonSRX, we would get even more warnings from the motor safety because the motors were not being updated often enough. That actually caused physical problems, making the motors jitter, since the motor safety would turn them off until the next periodic cycle. So we ended up disabling the motor safety (bad idea, I know), which stopped the problem. Same thing with using DifferentialDrive. I checked other teams’ code and saw that whoever used WPI_TalonSRX needed to disable motor safety, and I didn’t see anyone who hadn’t, especially among top-tier teams.

So I believe it’s just packets getting dropped when the robot is connected wirelessly. We hardly ever connect through Ethernet, since we deploy and test code frequently and don’t want to be tethered down. Even when we do use Ethernet, we forget to disable Wi-Fi, so the laptop keeps looking for other networks, which can cause packets to be dropped.
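To illustrate what I mean by the CSV parse, here’s a simplified sketch (not our actual code; the path and point format are placeholders) of loading the trajectory once in robotInit. Doing it once at startup would explain an overrun around initialization, but not the random one later:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import edu.wpi.first.wpilibj.TimedRobot;

public class Robot extends TimedRobot {
  // ~300 trajectory points, parsed once at startup
  private final List<double[]> trajectory = new ArrayList<>();

  @Override
  public void robotInit() {
    // Do the file I/O and parsing once here, not in autonomousInit/Periodic
    // where it could overrun a loop. The path is a placeholder.
    try (BufferedReader reader = Files.newBufferedReader(Paths.get("/home/lvuser/auto.csv"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] fields = line.split(",");
        double[] point = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
          point[i] = Double.parseDouble(fields[i]);
        }
        trajectory.add(point);
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
```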