CPU spike to 100% - robot unresponsive

Hello all,

We had an issue where our robot became unresponsive during two matches - one practice match, and one elimination match.

Looking at the driver station logs, we saw that the CPU usage of the cRIO spiked to 100%. The motors continued to run at the voltages they were set at before this moment - this meant that we drove (usually at full speed) into a wall and then stalled our motors.

In the past, when we’ve seen deadlocks (like when two WhileHeld commands both required the same Subsystem) CPU usage would drop to near 0%. This is making us believe the problem is in some sort of infinite loop somewhere… likely in our own code.

Has anybody seen an issue like this before? If so, how did you debug/diagnose it?

Thanks!

Adding a

Timer.delay(0.002);

at the end of a teleoperated loop can work wonders.
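One thing to watch with the delay argument: Timer.delay takes its argument in seconds as a double, so an integer expression such as 2/1000 is evaluated with integer division and comes out as 0, which means no delay at all. A minimal standalone sketch of the difference (the class and method names here are made up for illustration and are not part of WPILib):

```java
// Standalone illustration; on the robot the value would be passed to Timer.delay(seconds).
class DelayArgument {
    /** Integer division: 2 / 1000 is 0, so this yields a 0.0 second delay. */
    static double integerForm() {
        return 2 / 1000;
    }

    /** Floating-point literal: 0.002 seconds, i.e. 2 ms. */
    static double doubleForm() {
        return 0.002;
    }
}
```

Writing the value as 0.002 (or 2.0/1000) gives the intended 2 ms pause.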

Sounds like a code thing. You’re probably running a loop somewhere under a certain condition. Usually you wouldn’t see anything close to 100% from normal usage.

The fact that your drivetrain becomes unresponsive but continues to drive could mean one of two things. One is that something else loops and takes all the processing time without updating the drivetrain voltage (this is less likely, since motor safety would need to be off). The other is that something in your code (PID?) is looping and setting your drivetrain to a certain speed for that period of time, without letting anything else run. Is one of your triggered commands running something on the drivetrain?

Let me know if that makes sense, I could take a look at the code if it’s Java too.

I have helped teams debug this problem at events many times over the years, and the majority of the time the issue is a “for” or “while” loop where there shouldn’t be one. With an Iterative or Command-based robot, you seldom need loops in user code.

Other less common causes are interrupts or TimerTasks that take too long to run and miss their deadlines, very long print statements, and deadlocks.

I would put the robot up on blocks and try to reproduce the failure.

One quick-and-dirty strategy that I use on desktop apps is to run the program in a debugger, then pause execution and look at the current line once I hit 100% CPU usage. If your CPU utilization is normally fairly low, this method has good odds of stopping in the problematic section. Consider it CPU profiler roulette :)
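If no debugger is handy, the same "pause and look" idea can be approximated on a desktop JVM by sampling a thread's stack from another thread. This is only a sketch and assumes standard Java SE, where Thread.getStackTrace() is available; I would not assume the Java VM on the cRIO supports it. The class name, the spin() method and the keepSpinning flag are hypothetical, there only to give the sampler something to catch:

```java
// Desktop Java SE sketch: sample where a busy thread currently is.
class StackSampler {
    /** Flag that lets the demo loop be stopped from outside. */
    static volatile boolean keepSpinning = true;

    /** Returns the current stack of the given thread as text, one frame per line. */
    static String sample(Thread t) {
        StringBuilder sb = new StringBuilder();
        StackTraceElement[] frames = t.getStackTrace();
        for (int i = 0; i < frames.length; i++) {
            sb.append(frames[i].toString()).append('\n');
        }
        return sb.toString();
    }

    /** Hypothetical runaway loop, used only to demonstrate the sampler. */
    static void spin() {
        while (keepSpinning) {
            // busy-wait: burns CPU without ever yielding
        }
    }
}
```

Taking a handful of samples a few hundred milliseconds apart and seeing the same method at the top each time is a decent hint that the loop lives there.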

Are the motor safeties on a separate thread (and thus somewhat unaffected by a cpu usage spike)? I would hope that the watchdog is somehow resilient to some of our code demanding lots of cpu time.

joelg236, I’ll see about packaging up our code.

Reproducing the failure has been the hardest part - we’ve only been able to get it to reproduce in matches. We ran the robot substantially in the pits to no effect.

We have a spare cRIO back at home that we can run code on, but it will be missing sensors/actuators.

I think the motor safeties are in a separate thread. I do know, though, that they are not enabled by default unless you are using the RobotDrive class, which, since you guys are octocanum, I would guess you are not. So if you have not explicitly enabled them and are not using the RobotDrive class, there would be no motor watchdog enabled. The FPGA watchdog will only shut the robot off if it loses connection to the DriverStation.

What do the ping statuses look like during those matches? That could help debug as well.

Usually when I’ve seen 100% CPU usage, the cRIO actually stops responding to the DriverStation, and the watchdog would usually kill the robot. That would show up in the logs as well.

If anybody wants to look at our code, I’ve placed a drop here:

For some basic explanation of the architecture:
http://www.chiefdelphi.com/media/papers/2912

I’ve also attached a picture of the DS log of the elimination match. Ping / packet loss looks consistent before and after the 100% event.

The complete loss of connection you see about 75% of the way through is where our drivers did a remote soft reset of the cRIO - it comes back up just seconds before the match ends, I think.





So that log definitely has something wrong. The line that is across 16 shows what the DS is commanding to the robot. The lines above that show what mode the robot is actually in. By the looks of it, the robot stopped reporting what mode it was in when the CPU jumped to 100%. I don’t know if it’s the FPGA that reports those things or the CPU; that is a question the NI people could answer better. But it could help narrow down where the error is.

The robot reports the code state, in the iterative robot base class. It lends credence to the theory that the robot is getting stuck in a loop.

We had a similar issue, caused by a loop. Essentially, we had something like (pseudocode):


while(limitSwitch == false)
{
    motor = -1.0;
}

We just added a small Timer.delay in there and it completely solved our problem.
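For anyone who wants to keep a loop like that, here is a rough sketch of the same wait with a short sleep and a timeout added, so it can neither hog the CPU nor run forever if the switch never trips. The Switch and Motor interfaces are hypothetical stand-ins for the real sensor and speed controller, and Thread.sleep stands in for Timer.delay so the example runs off-robot:

```java
// Sketch: wait for a limit switch without spinning at 100% CPU or hanging forever.
class BoundedWait {
    /** Hypothetical stand-in for a limit switch input. */
    interface Switch { boolean isPressed(); }

    /** Hypothetical stand-in for a speed controller. */
    interface Motor { void set(double speed); }

    /**
     * Drives the motor until the switch trips or timeoutMs elapses.
     * Returns true if the switch tripped, false on timeout or interruption.
     */
    static boolean runUntilPressed(Switch sw, Motor motor, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        boolean pressed = sw.isPressed();
        while (!pressed && System.currentTimeMillis() < deadline) {
            motor.set(-1.0);
            try {
                Thread.sleep(5); // yield the CPU; on the robot this would be Timer.delay(0.005)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            pressed = sw.isPressed();
        }
        motor.set(0.0); // always stop the motor on the way out
        return pressed;
    }
}
```

The timeout value is a judgment call; it just needs to be longer than the mechanism ever legitimately takes.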

We were able to find the issue by comparing our match logs to what we actually did in the match - we saw that every time the CPU spiked, it was while we were shooting the ball. So, we could narrow down where we were looking and walk through the code until we found it. Note that it didn’t cause issues every time it spiked, only a couple of times.

Yeah, we suspect the same - but there are no obvious candidates. We’ve been searching through our code and only have a few candidate loops (of the while / for variety).

This weekend we’ll try and run some longhaul/stress tests on our backup cRIO and see if we can reproduce the issue.

As a general rule, try not to use while loops. Parts of your robot project already get called repeatedly for you: in a command you can use execute() as your loop, and you can also use teleopPeriodic() in the main robot class.
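To make that concrete, here is a rough sketch of a limit-switch wait written without any loop in user code. The method names mirror the execute() / isFinished() / end() hooks of a command, but the class is standalone and the Switch and Motor interfaces are hypothetical stand-ins, so treat it as an outline and not as actual WPILib code:

```java
// Sketch: the "run until the switch trips" logic split into non-blocking steps.
class RetractUntilSwitch {
    /** Hypothetical stand-in for a limit switch input. */
    interface Switch { boolean isPressed(); }

    /** Hypothetical stand-in for a speed controller. */
    interface Motor { void set(double speed); }

    private final Switch limit;
    private final Motor motor;

    RetractUntilSwitch(Switch limit, Motor motor) {
        this.limit = limit;
        this.motor = motor;
    }

    /** Called once per scheduler pass; does one small step and returns immediately. */
    void execute() {
        motor.set(-1.0);
    }

    /** Polled by the caller after each execute(); never blocks. */
    boolean isFinished() {
        return limit.isPressed();
    }

    /** Called once when isFinished() reports true. */
    void end() {
        motor.set(0.0);
    }
}
```

The looping is done by whatever calls these methods periodically, so the rest of the robot code keeps getting its turn between passes.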

We’ve been trying to repro the issue and have been inspecting our code, with no luck.

Let’s assume for a minute that we can’t find the root issue in time. What other options are there?

Is there a way to perform a quick program reset? Instead of rebooting the cRIO, is there any way to just kill the robot process and start a new one?