Robot main thread hangs

In addition to the problem that we and other teams are experiencing with RoboRio lockups that require a reset or power cycle to recover from (under discussion here), we are also experiencing, less frequently, the robot code’s main thread hanging for extended periods (tens of minutes).

The two situations are quite different: in the other case the Driver Station Comms and Robot Code indicators go red and only a reset or power cycle recovers. In this case, the Code and Comms lights stay green but the robot main thread doesn’t do anything. Fortunately in this case the Driver Station Restart Robot Code button works to recover.

I’ve attached gdb to the hung thread and this is what I see:

[Image: gdb output for the hung thread]

I can’t be sure, but I’m suspicious of some code in wpilibc’s Notifier, and have reported it as issue 2401 for allwpilib on GitHub (Notifier alarmCallback may fail to schedule the next alarm in some circumstances · Issue #2401 · wpilibsuite/allwpilib · GitHub).

I’m curious if anyone else is seeing this. It typically occurs when we leave the robot running and disconnect the driver station, and the thread may resume running some minutes after we reconnect the driver station. Because the intervals are so long, it’s hard to be patient enough to get good timings for them.

What third party libraries are you using?

Are you using the REV V3 color sensor from the KOP?

Thanks for the inquiries. The 3rd party libraries are acquired from:

  • navx_frc.json
  • REV2mDistanceSensor.json
  • REVRobotics.json
  • Phoenix.json
  • REVColorSensorV3.json

Yes, the code we are running uses the color sensor. However, none of the devices listed are actually connected to the Rio when this occurs.

We’re pretty sure the REV V3 Color Sensor was causing a similar issue for us, and 5172 came to a similar conclusion.
See roboRIO hanging / No Robot Communication and Strange Loop Execution Time Jump

I think I’ve figured it out but won’t get a chance to confirm until tomorrow night.

The way TimedRobot schedules loop iterations is to keep adding the declared robot period to the expiration time; this is reasonable if the average loop execution time is less than the declared period, as it means that the code can catch up after occasional overruns. But, if the average loop time exceeds the declared period then the requested wake-up times fall farther and farther in the past. Since the expiration times are (at the hardware level) stored in a 32-bit register, eventually a time that is in the past looks like it is in the future and things hang until that future time is reached.
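To make the wrap-around effect concrete, here is a minimal sketch in Python. It assumes the comparison behaves like a signed 32-bit difference between the deadline and the current time, which matches the description above but is not taken from the HAL or FPGA source. The helper `us_until` and the sample times are made up for illustration.

```python
MASK32 = 0xFFFFFFFF  # a 32-bit register keeps only the low 32 bits
HALF = 1 << 31       # half of the 32-bit range, in microseconds


def us_until(deadline_us: int, now_us: int) -> int:
    """Signed 32-bit view of (deadline - now), in microseconds.

    A positive result means the deadline looks like it is in the future.
    (Hypothetical helper; an assumed model, not the real hardware logic.)
    """
    diff = (deadline_us - now_us) & MASK32
    return diff - (1 << 32) if diff >= HALF else diff


now = 3_000_000_000  # arbitrary current time in microseconds

# A deadline 1 second in the past is correctly seen as overdue (negative).
slightly_late = us_until(now - 1_000_000, now)

# A deadline just over half a wrap in the past looks like it is in the future.
very_late = us_until(now - (HALF + 1_000_000), now)

print(slightly_late)  # negative: overdue
print(very_late)      # positive: appears to be far in the future
```

Under this model, once the requested wake-up time is more than 2^31 microseconds behind the clock, it stops looking overdue and starts looking like a long wait.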

In our case the declared period is 10 ms. With the Driver Station connected this is fine, as the average loop execution time is then about 5 ms. But with no Driver Station, for reasons I don’t understand, the loop execution time jumps to about 15 ms, triggering the situation described above. A pretty clear indication that this is the problem would be if you saw it hang and then, about 36 minutes later, wake up again spontaneously. That is half the wrap-around time of the 32-bit microsecond counter: while hung, your loop doesn’t run, so real time catches up to the future expiration at a real-time rate.
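The difference between the two cases can be sketched with a toy model of the scheduling loop. This is a simplification written for illustration, not the TimedRobot implementation: the function `lag_after` is hypothetical, every iteration is assumed to take exactly the same time, and all times are in microseconds.

```python
def lag_after(iterations: int, period_us: int, exec_us: int) -> int:
    """Return how far behind the current time the next requested wake-up is
    after the given number of iterations (0 if it is still in the future)."""
    now = 0
    expiration = period_us
    for _ in range(iterations):
        now = max(now, expiration)  # sleep only if the wake-up time is still ahead
        now += exec_us              # run the loop body
        expiration += period_us     # next wake-up = previous wake-up + period
    return max(0, now - expiration)


# 10 ms period, 5 ms loop body: the schedule never falls behind.
print(lag_after(1000, 10_000, 5_000))

# 10 ms period, 15 ms loop body: the lag grows by 5 ms on every iteration.
print(lag_after(1000, 10_000, 15_000))
```

In the second case the lag grows without bound, so under the signed 32-bit assumption it is only a matter of time before it reaches half the counter range.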

@gdefender: this could explain why you saw this problem when the Color Sensor was adding significantly to your loop execution time.

In our case we’ll probably just increase the declared period to 20 ms and be OK. In fact we’d probably be OK in competition anyway, because the robot is never on long enough to reach the point where the past looks like it’s in the future.

We’ll do some experiments and I’ll post the results.


Huh… an Integer underflow condition… interesting…

I confirmed this morning that it behaves as expected. With T being the rollover period of the 32-bit microsecond clock, the robot code hangs after

  (T/2) / ((actual robot-code period) / (declared TimedRobot period) - 1)

and recovers T/2 later.

Numerically, for the situation I was testing, this is (~35.8 minutes) / (~15 ms / 10 ms - 1) = ~72 minutes to hang and ~35.8 minutes to recover.
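For anyone who wants to plug in their own numbers, this small Python sketch evaluates the expression exactly as stated in this post. It takes half the rollover period to be 2^31 microseconds, and the function name `minutes_to_hang` is made up for illustration. The result is only as good as the average loop time you feed it.

```python
HALF_ROLLOVER_US = 1 << 31  # half of the 32-bit microsecond counter range
half_rollover_min = HALF_ROLLOVER_US / 1_000_000 / 60


def minutes_to_hang(actual_period_ms: float, declared_period_ms: float) -> float:
    """Evaluate (T/2) / (actual/declared - 1), in minutes, as given in the post.

    Only meaningful when the actual period exceeds the declared period.
    """
    if actual_period_ms <= declared_period_ms:
        raise ValueError("no hang expected: actual period does not exceed declared period")
    return half_rollover_min / (actual_period_ms / declared_period_ms - 1)


hang_min = minutes_to_hang(15.0, 10.0)
print(f"half rollover: {half_rollover_min:.1f} min")
print(f"time to hang:  {hang_min:.1f} min")
```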

Thus the answer to this problem is: make sure that your configured timed robot cycle time is bigger than your average actual cycle time. It is not a bug in the library – it is behaving as designed and expected. We just didn’t have the right expectation!
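A simple way to check this is to log loop durations and compare their average against the declared period. The sketch below is plain Python with a made-up helper name and made-up sample timings; it is not a WPILib API, just the comparison spelled out.

```python
from statistics import mean


def period_is_safe(loop_times_ms, declared_period_ms):
    """True if the average measured loop time is below the declared period,
    so occasional overruns can still be caught up."""
    return mean(loop_times_ms) < declared_period_ms


connected = [5.0, 6.0, 4.0, 12.0]  # illustrative timings with one overrun
disconnected = [15.0, 14.0, 16.0]  # illustrative timings, all overrunning

print(period_is_safe(connected, 10.0))     # average 6.75 ms, under 10 ms
print(period_is_safe(disconnected, 10.0))  # average 15 ms, over 10 ms
print(period_is_safe(disconnected, 20.0))  # same timings with a 20 ms period
```

Note that a single overrun does not matter here; what matters is the average staying below the declared period.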
