In addition to the problem that we and other teams are experiencing with RoboRio lockups that require a reset or power cycle to recover from (under discussion here), we are also experiencing, less frequently, the robot code’s main thread hanging for extended periods (tens of minutes).
The two situations are quite different: in the other case the Driver Station Comms and Robot Code indicators go red and only a reset or power cycle recovers. In this case, the Code and Comms lights stay green but the robot main thread doesn’t do anything. Fortunately in this case the Driver Station Restart Robot Code button works to recover.
I’ve attached gdb to the hung thread and this is what I see:
I’m curious if anyone else is seeing this. It typically occurs when we leave the robot running and disconnect the Driver Station, and the thread may resume running some minutes after we reconnect the Driver Station. Because the intervals are so long, it’s hard to be patient enough to get good timings for them.
I think I’ve figured it out but won’t get a chance to confirm until tomorrow night.
The way TimedRobot schedules loop iterations is to keep adding the declared robot period to the expiration time; this is reasonable if the average loop execution time is less than the declared period, as it means that the code can catch up after occasional overruns. But, if the average loop time exceeds the declared period then the requested wake-up times fall farther and farther in the past. Since the expiration times are (at the hardware level) stored in a 32-bit register, eventually a time that is in the past looks like it is in the future and things hang until that future time is reached.
In our case the declared period is 10 ms. When the Driver Station is connected this is fine, as the average loop execution time is then about 5 ms. But with no Driver Station, and for reasons I don’t understand, the loop execution time jumps to about 15 ms, triggering the situation described above. I think a pretty clear indication that this is the problem would be if you saw it hang and then wake up again spontaneously roughly 36 minutes later. That is half the wrap-around time of the 32-bit microsecond counter: while hung, your loop doesn’t run, so real time catches up to the future expiration at a real-time rate.
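To make the wrap-around concrete, here is a minimal Java sketch of the idea. It is not the actual WPILib or FPGA logic. It assumes the hardware decides whether an alarm is due from the sign of the 32-bit difference between the current time and the expiration time, and the names `WrapDemo`, `looksDue` and `halfWrapMinutes` are made up for illustration.

```java
// Illustrative model only: assumes a 32-bit microsecond counter and a
// signed-difference comparison, which may not match the real hardware.
final class WrapDemo {
    /** Half the wrap-around period of a 32-bit microsecond counter, in minutes. */
    static double halfWrapMinutes() {
        return (1L << 31) / 1e6 / 60.0;
    }

    /** True if the expiration looks due (at or before now) once truncated to 32 bits. */
    static boolean looksDue(long nowMicros, long expirationMicros) {
        int diff = (int) nowMicros - (int) expirationMicros; // wraps modulo 2^32
        return diff >= 0;
    }

    public static void main(String[] args) {
        long now = 3_000_000_000L;                    // arbitrary current time, microseconds
        long oneSecondLate = now - 1_000_000L;        // expiration 1 s in the past
        long tooLate = now - (1L << 31) - 1_000_000L; // just over half a wrap in the past

        // Small lag: the alarm is seen as already due, so the loop runs at once.
        System.out.println(looksDue(now, oneSecondLate));
        // Lag beyond half a wrap: the same past time now looks like the future.
        System.out.println(looksDue(now, tooLate));
        System.out.printf("half wrap = %.1f minutes%n", halfWrapMinutes());
    }
}
```

Under that assumption, a loop that keeps falling behind is harmless until the accumulated lag crosses half the counter range, at which point the next wake-up request is treated as being about half a wrap away.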
@gdefender: this could explain why you saw this problem when the Color Sensor was adding significantly to your loop execution time.
In our case we’ll probably just increase the declared period to 20 ms and be OK. In fact we’d probably be OK in competition anyway, because the robot is never on long enough to reach the point where the past looks like it’s in the future.
We’ll do some experiments and I’ll post the results.
I confirmed this morning that it behaves as expected. With H = half the rollover period of the 32-bit microsecond clock, the robot code hangs after H / ((actual robot code period) / (declared TimedRobot period) - 1) and recovers H later.
Numerically, for the situation I was testing, this is (35.8 minutes)/(~15 ms/10 ms - 1) = ~72 minutes to hang and 35.8 minutes to recover.
Thus the answer to this problem is: make sure that your configured timed robot cycle time is bigger than your average actual cycle time. It is not a bug in the library – it is behaving as designed and expected. We just didn’t have the right expectation!
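One way to check that condition is to track the average loop time and compare it with the declared period. The sketch below is a hypothetical helper, not part of WPILib; the class `LoopTimeMonitor` and its methods are invented for illustration, and it assumes you measure each iteration's duration yourself.

```java
// Hypothetical helper, not part of WPILib: tracks the average loop time
// and reports whether it exceeds the declared period.
final class LoopTimeMonitor {
    private final double declaredPeriodSeconds;
    private double totalSeconds = 0.0;
    private long samples = 0;

    LoopTimeMonitor(double declaredPeriodSeconds) {
        this.declaredPeriodSeconds = declaredPeriodSeconds;
    }

    /** Record how long one loop iteration took, in seconds. */
    void record(double loopSeconds) {
        totalSeconds += loopSeconds;
        samples++;
    }

    /** Mean of all recorded loop times, or 0 if nothing has been recorded. */
    double averageSeconds() {
        return samples == 0 ? 0.0 : totalSeconds / samples;
    }

    /** True if, on average, loops take longer than the declared period. */
    boolean isFallingBehind() {
        return samples > 0 && averageSeconds() > declaredPeriodSeconds;
    }

    public static void main(String[] args) {
        LoopTimeMonitor monitor = new LoopTimeMonitor(0.010); // 10 ms declared period
        for (int i = 0; i < 100; i++) {
            monitor.record(0.015); // pretend every loop takes 15 ms
        }
        System.out.println(monitor.isFallingBehind());
    }
}
```

In a real robot program the samples could come from timestamps taken with `System.nanoTime()` at the start of each `robotPeriodic()` call, and a warning could be logged when `isFallingBehind()` becomes true.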