Attached is our code, along with the patch that ‘fixed’ everything (by removing Vision and NetworkTables). Happily, we did go on to compete well, even putting up our best performance in a 106-point QF match (sadly, we lost the QF due to non-software glitches, but it was still a great weekend!).
I spent a bit more time looking at the code and found one other interesting lead. Every 10 ms, we sent the current speed of our shooter wheel over NetworkTables (whenever it had changed). That seems like the sort of thing that could trigger a problem.
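One way to test this lead without ripping NetworkTables out entirely would be to throttle the publish. Here is a minimal sketch, not the code we ran: a hypothetical `ThrottledPublisher` that forwards a value only when it has moved by more than a deadband and a minimum interval has elapsed since the last send. The `publish` callback stands in for the `PutNumber` call, timestamps are integer milliseconds supplied by the caller, and the sketch uses C++11, so it may need adapting for the cRIO toolchain.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical rate limiter for a NetworkTables-style publish call.
// A value is sent only if (a) nothing has been sent yet, or (b) it differs
// from the last sent value by more than `deadband` AND at least
// `minIntervalMs` has passed since the last send.
class ThrottledPublisher {
public:
    ThrottledPublisher(std::function<void(double)> publish,
                       double deadband, std::int64_t minIntervalMs)
        : publish_(std::move(publish)), deadband_(deadband),
          minIntervalMs_(minIntervalMs) {
        assert(deadband >= 0.0 && minIntervalMs >= 0);
    }

    // Call every control-loop iteration; returns true if the value was sent.
    bool Update(double value, std::int64_t nowMs) {
        if (hasSent_) {
            bool changed = std::fabs(value - lastValue_) > deadband_;
            bool due = (nowMs - lastSentMs_) >= minIntervalMs_;
            if (!changed || !due) return false;
        }
        publish_(value);
        lastValue_ = value;
        lastSentMs_ = nowMs;
        hasSent_ = true;
        return true;
    }

private:
    std::function<void(double)> publish_;
    double deadband_;
    std::int64_t minIntervalMs_;
    double lastValue_ = 0.0;
    std::int64_t lastSentMs_ = 0;
    bool hasSent_ = false;
};
```

With a 100 ms minimum interval, calling `Update(shootEncoder.GetRate(), nowMs)` from the 10 ms loop (with the `PutNumber` inside the callback) would cut the traffic from roughly 100 puts per second to at most 10. The interval and deadband values are arbitrary assumptions to tune.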
So, the summary as I understand it is as follows:
We have 4 or 5 running threads: 2 for vision, 1 for the Driver Station (actively looping inside the Run() method of the DS code, hence the log updates), a main thread running inside our OperatorConsole() method, and possibly a PID controller thread for our shooter wheel.
The log suggests that the vision and DS threads keep working the whole time. At the failure point, the main thread appears to die. Probably as a side effect, the DS thread then triggers the motor safety code, which shuts down the motors.
There is one known bug: in a target-rich environment, the vision code consumes enough CPU that a mistaken Wait(0.04) in the shooting code can push our loop past the 0.10 s motor safety timeout and generate a motor safety error.
My initial guess (and hence this thread title) was that the FMS system would shut down our robot on detecting a motor safety error; that guess appears to be quite wrong.
Occam’s razor suggests that the most likely explanation is a bug in our OperatorConsole() code that crashes our main thread. I cannot for the life of me find any such bug.
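Both the Wait(0.04) bug and a dying main thread come down to the same watchdog arithmetic. Below is a simplified, hypothetical model, not WPILib's actual motor safety implementation: the control loop feeds a watchdog, and a separate checker (the DS thread, in our case) treats the motors as expired once the gap since the last feed exceeds the expiration. Only the 40 ms Wait and the 100 ms timeout come from our code; the 20 ms loop body and the starvation figures are illustrative assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Simplified, hypothetical model of a motor-safety watchdog (not WPILib's
// implementation). The control loop calls Feed(); a separate checker calls
// Expired() and cuts the motors once the gap since the last feed exceeds
// the expiration time.
class SafetyWatchdog {
public:
    explicit SafetyWatchdog(std::int64_t expirationMs) : expirationMs_(expirationMs) {
        assert(expirationMs > 0);
    }

    void Feed(std::int64_t nowMs) { lastFeedMs_ = nowMs; }

    bool Expired(std::int64_t nowMs) const {
        return nowMs - lastFeedMs_ > expirationMs_;
    }

private:
    std::int64_t expirationMs_;
    std::int64_t lastFeedMs_ = 0;
};

// Worst-case gap between two feeds in one loop iteration: the loop body itself,
// any blocking Wait() inside it, and CPU time lost to other threads (vision).
std::int64_t WorstCaseGapMs(std::int64_t loopBodyMs, std::int64_t blockingWaitMs,
                            std::int64_t starvationMs) {
    return loopBodyMs + blockingWaitMs + starvationMs;
}
```

With a 20 ms loop body plus the 40 ms Wait, the loop has 40 ms of slack; if vision steals more than that in a single iteration, the gap passes 100 ms and the motors are cut. If the main thread dies, Feed() is never called again, so the check trips about 100 ms later no matter what the other threads are doing.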
If I range further afield, my next hunch is that this line of code, called every 10 ms, can lead to a crash of some kind:
distanceTable->PutNumber("speed", shootEncoder.GetRate());
Perhaps we couldn’t reproduce it because the FMS network conditions are subtly different. Perhaps we couldn’t reproduce it because we tended not to run the shooter wheel continuously during testing (the put only happens while the shooter wheel is running). Perhaps it only happens when we’re making PutNumber calls while another thread is calling setErrorData. (I don’t have the source code for setErrorData, so I can’t examine that possibility.)
The only other faint leads I have are these. We did hit a bug where, if you first did a PutNumber on a name and later did a PutBoolean with that same name, you would get a crash. We also had a superstition that doing too many PutNumbers in a row would lead to a crash. We were never able to confirm it, though, and it got muddled with the Boolean/Number crash and with the mandatory C++ update that was supposed to fix all kinds of NetworkTables crashes.
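If the Boolean/Number crash is still lurking, one defensive option is to check types on our side before calling into NetworkTables. A minimal sketch, assuming a hypothetical `TypedKeyGuard` consulted before each PutNumber/PutBoolean: it records the type first written under each key and refuses a later write of a different type. The mutex only makes the guard's own map safe to use from several threads; it does not serialize the NetworkTables calls themselves. Being C++11, it may need adapting for the cRIO toolchain.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Hypothetical helper, not part of WPILib: remembers which type was first
// written under each key and rejects later writes of a different type.
class TypedKeyGuard {
public:
    enum class Kind { Number, Boolean };

    // Returns true if writing a value of `kind` under `key` is allowed.
    // The first write to a key fixes its type for the rest of the run.
    bool Allow(const std::string& key, Kind kind) {
        assert(!key.empty());
        std::lock_guard<std::mutex> lock(mutex_);  // protects kinds_ only
        auto it = kinds_.find(key);
        if (it == kinds_.end()) {
            kinds_.emplace(key, kind);
            return true;
        }
        return it->second == kind;
    }

private:
    std::mutex mutex_;
    std::map<std::string, Kind> kinds_;
};
```

In the robot code, each put would then become something like `if (guard.Allow("speed", TypedKeyGuard::Kind::Number)) distanceTable->PutNumber("speed", rate);`, so a mismatched write is dropped (and could be logged) instead of reaching the library.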
At any rate, thanks for listening. And thanks to Google for archiving all of these thoughts, on the off chance that they might help some other poor soul <grin>.
Cheers,
Jeremy
team2823-code-2013.zip (25.3 KB)