Robot enters stateless "state" on official FMS

In each of the last two finals we’ve played this season, our robot has had control issues in one of the matches. If you watched the PNW District Championship in Portland last weekend (shoutout to 360 and the event staff for allowing that replay of the second match), you probably saw the symptoms: the robot started the match off fine, but a few seconds into auto it stopped moving and didn’t start again until close to the end of the match. Looking at the driver station logs, I see that the robot was in auto as it should have been, but then it just wasn’t. Not disabled, just stateless. Our driver noted that according to the driver station, we still had comms and the code was still running and enabled. After he rebooted the RoboRIO, everything was fine and we played the last few seconds of the match.

We ran into a similar problem in the Philomath finals. The robot made it through autonomous, but a bit into teleop we lost code and the robot drove itself into the low bar (we suspect that the Talons were just running using their last input). Just like in Portland, the robot code had somehow lost its state.

We’ve been trying to figure out what the problem could be, but we’re at a loss. Typically, when the robot code crashes, its handler will restart it fairly quickly; we don’t see that happening here, so we don’t think it’s a run-of-the-mill crash. The logs, from both the driver station and our own logging, don’t indicate that the program crashed either. We know we didn’t get an errant disable signal from the field, because we would have seen that state change. Does anyone have any ideas as to what could be happening here? The logs in question are attached.
A few notes:

  • The RoboRIO continues running after the code stops and our round trip times stay fairly low.
  • The errors in the Portland log that occur near the crash are coincidental. Similar errors show up in most of our logs, and in particular they don’t show up before the crash in Philomath.
  • Some of the messages in the Portland log got a bit polluted by a stray “cout” that made it into the program.

    logs.zip (125 KB)

For anyone wanting to watch the match, click here for the stream feed and here for the full field feed (thank you so much FIRSTWA!)

Mind if I ask what programming language you are using? That could be part of the issue.

We’re using C++ with the WPILib command-based model.

A very similar thing happened to us in our final four matches at AZW last weekend. The matches started fine, then during auto the robot stopped; about 50 seconds later control was reestablished.

NI has our log files and is looking at them, but from what the CSA told us, the roboRIO said the radio was lost, and then when it came back up, the code restarted.

I sent you a message on Facebook, but I might have an idea on what caused the crash at Portland. Still looking into the one at Philomath.

The one at Portland looks like the DriverStation thread on the RoboRIO locked up. If that happened, comms and robot code would still be reported; however, the robot would stop sending requests saying it is enabled, and the main loop would not continue to run. I have a guess on where in WPILib that would be, but I would like to see your code to double check.

What would indicate to the communications process that robot code was running if the DS thread was locked and not calling into the communications process anymore?

Actually, I had it backwards. If user code freezes but the DS loop stays running, those would be the symptoms. The DS loop would flag the comms saying the code is running; however, when using IterativeRobot, the calls that report which mode the robot is in are made in the main user code loop. So if the main user thread froze, or got stuck in a blocking call, the DS would still see comms and code, but the mode report calls wouldn’t happen and no movements would be responded to, since no main loop is running.
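To make that concrete, here is a contrived sketch of that failure mode (assuming the 2016 WPILib C++ IterativeRobot API; the lock and the second thread are made up for illustration, not your actual code):

    #include "WPILib.h"
    #include <mutex>

    class Robot : public IterativeRobot {
        std::mutex m_visionLock;  // hypothetical lock shared with a vision thread

    public:
        void AutonomousPeriodic() override {
            // If another thread holds m_visionLock and never releases it,
            // this line blocks forever. The main user loop stops here, so
            // the per-mode report calls that normally follow never run,
            // while the separate DS thread keeps reporting comms + code.
            std::lock_guard<std::mutex> guard(m_visionLock);
            // ... drive commands would go here ...
        }
    };

    START_ROBOT_CLASS(Robot)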

Thad helped us track down a potential cause of a deadlock in our autonomous code that may have been the issue in Portland, but we’re still trying to figure out what led to the initial failure in Philomath. We’d be happy to hear from anyone who comes up with something.
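For anyone curious, the general shape of that kind of deadlock is two threads taking the same pair of locks in opposite order. This is a generic illustration, not our actual autonomous code:

    #include <mutex>

    std::mutex driveLock, sensorLock;

    void AutonomousTask() {                     // hypothetical robot-thread code
        std::lock_guard<std::mutex> a(driveLock);
        // If the other thread already holds sensorLock and is waiting on
        // driveLock, both threads block here forever.
        std::lock_guard<std::mutex> b(sensorLock);
    }

    void VisionTask() {                         // hypothetical background thread
        std::lock_guard<std::mutex> a(sensorLock);
        std::lock_guard<std::mutex> b(driveLock);   // opposite order -> deadlock
    }

    // One fix: acquire both locks atomically so the order no longer matters.
    void AutonomousTaskFixed() {
        std::lock(driveLock, sensorLock);
        std::lock_guard<std::mutex> a(driveLock, std::adopt_lock);
        std::lock_guard<std::mutex> b(sensorLock, std::adopt_lock);
    }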

The traces for auto, teleOp, and disabled are reported as your code responds to the auto, teleOp, or disabled state of the control packet.

Apparently, as Thad was saying, that thread was no longer making that call. This could be because of logic, sleeps, infinite loops, etc.

In the Philomath log, the CPU usage drops to background levels when this happens. Your disabled code CPU usage is 50%. Your auto was 67%, then ~51%. TeleOp was 66%. Then it drops to 15% when the traces stop. About 10 seconds later, the driver station starts to request code restarts and finally a reboot. The roboRIO and user process come back up and go back to that state of low CPU usage. More code restart requests, and another roboRIO reboot request. When the code finally came back during disabled, the CPU usage looks like it matches the mode.

During TeleOp, my guess is that the code had crashed or was locked waiting for a mutex that was never released.
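For anyone reading along, a generic example of the "never released" case is a manual lock()/unlock() pair with an early return between them; an RAII guard avoids it. This is illustrative only, not the team's actual code:

    #include <mutex>

    std::mutex stateLock;

    void UpdateBad(bool error) {
        stateLock.lock();
        if (error) {
            return;            // early return skips unlock(); the next caller
        }                      // of UpdateBad() blocks forever
        stateLock.unlock();
    }

    void UpdateGood(bool error) {
        std::lock_guard<std::mutex> guard(stateLock);  // released on every exit path
        if (error) {
            return;
        }
    }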

The Portland log shows the same pattern: disabled at 50%, auto at 61% and then 51%. But when teleOp starts, that doesn’t change like it does in the other log file. It appears that the code doesn’t call the trace-instrumented function and remains in auto. The CPU stays at 51% until the code restart. This is quite different from the other log, since it wasn’t tracing but still seemed to be running auto.

When the code comes back from the code restart request, the CPU usage is at 70%, but I don’t see any indication that the mode function was called. The roboRIO was then rebooted, and when it comes back up the CPU is at 70% and getMode is being called normally.

At this point, there are no messages in the logs that definitively log a crash of a C++ program or Java program. I hope we can improve that. Hopefully this additional info helps you identify anything that may be contributing to this.

Greg McKaskle

The weird thing I noticed in the Philomath log is that the DS stopped sending enabled packets and was sending disabled packets; however, the motors were clearly still turning, as that’s the only way the battery voltage could be that low. I don’t have an explanation for how that happened, as I always thought that if the DS was sending disabled packets, the motor controller outputs would be disabled. They use CAN Talons, so I wonder if something happened there.

We had these exact symptoms during two consecutive matches at our DC district.

The core issue was that we enabled our custom comms code for the first time in a live match. We attempted a connection to a socket (RIO->D/S) but didn’t change the default timeout on the connection (2 minutes). When the robot attempted to connect at the beginning of auton and didn’t connect right away, the connecting thread would block. Depending on where the code that calls the connection lives, this could block the entire robot thread (and did for us at first). For reference, we use Java.
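The safer pattern, sketched here in C++ to match the rest of the thread (the helper name is made up), is to keep any potentially slow connect off the thread that runs the robot loop:

    #include <atomic>
    #include <thread>

    // Hypothetical stand-in for the blocking connect to the dashboard.
    bool ConnectToDashboard() {
        // ... blocking socket connect() would live here ...
        return false;
    }

    std::atomic<bool> dashboardConnected{false};

    void StartCommsInBackground() {
        // Run the connect on its own thread so that even a two-minute
        // default timeout can never stall the loop that reports the
        // robot's mode to the driver station.
        std::thread([] {
            dashboardConnected = ConnectToDashboard();
        }).detach();
    }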

We never actually solved why the robot wouldn’t connect to the D/S through the FMS using our custom comms on the correct ports - we simply moved in a different direction after hitting a certain level of frustration.

Are any of you running vision code off of the RoboRIO? That could be taking resources from the main code.

We had a weird problem during a few of our matches where the roboRIO had 99.99% CPU usage and we had VERY jerky control of our robot - probably a second or two of delay. We had no vision code running, but we did have the OpenCV libraries included but NOT in use. We did not experience this problem every match, but it was often enough to be a concern. Eventually, the problem just went away and never came back. This was after going through the code line by line (C++ command based, BTW) many, many times looking for any sort of resource leaks, and finding none.