Fatal Robot Code Crash

We had our robot sitting on a bench, running idle for 10-15 minutes, when it crashed out of nowhere. I find it a little concerning because this looks like a C++ error (maybe coming through the JNI), but we are a Java team.

Error Message from Driver Station Console (no more data past this):
terminate called after throwing an instance of 'std::bad_alloc'
 what(): std::bad_alloc
********** Robot program starting **********
NT: Listening on NT3 port 1735, NT4 port 5810
NT: Got a NT4 connection from 10.28.32.217 port 51941

Supporting Libraries: Java, WPILib new command framework, REVLib, Phoenix, REV 2m Distance Sensor (a separate JAR from REVLib)

I don’t know what I can really provide to help diagnose it further…

1 Like

That means it’s running out of memory on the native side. Are you sending a lot of data, doing image processing, writing a lot of text output to the console, etc? Had it been a while since you restarted the Rio?

The Rio had probably been live for 2 hours, but we did deploy multiple code versions while it was running.

We are doing some IO, but I don’t think it is excessive. We have roughly 200 SmartDashboard channels (many updating every 20 ms), and we write them to a flash drive with DataLog. We were also reading the PDH currents every loop (all 23 channels) and would get a periodic (once every ~15 sec) “ERROR  -1154  HAL: CAN Receive has Timed Out”. After this we slowed the reads down to every other loop (sketched below), and the 1154 is basically gone now. I try to stop every text output possible from going to the dash.

CAN utilization (using an average function) is roughly 75%, and our normal loop time in periodic is ~4 ms. I don’t think we are allocating much memory while running (beyond what the swerve drive generates).
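
For anyone curious, a minimal sketch of that every-other-loop PDH read (the CAN ID, log entry name, and class name here are placeholders, not our exact code):

```java
import edu.wpi.first.util.datalog.DoubleArrayLogEntry;
import edu.wpi.first.wpilibj.DataLogManager;
import edu.wpi.first.wpilibj.PowerDistribution;

public class PdhCurrentLogger {
  // CAN ID 1 and the REV module type are assumptions for this example.
  private final PowerDistribution pdh =
      new PowerDistribution(1, PowerDistribution.ModuleType.kRev);
  private final DoubleArrayLogEntry currentsEntry =
      new DoubleArrayLogEntry(DataLogManager.getLog(), "/pdh/currents");
  private int loopCount = 0;

  /** Call from robotPeriodic(); only polls the PDH on every other 20 ms loop. */
  public void periodic() {
    if (loopCount++ % 2 == 0) {
      // getAllCurrents() returns the currents for all channels as one array.
      currentsEntry.append(pdh.getAllCurrents());
    }
  }
}
```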

I’ve seen cases where other system processes (in particular the NI system web server) have memory leaks that show up over time. If you’ve only seen this crash once so far, I wouldn’t worry about it.

Does this error only happen if the robot is on for over 10 minutes? If so, I wouldn’t worry about it, because it won’t be on that long in comp.

I’ve seen some pretty long field delays that have robots sitting on the field powered on. Add the match time and there is a half-decent chance that 6 robots at an event experience 10+ minute uptimes.

Not saying it is common for delays to run 7 minutes or more before the match, but it’s not unheard of.

Now, how one prioritizes fixing something with only a very small likelihood of occurring is up to the team and how risk-averse they are.

Source: self, not that I have sat there with a stopwatch…

1 Like

I’m not sure if this is officially documented, but in my experience this has been an issue for years. I first ran into it in 2017, it happened to me off and on as a student through 2020, and it’s happening to us on 6328 periodically as well. The cause is always the same: multiple code deploys without rebooting the Rio will eventually trigger this for us; it’s just a question of how many. It seems like this issue has been dismissed every time it’s brought up, but given that I’ve personally seen it across three teams, I highly doubt it is a user code issue.

6 Likes

Same here for us today on 1591:

1 Like

Is this issue specific to OpenJDK 17?

Also, make sure you update to the latest WPILib. We’ve fixed at least one memory corruption bug in the latest release (2023.4.1).

3 Likes

Thanks for all that you and the WPILib team do. We had a few issues today with this same memory crash, but we were on 2023.3.2. Hopefully the update solves these.

Is this unresolved? I’m on the latest Rio and WPILib versions, but just experienced this crash. This is the first time I’ve ever seen it.

We also had a few crashes due to std::bad_alloc in the last week or so, including once during a match and twice during testing last night. Last night’s crashes came within a few minutes of freshly turning on the robot.

If you have a reproducible native crash, please see WPILib 2023.4.2 Release - #6 by Peter_Johnson for instructions on how to collect a core dump. I’ll also need to know what versions of the libraries you’re using (at least the WPILib version in build.gradle and the vendor deps, preferably the whole project).

Unfortunately, the std::bad_alloc error can be very challenging to root cause even with a core dump, because unless the memory allocation request itself is corrupted in some way, the error simply means the Rio is running out of memory, and there aren’t good tools available for finding the source of the leak from a core dump alone.

2 Likes

There’s no reliable way to reproduce it, but we can turn on core dumps and hope it happens next time we’re testing (it seems to happen more often when the robot’s been on for a while). Thanks for the quick response!

We had a match this past weekend where our robot stopped responding for about 10 seconds and then went through robotInit() again. It stayed enabled and our drivers said that they never lost comms. When looking at the DS log, we found a std::bad_alloc error right before the restart. We were also getting a ton of command scheduler overrun errors. I dug through the code and found that the programmers were binding new commands to buttons in the robotPeriodic() method. My suspicion is that we were running out of memory due to new Commands being created every 20 ms. We moved those lines of code to teleopInit(), and now our command scheduler overrun errors are gone; I’m hoping our running-out-of-memory problems are gone too. I’d check your periodic methods to see if you are doing something similar that doesn’t need to be repeated every 20 ms. Are you seeing any other errors while it’s running?
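
For anyone hitting the same symptom, the key is that trigger bindings should only ever run once. A minimal sketch of doing the binding in one-time setup instead of a periodic method (the controller port and the placeholder command are illustrative):

```java
import edu.wpi.first.wpilibj2.command.Commands;
import edu.wpi.first.wpilibj2.command.button.CommandXboxController;

public class RobotContainer {
  private final CommandXboxController driver = new CommandXboxController(0);

  public RobotContainer() {
    configureBindings(); // runs exactly once, when the container is constructed
  }

  private void configureBindings() {
    // Each onTrue() call registers a new binding (and here builds a new command
    // object), so calling this from robotPeriodic() would add another copy every
    // 20 ms loop until the heap fills up.
    driver.a().onTrue(Commands.print("A button pressed"));
  }
}
```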

2 Likes

Interesting. We aren’t binding any new commands in any **Periodic() methods, but we’ll certainly look out for anything like that.

I’d check the periodic methods in Robot.java and in all your subsystems to see if there are any instances of creating new objects. We got lucky in that the process of creating the new commands took a bit of time, so it was throwing the scheduler overrun warnings; otherwise I’m not sure how long I would have had to look for the problem.
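
As an illustration of the kind of thing to look for, here is a sketch of keeping a subsystem’s periodic() allocation-free by building any reusable command once as a field (the subsystem and method names are made up):

```java
import edu.wpi.first.wpilibj2.command.Command;
import edu.wpi.first.wpilibj2.command.Commands;
import edu.wpi.first.wpilibj2.command.SubsystemBase;

public class IntakeSubsystem extends SubsystemBase {
  // Build the command once and hand out the same instance, rather than
  // constructing a new Command every time it's needed inside a periodic method.
  private final Command stopCommand = Commands.runOnce(this::stop, this);

  public Command getStopCommand() {
    return stopCommand;
  }

  private void stop() {
    // hypothetical hardware call
  }

  @Override
  public void periodic() {
    // Keep this method free of `new`: read sensors and publish telemetry here,
    // but don't create Commands, Triggers, or listeners every 20 ms.
  }
}
```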

2 Likes

Echoing the same thing as Kyle. We’ll try to grab the core dump next time it happens, but we’ve been getting it about once a week all season, on the latest version of everything. Note we are exclusively on RIO 1s.

1 Like

I just ran the simulation on my laptop and took a memory snapshot from IntelliJ; it turns out we had a LOT (20k+) of PathPlannerState objects on our heap, mainly from autos we didn’t choose. I did a local hack where I nulled out the extra autos in teleopInit(); we’ll have to try that tonight at the lab.
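
In case it helps anyone else, a sketch of what that “null out the unused autos” hack could look like (the map, field names, and the call from teleopInit() are assumptions for illustration, not the poster’s actual code):

```java
import java.util.HashMap;
import java.util.Map;
import edu.wpi.first.wpilibj2.command.Command;

public class AutoSelector {
  // Hypothetical: every pre-built auto, each holding PathPlanner trajectory states.
  private Map<String, Command> autoRoutines = new HashMap<>();
  private Command selectedAuto;

  public Command getSelected(String name) {
    selectedAuto = autoRoutines.get(name);
    return selectedAuto;
  }

  /**
   * Call from teleopInit(): drop the autos that never ran so the trajectory
   * states they hold become eligible for garbage collection.
   */
  public void releaseUnusedAutos() {
    autoRoutines = null;
  }
}
```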

2 Likes