Our robot had been sitting on a bench, idling, for about 10-15 minutes when the robot program crashed at random. I find it a little concerning because this looks like a C++ error (maybe through the JNI), but we are a Java team.
Error Message from Driver Station Console (no more data past this):
terminate called after throwing an instance of ‘std::bad_alloc’
********** Robot program starting **********
NT: Listening on NT3 port 1735, NT4 port 5810
NT: Got a NT4 connection from 10.28.32.217 port 51941
Supporting Libraries: Java, WPILib New Command, REVLib, Phoenix, REV 2m Distance Sensor (a separate JAR from REVLib)
I don’t know what I can really provide to help diagnose it further…
That means it’s running out of memory on the native side. Are you sending a lot of data, doing image processing, writing a lot of text output to the console, etc? Had it been a while since you restarted the Rio?
The Rio had probably been live for 2 hours, but we did deploy multiple code versions while it was running.
We are doing some IO, but I don’t think it is excessive? We probably have 200 SmartDashboard channels (many updating every 20 ms), and we are writing them to a flash drive with the DataLog. We were also reading the PDH currents every loop (all 23 channels) and would get a periodic (once every ~15 sec) “ERROR -1154 HAL: CAN Receive has Timed Out”. We slowed this down to every other loop after this, and the 1154 is basically gone now. I try to stop as much text output as possible from going to the dash.
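For anyone wanting to try the same "every other loop" change, here is a minimal sketch of one way to do it. `LoopDecimator` is a hypothetical helper name, not a WPILib class, and the actual PDH read is left out; the idea is just a counter that lets an expensive read run once every N iterations.

```java
/** Hypothetical helper (not part of WPILib): lets a task run only every Nth loop. */
class LoopDecimator {
    private final int period;
    private int count = 0;

    LoopDecimator(int period) {
        if (period < 1) {
            throw new IllegalArgumentException("period must be >= 1");
        }
        this.period = period;
    }

    /** Returns true on the first call and then on every period-th call after it. */
    boolean shouldRun() {
        boolean run = (count == 0);
        count = (count + 1) % period;
        return run;
    }
}
```

In a periodic method this would look like `if (pdhDecimator.shouldRun()) { ... }` around the current reads, with `new LoopDecimator(2)` created once as a field rather than inside the loop.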
CAN utilization (using an average function) is roughly 75%, and our normal loop time in Periodic is ~4 ms. I don’t think we are allocating much memory while running (beyond what swerve drive generates).
I’m not sure if this is officially documented, but in my experience this has been an issue for years. I first ran into this in 2017 and it happened to me off and on as a student through 2020, and it’s happening to us on 6328 periodically as well. The cause is always the same: multiple code deploys without rebooting the Rio will always eventually trigger this for us, it’s just a question of how many. It seems like this issue has been dismissed every time it’s brought up, but given I’ve seen it across 3 teams personally I highly doubt it is a user code issue.
We also had a few crashes due to std::bad_alloc in the last week or so, including once during a match and twice during testing last night. During testing last night it was within a few minutes of freshly turning on the robot.
If you have a reproducible native crash, please see WPILib 2023.4.2 Release - #6 by Peter_Johnson for instructions on how to collect a core dump. I’ll also need to know what version of the libraries you’re using (at least the WPILib version in build.gradle and the vendor deps, preferably the whole project).
Unfortunately, the std::bad_alloc error can be very challenging to root cause even with a core dump, because unless the memory allocation request itself is corrupted in some way, the error simply means the Rio is running out of memory, and there aren’t good tools available for finding the source of the leak from a core dump alone.
There’s no reliable way to reproduce, but we can turn on core dump and hope it happens next time we’re testing (seems to happen more often when the robot’s been on for a while). Thanks for the quick response!
We had a match this past weekend where our robot stopped responding for about 10 seconds and then went through robotInit() again. It stayed enabled and our drivers said that they never lost comms. When looking at the DS log, we found a std::bad_alloc error right before the restart. We were also getting a ton of command scheduler overrun errors. I dug through the code and found that the programmers were binding new commands to buttons in the robotPeriodic() method. My suspicion is that we were running out of memory because new Commands were being created every 20 ms. We moved those lines of code to teleopInit(), and now our command scheduler overrun errors are gone; I’m hoping that our out-of-memory problems are gone too. I’d check your periodic methods to see if you are doing something similar that doesn’t need to be repeated every 20 ms. Are you seeing any other errors while it’s running?
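To illustrate the pattern described above, here is a minimal plain-Java sketch. It does not use WPILib: `FakeButton` and `BindingDemo` are hypothetical stand-ins that only mimic the idea of a button keeping a list of everything bound to it, so treat it as a picture of the failure mode rather than the real scheduler's behavior.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a button that remembers every command bound to it. */
class FakeButton {
    private final List<Runnable> bindings = new ArrayList<>();

    void onTrue(Runnable command) {
        bindings.add(command);
    }

    int bindingCount() {
        return bindings.size();
    }
}

class BindingDemo {
    /** Anti-pattern: the binding call sits inside the periodic loop. */
    static int bindEveryLoop(int loops) {
        FakeButton button = new FakeButton();
        for (int i = 0; i < loops; i++) {
            button.onTrue(() -> {}); // adds one more binding on every iteration
        }
        return button.bindingCount();
    }

    /** Fix: bind once up front, then only do periodic work in the loop. */
    static int bindOnce(int loops) {
        FakeButton button = new FakeButton();
        button.onTrue(() -> {}); // one-time setup
        for (int i = 0; i < loops; i++) {
            // periodic work only; no new bindings here
        }
        return button.bindingCount();
    }
}
```

At a 20 ms loop period that is 50 iterations per second, so in this sketch `bindEveryLoop` would accumulate 3000 bindings after one minute while `bindOnce` stays at one.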
I’d check your periodic methods in Robot.java and the periodic methods in all your subsystems to see if there are any instances of creating new objects. We got lucky in that creating the new commands took a bit of time, so it was throwing the scheduler overrun warnings; otherwise I’m not sure how long I would have had to look for the problem.
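One rough way to spot this kind of per-loop object creation is to watch how the used Java heap changes over time. The sketch below uses only the standard `Runtime` API; `HeapWatcher` is a hypothetical helper name. Note the assumption: it only sees the Java heap, so it cannot detect a leak on the native side, and garbage collection makes individual readings noisy. A steady upward trend over many minutes is the signal to look for.

```java
/** Hypothetical helper: reports used Java heap and its growth since construction. */
class HeapWatcher {
    private final long baselineBytes;

    HeapWatcher() {
        baselineBytes = usedBytes();
    }

    /** Used Java heap in bytes (allocated heap minus free space within it). */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Change in used heap since this watcher was created; may be negative after a GC. */
    long growthBytes() {
        return usedBytes() - baselineBytes;
    }
}
```

Creating one `HeapWatcher` at startup and logging `growthBytes()` every few seconds (not every loop) would be enough to see a trend.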
Echoing the same thing as Kyle. Will try to grab the core dump next time it happens, but we’ve been getting it about once a week all season. On the latest version of everything. Note we are on exclusively RIO 1s.
I just ran simulation on my laptop and took a memory snapshot from IntelliJ. It turns out we had a LOT (20k+) of PathPlannerState objects on our heap, mainly from autos we didn’t choose. I did a local hack where I nulled out the extra autos in teleopInit; we’ll have to try that tonight at the lab.
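An alternative to nulling out the unused autos afterwards is to not build them until one is actually chosen. The sketch below is plain Java with a hypothetical `LazyAutoRegistry` class; it does not use the PathPlanner or WPILib APIs. It stores a `Supplier` per auto name and only invokes the one that is requested, so the trajectory states for the others are never created.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical registry that stores how to build each auto, not the built auto itself. */
class LazyAutoRegistry<T> {
    private final Map<String, Supplier<T>> builders = new LinkedHashMap<>();
    private int builtCount = 0;

    /** Registers a recipe for an auto; nothing is constructed yet. */
    void register(String name, Supplier<T> builder) {
        builders.put(name, builder);
    }

    /** Builds only the requested auto; returns null if the name is unknown. */
    T build(String name) {
        Supplier<T> builder = builders.get(name);
        if (builder == null) {
            return null;
        }
        builtCount++;
        return builder.get();
    }

    /** Number of autos that have actually been built so far. */
    int builtCount() {
        return builtCount;
    }
}
```

The trade-off is that the build cost moves to the moment the auto is selected, so if path generation is slow it may still be worth building the chosen auto before the match starts rather than at the start of autonomous.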