Consistent insufficient memory error

Our robot is consistently failing with an insufficient memory error. We are using the same code base as the last two years, just upgraded to WPILib 2024, Phoenix 6, and RevLib v2024.2.1. With just our Swerve Drive Subsystem enabled, our code errors out every time with an insufficient memory error immediately after robotInit() finishes. The robot then proceeds to reboot.

I tried to get a core dump by following these instructions from @Peter_Johnson shown here:

ssh into robot
/usr/local/frc/bin/frcKillRobot.sh -t
ulimit -c unlimited
./robotCommand

The program ran and ended with:

16:55:38.089 INFO  frc.team3838.robot.Robot       : ~ ROBOT BOOT UP COMPLETED ~
OpenJDK Client VM warning: INFO: os::commit_memory(0xb04b6000, 3391488, 0) failed; error='Not enough space' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 3391488 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/lvuser/hs_err_pid2404.log

But it failed to create a core dump. I’ve attached the hs_err_pid log below. I’ve tried several times to get a core dump, but it always ends the same way.
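
Next time I try, I'll also verify in that same ssh session that the core limit actually took and where the kernel would write a core file. These are just standard Linux checks, nothing roboRIO-specific, so hopefully they apply to the Rio image as-is:

ulimit -c
cat /proc/sys/kernel/core_pattern

The first should print "unlimited" after the ulimit command above; the second shows the path/pattern the kernel would use when writing a core dump.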

I’ll also note that occasionally we get the error

terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable

When that happens, the code terminates.

We are getting these insufficient memory errors on two different robots (our test bot and our competition bot). As I mentioned, this is the same code base we’ve used for years, and currently we have no Subsystems other than the Drivetrain. So the main differences are the new library versions and Java 17.

hs_err_pid2404.zip (11.1 KB)

Our versions:
DS: 24.0.1
RIO: FRC_roboRIO_2024_v2.2
Lib: Java 2024.2.1
Pigeon 2: 24.1.0.0 (Phoenix 6)
PCM: 1.65 (Phoenix 5)
CANCoder vers. H: 24.1.0.0 (Phoenix 6)
RevLib: 2024.2.1

Any help is appreciated.

Our team was having memory issues, and came to the realization that our code now uses more memory than a RIO can hold. We ran tests on the roboRIO 2 and the errors immediately went away. Since we were overtaxing the roboRIO, we commented out most of the code, and it now works well for a good while. We are upgrading to the roboRIO 2 almost solely for its larger memory.

Yeah, I’m worried that it might be a cumulative thing where everything wants just a little bit more memory each year, and this year we crossed a threshold. It scares me that I’m crossing it with just our swerve drive and no other Subsystems. I’m not doing anything that should be memory intensive, but I’ll look. Moving to a roboRIO 2.0 would be nice, but we just have not been able to get one; they are always out of stock. Hoping others might have some other ideas. :crossed_fingers:

One thing to try is disabling the Rio webserver to free up some memory. This can be done using the WPILib RoboRIO Team Number Setter tool.
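
If you'd rather do it from an ssh session, something like the following should also work. This is a sketch: I'm assuming the NI web server still runs as the systemWebServer init script, and the Team Number Setter tool remains the supported route. Replace TEAM with your team number.

ssh admin@roboRIO-TEAM-frc.local
/etc/init.d/systemWebServer stop
update-rc.d -f systemWebServer remove
sync

The stop takes effect immediately; the update-rc.d line keeps the web server from starting again on future boots.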

3+ MB seems like a lot to try to allocate at once in a robot program. What are you doing in your robotInit? How do you know it’s failing immediately after robotInit finishes?

What’s weird is this in the hs_err log file (thanks for posting that):

MemFree:           16532 kB
MemAvailable:     107768 kB

So there’s plenty of memory available, and even enough RAM free, to satisfy that allocation request.

It sounds like the memory fragmentation issue discussed in this thread:

Based on other responses in that thread, try adding overcommit_memory=1 to /etc/sysctl.conf?
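
For reference, over ssh that would look something like the following (a sketch; the fully qualified sysctl key is vm.overcommit_memory, and it needs either a reboot or a sysctl reload before it applies):

echo "vm.overcommit_memory=1" >> /etc/sysctl.conf
sysctl -p
cat /proc/sys/vm/overcommit_memory

The first line persists the setting, the second reloads /etc/sysctl.conf so it takes effect without a reboot, and the third should print 1 if it took.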

THANKS everyone for the responses.

Thanks, I’ll give that a try on Tuesday (our next meeting)

We init our subsystems and auto command, although so far this year I have just the swerve drive subsystem: no other subsystems and no auto command yet. The thing is, I am not doing anything now that I have not done for the past 12 years. I use the same core Command Robot code every year. We recently switched to swerve drive, but we have not had this issue with it the last two years. So I am not doing anything different in our robotInit that I have not done for many years. I do use a logger rather than System.out.println, but I’ve been using that since the cRIO days. I’m going to hook up a profiler to see how much memory that might take, or if I can see anything else that is eating up memory. But as I said, what’s frustrating is that this is a smaller program than last year, since I do not have any other subsystems coded yet this year other than the swerve drive, and we had some pretty robust subsystems last year.

As for the 3 MB allocation, given the timing, I really don’t think it is anything I’m directly constructing/initializing that is attempting to allocate the memory. It seems like something in the background. Or possibly, as noted in the next comment, something to do with GC.

I log a "ROBOT BOOT UP COMPLETED" message as the last thing in robotInit. There is then a 1 or 2 second pause after that message is logged, and then the insufficient memory error occurs. So while the IO/network latency of the logging showing up in the net console could be causing me to misinterpret when the memory error is happening, it has been very consistent in happening that way every time over a couple dozen runs. This morning I was starting to wonder whether, given the pause in operation after robotInit, the GC is triggering and something is happening there. Given the change from Java 11 to 17, and the change in GC algorithms, it’s just a passing thought; it is one of the only "what is different" things I can think of.

My pleasure. Despite working in Java for 25 years, I never learned how to read an hs_err log file. Always been able to resolve the few memory issues I’ve ever faced via a profiler. And I’m a backend server guy.

Thanks. I’ll take a look at that and try that setting on Tuesday.

Additional Question

One other thing I will ask about is that I also get a "warning" message about the Phoenix Server being up and running. In years past, I thought I only got that when programming the CTRE CAN devices (after installing the temporary diagnostic server). But after a restart, it went away. I upgraded to Phoenix 6 and started using Phoenix Tuner X this year. Is it possible there is something going on there?

Could you share a link to your code repo (and branch/commit id)? I have a roborio 1 that I can try to reproduce this on this afternoon. More eyes might help.

This is the second year the Java 17 JRE has been used by default. What changed this year was setting source and target compatibility to Java 17 instead of 11. That should have a much smaller impact than switching from the 11 JRE to the 17 JRE last year, although it’s certainly possible that there’s a regression. That should be easy enough to test, though. Since you’re using Kotlin, maybe it has more of an impact than it does for regular Java users.

As I’m sure you saw in the other thread, you can try changing to the same GC as last year using the instructions here: GitHub - wpilibsuite/GradleRIO: The official gradle plugin for the FIRST Robotics Competition

As a followup… Thanks for the help everyone. This does seem to be a memory fragmentation issue. I was able to spend some time in our Tuesday meeting looking at this. Everything seems to indicate we should have enough memory. Stopping the web server as discussed in the memory fragmentation thread (mentioned above) frees up about 19MB of memory. And with that shut down, we don’t have any memory issues. That said, we were just able to order a roboRIO 2.0 this morning. So that should fully alleviate the issue.
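
If anyone wants to check the effect on their own Rio, the numbers are easy to read from an ssh session; this is just the same /proc/meminfo data that shows up in the hs_err log:

grep MemAvailable /proc/meminfo
(stop the web server, either with the ssh commands mentioned earlier in the thread or via the Team Number Setter tool)
grep MemAvailable /proc/meminfo

For us the second reading came back roughly 19 MB higher.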

When I was troubleshooting this, I wanted to attach a profiler to the robot code to see if there was anything wonky in our memory usage. But the 32-bit arm linux agent from my profiler (YourKit) did not seem compatible. I wouldn’t mind taking another look at this in the off season when time is in better supply. Is there a profiler that can be used with the JRE running on the roboRIO? Or alternatively, is there some place to obtain the full JVM so that I could look at using jmap or jcmd? It’s not clear to me if this is just the standard OpenJDK, or some special version.

I used the IntelliJ profiler for simulation and found it to work well enough. Even found a memory consumption issue with REVLib in it.

You can also use VisualVM if you want to profile CPU usage. Idk if it does memory well.

Thanks. I did attach to the desktop simulation. But with so many things being "simulated", I wasn’t sure it was giving an accurate picture.

Even found a memory consumption issue with REVLib in it.

In their library? Or your code? If the library, has it been fixed? We use REV controllers for our swerve drive, and it’s when I turn those on that I get pushed over the edge on memory.

Thanks for the VisualVM link. I completely missed that in the docs.

It’s not in the docs yet :slight_smile: just a PR. I am linking to it now bc it’s a very good doc.

It’s in their library. It originates from REVLibError.fromInt, which is called from every setReference and setVoltage call, among others. You can read about it here.

We build it from the OpenJDK sources, but we build only the JRE, not the full JDK. GitHub - wpilibsuite/frc-openjdk-roborio: OpenJDK builds for FRC RoboRIO

What’s weird is that, from the little coding I know, 'std::system_error' seems like a C++ exception. Which is odd because the robot code is Java, right? I can’t find any reason for this.

While your robot program is written in Java, there’s lots of C++ code running behind the scenes. The JVM is written in C++, as are significant portions of WPILib and vendor libraries.

The interesting thing about that specific system_error message is that most of the references I can find for it indicate a failure to create a thread. That could be caused by low memory (since creating a thread requires stack allocation, etc.), but in general code shouldn’t be creating lots of threads continuously while it runs.
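
If someone wants to rule that out, one way is to watch the robot JVM's thread count over ssh. This is a rough sketch; I'm assuming pgrep is available on the Rio image and that the robot program is the only Java process running:

PID=$(pgrep -f java | head -n 1)
grep Threads /proc/$PID/status

Run that a few times, several seconds apart; a steadily climbing thread count would point at something creating threads continuously.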

It was appearing a lot in the same situation as the original post. We had the Rio writing the driver station logs to its own memory, so the insufficient memory error was common. It’s mostly fixed now, but it sometimes - pretty randomly - appears once or twice after the code crashes, and power cycling the robot fixes it.
