CANSparkMax/CANEncoder crashes with SIGSEGV when CAN device is not connected

We frequently test our robot code on a RoboRIO that is not plugged into the rest of the robot, which means that some or all of our normal motor controllers will be absent from the CAN bus.

We expect CAN errors in our logs, and this is fine as we’re only testing certain subsystems. But what actually happens is sometimes our robot code crashes when accessing a CANEncoder for a missing motor controller (probably inside JNI).

I ran a simple test with the following sample code, which was enough to reproduce the problem:

package frc.robot;

import com.revrobotics.CANEncoder;
import com.revrobotics.CANSparkMax;
import com.revrobotics.CANSparkMaxLowLevel.MotorType;

import edu.wpi.first.wpilibj.TimedRobot;

public class Robot extends TimedRobot {

  // Note a device with id 1 is not present on the CAN network
  CANSparkMax spark = new CANSparkMax(1, MotorType.kBrushless);
  CANEncoder encoder = spark.getEncoder();

  @Override
  public void robotPeriodic() {
    for (int i = 0; i < 5; i++) {
      encoder.getVelocity();
    }
  }
}

An example error looks like this:

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x4143202c, pid=7939, tid=7976
#
# JRE version: OpenJDK Runtime Environment (11.0.4.10) (build 11.0.4.10-frc+0-2020-11.0.4u10-2)
# Java VM: OpenJDK Client VM (11.0.4.10-frc+0-2020-11.0.4u10-2, mixed mode, concurrent mark sweep gc, linux-)
# Problematic frame:
# C 0x4143202c
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid7939.log

Here’s the referenced error log: hs_err_pid7939.log.txt (39.2 KB)

Is there any way to make this a warning instead of a hard crash? I suspect it’s just a bug in the JNI bridge (it looks like a null pointer dereference).

This topic seems to be related; however, it discusses an allocation issue, whereas I’m able to reproduce this problem with only a single .getEncoder() call at the beginning of the program:

I suspect this is the same issue. It may not have been the allocation of the new encoder object that caused the problem, but rather the attempt to interface with the nonexistent Spark MAX.

I do understand that in that post it didn’t crash without the allocation every time; however, this isn’t something I tested myself.

As far as I am aware, the issue is related to the CAN TX queue on the RoboRIO filling up (due to there being nothing connected) and then trying to access the Spark MAX.

Did you have a PDP or anything else connected to the CAN bus on the RoboRIO?

I’ll see if I can reproduce this tonight.

Here’s a temporary workaround. The crash appears to happen only at startup; it won’t occur if a Spark MAX that’s already configured is later removed from the chain.

After you initialize the Spark MAX object in code, surround all your configuration (including the getEncoder() call) in your class constructor with a simple check.

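public ExampleSubsystem() {
    exampleSparkMax = new CANSparkMax(id, motortype);
    // A firmware string of "v0.0.0" is taken to mean the controller is not present
    if (!exampleSparkMax.getFirmwareString().equals("v0.0.0")) {
        // your config here (including the getEncoder() call)
    }
}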

This way at least it won’t crash your entire code with a vague SIGSEGV.

Yes, the PDP and one other Spark MAX were connected when I ran my test.

Thanks for the suggestion, but while this might prevent a getEncoder() call from doing something bad, it would also require us to add null checks in many other places in our code that try to call methods on the encoder. It would be preferable if the objects themselves just returned default values without crashing. (We could consider creating a subclass or wrapper object which would modify the behavior, but ideally the bug(s) could get fixed and then it would just work :) )
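For reference, here is a minimal sketch of what such a wrapper could look like. It is only an illustration: GuardedSensor and GuardedSensorDemo are hypothetical names, not part of the REV API, and the sketch assumes that presence is decided once at startup (for example with the firmware-string check suggested above) rather than on every read.

```java
import java.util.function.DoubleSupplier;

/**
 * Hypothetical wrapper (not part of the REV API): returns a default value
 * instead of touching a device that was absent at startup.
 */
final class GuardedSensor {
    private final boolean present;        // decided once, at construction
    private final DoubleSupplier reading; // the real read, e.g. a call to getVelocity()
    private final double defaultValue;    // returned when the device is absent

    GuardedSensor(boolean present, DoubleSupplier reading, double defaultValue) {
        this.present = present;
        this.reading = reading;
        this.defaultValue = defaultValue;
    }

    /** Performs the real read only if the device was present at startup. */
    double get() {
        return present ? reading.getAsDouble() : defaultValue;
    }
}

final class GuardedSensorDemo {
    public static void main(String[] args) {
        // Absent device: the supplier is never invoked, so the throw is never reached.
        GuardedSensor missing = new GuardedSensor(
            false,
            () -> { throw new IllegalStateException("device read attempted"); },
            0.0);
        System.out.println(missing.get());

        // Present device: the read is passed through unchanged.
        GuardedSensor attached = new GuardedSensor(true, () -> 1234.5, 0.0);
        System.out.println(attached.get());
    }
}
```

In robot code the supplier would wrap the actual encoder read, and the encoder itself would only be fetched when the controller is present. The cost is that callers can no longer tell a real zero reading from a missing device unless they also check for presence.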

Interesting that this is the case. The issue I created my other thread for occurred when nothing was connected to the CAN bus. Since you do have devices on the bus, I think this is a different problem.

Have you checked that you are running the latest versions of WPILib and the REV API, and that all the devices you have connected are running the latest versions of their respective firmware?

With a single call in robotPeriodic(), I cannot reproduce the crash. With two calls, it does crash.

I managed to track down the root cause of the segfault earlier this week, and submitted my findings to REV. They’ve mentioned they’re reworking the error reporting to eliminate the issue entirely.

The segfault is reproducible with some of REV’s simple examples as well.

Thanks! What’s the root cause?

There was a buffer overflow in c_SparkMax_FlushErrors() that would occur if too many errors of the same type had accumulated since its last call.
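REV's source is not shown in this thread, so the following is only a generic illustration of the class of bug described, not their code or their fix. It is written in Java, where an unchecked write past the end of an array would throw ArrayIndexOutOfBoundsException rather than corrupt memory as it can in native code. BoundedErrorBuffer is a hypothetical name; the point is simply that accumulation between flushes has to be capped at the buffer's capacity.

```java
/** Hypothetical fixed-capacity error buffer; a sketch, not REV's implementation. */
final class BoundedErrorBuffer {
    private final int[] codes; // fixed-size storage for accumulated error codes
    private int count;         // number of codes currently stored
    private int dropped;       // number of codes discarded because the buffer was full

    BoundedErrorBuffer(int capacity) {
        this.codes = new int[capacity];
    }

    /** Stores the code if there is room; otherwise counts it as dropped. */
    void record(int code) {
        if (count < codes.length) {
            codes[count++] = code;
        } else {
            dropped++; // without this bounds check the write would run past the array
        }
    }

    /** Returns a summary of what accumulated since the last flush, then resets. */
    String flush() {
        String summary = count + " stored, " + dropped + " dropped";
        count = 0;
        dropped = 0;
        return summary;
    }
}

final class BoundedErrorBufferDemo {
    public static void main(String[] args) {
        BoundedErrorBuffer buffer = new BoundedErrorBuffer(3);
        for (int i = 0; i < 5; i++) {
            buffer.record(42); // same error type repeated between flushes
        }
        System.out.println(buffer.flush());
    }
}
```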