Neo not working during practice match

My team and I recently competed in ISR #3 and will compete tomorrow at the Israel DCMP. In both of these competitions a weird problem occurred: during the first practice match our intake didn’t open. We use a NEO 550 to control our intake, and it is the only NEO we have on our robot this year. We moved the intake before the match and everything worked fine, but during the match it didn’t work. The SPARK MAX’s status LED was solid cyan (as it should be) both times it happened. After the matches, we took the robot to the pit and turned the power back on without changing anything, and the problem disappeared. I checked the logs and found nothing out of the ordinary, and we didn’t change anything related to the problem before it occurred. We are worried it might happen again during qualifications or playoffs.

Does anyone know what might cause the problem? Does it seem to you like a code-related problem or an electronic issue?

TL;DR: Do you check the status codes for all of the parameters that you set? If not, do that, and retry if they fail. Also check your CAN bus wiring integrity.

This sounds similar to an issue we ran into; fortunately it had a fairly simple solution. At txwac our hood would sometimes not work for a match. Pre-match checks were fine, and when we brought it back to the pit and powered it on, it was fine. We had 19 NEO/NEO 550 motors on the bot (now it’s 20), but the problem always seemed to impact only the hood. The blink code would be solid.

We threw everything we had at this problem, and spent a lot of time debugging, and found the following things:

  1. There were some CAN wiring issues. Some were major, the most annoying being we had the ‘loop’ part of the V2 cable come out of the housing on some of the connectors. We repaired or replaced all of the CAN wiring to where we are confident that the wiring is good. The situation improved, but was not resolved. We suspect the wiring issues caused some CAN bus errors, but not enough to shut down the bus or prevent communication most of the time.
  2. We finally were able to reproduce the issue in a lab setting. Connecting to the REV Hardware Client showed that at least one parameter, kP, was set to the default 0.0 instead of the value we wanted. With kP at zero our position controller won’t move, so this is likely the root cause. However, in the lab we did get an error message in the driver station log, whereas the driver station logs from the matches where the hood did not move showed no CAN errors.
  3. We spent a lot of time trying many different ‘fake’ combinations of timing and other CAN traffic to try and cause this error and not have it log to the console. We also tried to get a similar error to occur and have the API incorrectly return REVLibError.kOk. All of these attempts were unsuccessful, and any errors were logged, and APIs returned some error.

Based on the above I suspect the root cause (at least for us) is as below:

  1. We start off all of our Spark Max initializations with a call to restoreFactoryDefaults(), which means all of the parameters, including the kP that we need for our control loop, are reset.
  2. Even though our ‘typical’ CAN bus utilization is at ~50-60%, during initialization lots of parameters are being set across all of our devices, so both CAN and CPU utilization are likely higher than normal. This is especially the case this year with the increase in the number of motor controllers.
  3. We assign CAN IDs based on priority, with drivetrain being the highest. The hood was fairly low, so it is more likely that this device would have an issue on a saturated bus compared to the higher priority devices. The even lower devices may not have had a parameter like kP that would keep it from moving. This actually matches your case too, since all REV devices are lower priority on the CAN bus compared to all CTR devices.
  4. Based on the above, for some reason the call to set kP to our desired value did not succeed. My best guess is the bus was saturated and the message was dropped. We did not previously check the return code, so the robot code did not attempt to set it again.
  5. Typically the robot is turned on before the driver station is connected when setting up on the field. I suspect that the failed parameter set did send a CAN error to the console, but since the driver station is not set up, and the field is not connected, the error message itself was dropped. (Can anyone confirm that this could happen, e.g. no buffering of errors?).

We made the following changes, and did not have a single issue from there, including all of txfor and all driver practice etc.

  1. Add a small delay (50ms) after the call to restoreFactoryDefaults(). I don’t think this actually did anything and we may remove it. Small delays like this during initialization are not a big deal though since there is plenty of time between power up and driving.
  2. Increase the CAN timeout from 20ms to 50ms with sparkMax.setCANTimeout(50). The thinking here is that each parameter call will wait that long for success. If there is an error, that may mean the CAN bus is saturated, so a longer timeout gives the bus more time to settle without having to add any waits.
  3. Check the status for every parameter set. If a parameter set fails, log it so we can actually confirm whether this happens, and retry. In our case, each controller has an initialization function, and the whole function fails if any single parameter fails. If the function fails for that controller, retry it up to 5 times. If it still fails after 5 attempts, display a big error and wish the drive team luck. (In practice a single retry has always been enough.)
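
The check-and-retry idea in item 3 can be sketched roughly as below. This is not our actual robot code, just a minimal, library-free illustration: the Status enum is a hypothetical stand-in for a controller library's return code (REVLibError in REVLib), and InitRetry, runAll and initWithRetry are names made up for the example.

```java
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical stand-in for a controller library's status code (e.g. REVLibError). */
enum Status { OK, ERROR }

final class InitRetry {
    /** Runs every step in order; reports failure on the first step that is not OK. */
    static boolean runAll(List<Supplier<Status>> steps) {
        for (Supplier<Status> step : steps) {
            if (step.get() != Status.OK) {
                return false;
            }
        }
        return true;
    }

    /**
     * Retries the whole initialization up to maxAttempts times, logging each failure.
     * Returns the number of attempts used, or -1 if every attempt failed.
     */
    static int initWithRetry(String name, List<Supplier<Status>> steps, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (runAll(steps)) {
                return attempt;
            }
            System.err.println("WARN: " + name + ": failed to initialize, attempt "
                    + attempt + " of " + maxAttempts);
        }
        System.err.println("ERROR: " + name + ": giving up after " + maxAttempts + " attempts");
        return -1;
    }
}
```

In real robot code each step would wrap one configuration call and map its return value to OK or ERROR, assuming the library call returns a status code. Failing the whole function on the first bad step keeps the retry logic in one place instead of around every setter.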

Thank you very much! We will try your solutions and hopefully the problem won’t happen again.

Mind sharing your code for this?

We’re not quite ready to share our code base for this year, but I can post the relevant portions.

Here is an example log message that we got before the start of Elimination 13 at txfor (i.e. this chunk of code was the difference between our team being able to shoot at all or not during our first semi final match.)

[UserLog] 2022-02-05 21:03:39 - Spark Max - WARN: Spark Max ID 21: Failed to initialize, attempt 0 of 5

ID 21 is our indexer. This is the only warning we got for that match.

Where do you call setCANTimeout in LabVIEW? I’m not seeing it.

I believe we are sporadically seeing similar behavior.

@Omri058 I would add that you should consider calling burnFlash() after configuring your controllers. Since neither of you shared your complete code, I don’t know if you are already doing this.

It’s a best practice to call burnFlash() to make sure the configuration persists across controller power cycles. Since robot initialization code only runs once, if the controller loses power after initialization the configuration will be lost. It seems less likely this is the problem if the intake failed from the very beginning of the match, but it’s possible.
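
To make the persistence point concrete, here is a toy model of the behavior described above. It is only an illustration, not REV's firmware or API: ToyController and its methods are invented for the example, with settings held in a volatile map and copied to a second "flash" map when burnFlash() is called.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model: volatile (RAM) settings plus a persisted (flash) copy. Not a real API. */
final class ToyController {
    private static final double DEFAULT = 0.0;
    private final Map<String, Double> ram = new HashMap<>();
    private final Map<String, Double> flash = new HashMap<>();

    /** Changes a setting in volatile memory only. */
    void set(String key, double value) {
        ram.put(key, value);
    }

    /** Reads the currently active setting, or the default if it was never set. */
    double get(String key) {
        return ram.getOrDefault(key, DEFAULT);
    }

    /** Copies the currently active settings to persistent storage. */
    void burnFlash() {
        flash.clear();
        flash.putAll(ram);
    }

    /** Simulates a brownout or power cycle: active settings are reloaded from flash. */
    void powerCycle() {
        ram.clear();
        ram.putAll(flash);
    }
}
```

In this model, setting kP and then power cycling leaves kP at the default, while setting kP, burning flash, and then power cycling keeps the value. That is the failure mode to worry about if a controller browns out mid-match after robot initialization has already run.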

Thanks for the advice!

Hey Will, This was very helpful. Thank you for sharing it!

I’m trying to find where this information about .burnFlash() is documented. REV’s GitHub repo of example code uses this call exactly zero times, and the API documentation for .burnFlash() doesn’t provide any motivation for why or when you’d want to call it.

I also want to amplify @Will_Toth’s reply, as it is very enlightening on debugging intermittent Spark Max failures. We had one in a practice match last month at Chezy. We guessed that it was a missed CAN configuration setting on startup. Our clue was that one or both of the Spark Max controllers wouldn’t be in brake mode (which can be diagnosed via the status lights). We started by simply reading back the kP value and re-initializing if it wasn’t correct. After that our problem went away.

I think this is the best you’re going to get: Migrating from CTRE Phoenix to SPARK MAX - SPARK MAX

For the fastidious, complete checking of code operations could include verifying each set parameter with a get to compare with the intended set.

I vaguely remember reading recently (sorry, I don’t remember the reference or any details) someone pointing out some WPILib parameter setter did not have a corresponding getter. The response was the program should know what it set so a getter isn’t needed. That seems short-sighted and discounts the cleverness of users to create value and competitive advantage with features as @Will_Toth has demonstrated in the painfully hard area of troubleshooting.

Forget competitive advantage - everyone should have robust code even if it takes a little extra time and makes the code look ugly. It’s always much more fun to see the robot moving and not as inert as a brick.

We think we’ve been running into similar Spark Max parameter setting bugs all season. We’ve just implemented a new error checking procedure during robotInit(). Now for every (readable) parameter:

  1. We set the parameter and wait a few milliseconds
  2. We interrogate the Spark to see if the parameter matches the value we set. (originally we tried just checking for the “OK” return value, but some errors still seemed to slip through)
  3. If the parameter doesn’t match, we repeat this up to 5 times.
  4. If the parameter still doesn’t match, we display a red light on the dashboard warning the drivers which subsystem didn’t initialize properly so they can restart the code, or go check the robot before the match starts. (In that way, it works a bit like the SendableChooser “checkbox” in Shuffleboard.)
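
A rough sketch of steps 1 to 3, written without any robot library so it can be run anywhere. The setter and getter are passed in as functions, and VerifiedSet and setAndVerify are hypothetical names for this example, not our team's code. The tolerance is an assumption: a controller may store values at lower precision than a Java double, so an exact comparison could fail even when the write succeeded.

```java
import java.util.function.DoubleConsumer;
import java.util.function.DoubleSupplier;

final class VerifiedSet {
    /**
     * Writes a value, waits briefly, reads it back, and repeats until the
     * read-back matches the target within tolerance or maxAttempts is used up.
     * Returns true if the value was confirmed.
     */
    static boolean setAndVerify(DoubleConsumer setter, DoubleSupplier getter,
                                double target, double tolerance,
                                int maxAttempts, long waitMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            setter.accept(target);
            try {
                Thread.sleep(waitMillis); // give the device a moment before reading back
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
            if (Math.abs(getter.getAsDouble() - target) <= tolerance) {
                return true;
            }
        }
        return false;
    }
}
```

A false return is the cue for step 4: flag the subsystem on the dashboard rather than failing silently.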

All this 3-way communication error checking made for some pretty ugly code, but I hope it prevents the same kind of errors that killed our last two playoff matches at DCMP.

Given that we’re not the only ones to experience this issue, is there any chance that a future REV library would build in this kind of parameter-set error-checking? @dyanoshak @Will_Toth @Greg_Needel @EricLeifermann ?

We do exactly that sort of error checking, and started it last year when we found that very high CAN utilization could cause motors to miss being initialized.

We have a pretty sophisticated wrapper around the SPARK MAX API: Swerve/SparkMax.cpp at master · Jagwires7443/Swerve · GitHub. FWIW, it’s been offered to REV as a possible starting point.

This manages all config settings via a state machine which runs in a Periodic() function (this could easily be moved to a background thread). It does this partly to spread out the CAN bus traffic, partly to limit the amount of work done in constructors, partly to take care of all the initialization while the robot is sitting there disabled, and partly so it can handle any motor controller reboots. One consideration is that not all config settings are persisted – in particular, the periodic status frame periods are volatile. These settings can be important, and this code also takes care of this complication.

In particular, this provides API calls to update config, to check config, and to persist config settings. Config settings are not handled by a large collection of discrete API calls, but by passing lists of key/value pairs through a very small API surface. We normally only burn config settings in test mode, when a Shuffleboard widget is clicked. But we check and can apply config settings each time the robot starts up, or in case of motor controller restarts.
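
For readers who want the general shape without reading the C++, here is a much simplified Java sketch of the idea: desired settings are key/value pairs, and a periodic method checks at most one key per call so the traffic is spread out. This is not the linked wrapper. ConfigDevice and PeriodicConfigurator are hypothetical names, and the device interface is an assumption standing in for real CAN reads and writes.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical device abstraction standing in for real controller reads and writes. */
interface ConfigDevice {
    boolean write(String key, double value); // true if the device accepted the write
    double read(String key);                 // current value held by the device
}

final class PeriodicConfigurator {
    private static final double TOLERANCE = 1e-6;
    private final ConfigDevice device;
    private final Map<String, Double> desired;
    private final Deque<String> pending = new ArrayDeque<>();

    PeriodicConfigurator(ConfigDevice device, Map<String, Double> desired) {
        this.device = device;
        this.desired = new LinkedHashMap<>(desired);
        pending.addAll(this.desired.keySet());
    }

    /** Call once per robot loop; handles at most one key per call. */
    void periodic() {
        String key = pending.poll();
        if (key == null) {
            return; // nothing left to check
        }
        double want = desired.get(key);
        if (Math.abs(device.read(key) - want) <= TOLERANCE) {
            return; // already correct, key stays off the queue
        }
        device.write(key, want); // result ignored: the next read is the real check
        pending.add(key);        // re-queue so a later loop verifies it
    }

    boolean isDone() {
        return pending.isEmpty();
    }

    /** Re-queues every key, e.g. after a suspected controller reboot. */
    void recheckAll() {
        pending.clear();
        pending.addAll(desired.keySet());
    }
}
```

Because errors are handled by re-queueing rather than thrown, the calling code keeps running even if the device never responds, which matches the "errors are handled, not propagated" approach. A real version would also need to detect reboots and treat volatile settings separately.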

This wrapper also takes care of error handling. This is the subject of philosophical debates, but in this wrapper, there are a couple of invariants: code runs independent of any errors (up to running with no CAN bus, no actual SPARK MAX devices, etc.); and, errors are handled, rather than propagated outward. All of these attributes make the API really easy to use (it’s C++, but it would be easy to offer Java bindings also). They also make things very resilient. And, as a bonus, the idea here is that this API could be used with any smart motor controller – only the config parameter data would need to be specific to a particular type of motor controller.

The main point here is that there are different approaches that can be taken. If REV were to offer an API that removed most of the motivation for developing this, we’d certainly move over to it. Until/unless something like this happens, it’s very likely we will continue to roll our own. We have very little reason to use the REV Hardware Client to manage our config settings with this approach – we only use it to assign CAN IDs.

So I only ever saw issues here when we had 20 Spark Maxes last year plus the PDH and PH. I haven’t seen any hints of errors this year in our logs. We have 13 Spark Maxes, plus the PDH.

I also spent a decent amount of time trying to get a call to return OK without actually setting the value, and was unable to, even with some extreme test conditions. The only exception was when setting the CAN timeout to 0 (since this is non-blocking and defers the error check). At the lowest level, the API call for setting parameters in blocking mode checks the return status from the controller, and only returns kOk when it confirms the parameter was set. If you were seeing something different it would be interesting to hear about, especially if you have some logs or anything. If you’re not already logging all the retries and error behaviors, I would be sure to add that, and send to [email protected] ideally with code if possible.

Also our wrapper from last year is in our code https://github.com/FRC3005/Rapid-React-2022

I am reporting this 2nd hand, so I may be missing critical details:

We found at least two parameters, voltage compensation and setInverted(), that appeared to not get set properly sometimes despite returning kOk. We think one cause of certain motors “not working” (not moving when commanded, and showing a constant purple light instead of flashing green or red) was that the voltage compensation parameter was set to 0.

We have 18 devices on the CAN bus this year, which is a few more than previous years. CAN utilization peaks at ~70% according to the logs. That’s not enough to be dropping CAN messages I thought, but we’ve had this parameter-setting problem regardless.

Other teams I know who’ve had this issue have opted not to reset any parameters on init for this reason, opting to trust that the parameters have been set and burned correctly already.

Inspired by a post I saw @Will_Toth make, I wrote an implementation of their concept of initialization checking. Basically, you check that every initialization message responds with kOk, and if it doesn’t you send the message again. This means that even with near 100% CAN bus usage during startup we can ensure all motors are properly initialized. The underlying problem could be an instantaneous spike to 100% during initialization.

Code linked here.
