Drivetrain not running in auto when connected to FMS

Team 8 was competing at the Monterey Bay Regional this past weekend and had a strange issue during auto where all subsystems would run except the drivetrain. This only occurred when connected to the FMS, and we could not replicate it in the pits or on the practice field despite numerous attempts. It sounds awfully similar to 1690's intake issue (but with Falcons instead of NEOs).

Robot background
  • RoboRIO 2.0
  • 4 drivetrain Falcons, 2 shooter Falcons, NEOs for everything else
  • Drive subsystem uses Ramsete and Talon FX velocity mode (a rough sketch of this setup follows the list)
  • Paths are loaded from PathWeaver JSONs
  • Custom routine architecture based on 254’s code
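
For context, the Ramsete to Talon FX velocity handoff looks roughly like the sketch below. This is heavily simplified and uses placeholder constants, CAN IDs, and class names rather than our actual code.

```java
// Simplified sketch of Ramsete output feeding Talon FX velocity mode.
// All constants and CAN IDs here are placeholders, not our real values.
import com.ctre.phoenix.motorcontrol.ControlMode;
import com.ctre.phoenix.motorcontrol.can.TalonFX;
import edu.wpi.first.math.controller.RamseteController;
import edu.wpi.first.math.geometry.Pose2d;
import edu.wpi.first.math.kinematics.ChassisSpeeds;
import edu.wpi.first.math.kinematics.DifferentialDriveKinematics;
import edu.wpi.first.math.kinematics.DifferentialDriveWheelSpeeds;
import edu.wpi.first.math.trajectory.Trajectory;

public class DrivePathFollower {
  private static final double kTrackWidthMeters = 0.6;     // placeholder
  private static final double kWheelRadiusMeters = 0.0762; // placeholder (3 in)
  private static final double kGearRatio = 10.0;           // placeholder
  private static final double kTicksPerRev = 2048.0;       // Falcon integrated encoder

  private final TalonFX leftLeader = new TalonFX(1);  // placeholder CAN IDs
  private final TalonFX rightLeader = new TalonFX(2);
  private final RamseteController ramsete = new RamseteController();
  private final DifferentialDriveKinematics kinematics =
      new DifferentialDriveKinematics(kTrackWidthMeters);

  /** Called each loop while following a trajectory. */
  public void follow(Trajectory.State desiredState, Pose2d currentPose) {
    ChassisSpeeds speeds = ramsete.calculate(currentPose, desiredState);
    DifferentialDriveWheelSpeeds wheelSpeeds = kinematics.toWheelSpeeds(speeds);

    // Talon FX velocity mode expects sensor ticks per 100 ms.
    leftLeader.set(ControlMode.Velocity,
        metersPerSecToTicksPer100ms(wheelSpeeds.leftMetersPerSecond));
    rightLeader.set(ControlMode.Velocity,
        metersPerSecToTicksPer100ms(wheelSpeeds.rightMetersPerSecond));
  }

  private static double metersPerSecToTicksPer100ms(double metersPerSec) {
    double wheelRotationsPerSec = metersPerSec / (2.0 * Math.PI * kWheelRadiusMeters);
    double motorRotationsPerSec = wheelRotationsPerSec * kGearRatio;
    return motorRotationsPerSec * kTicksPerRev / 10.0; // per 100 ms
  }
}
```
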
Summary of issue throughout competition
  • Auto worked correctly during Ventura and 10+ practice matches on Thursday at MBR.
  • The issue was first seen in Q5 when running 4 ball auto. It occurred again in Q12, so we reverted to Thursday's code, which lacked a few print statements.
  • 4 ball auto drove in Q17 so nothing was modified afterwards.
  • 2 ball auto was deployed for Q23 and the issue came back. The issue persisted when 4 ball auto was run in Q30 and 2 ball auto in Q34. We added a delay before the beginning of auto following the Q34 failure.
  • 4 ball auto worked in Q38. Drive team noted they had restarted the driver station app before connecting in both Q38 and Q17. They began to do this in every subsequent match.
  • 4 and 2 ball autos worked from Q43 (Fri) through Q63 (Sat). The delay before auto was removed and PathWeaver paths were adjusted. Autos continued to work.
  • 4 ball auto failed again in Q67. We moved PathWeaver JSON loading to autoInit to limit loop overruns, since file I/O can be slow. We also removed an unused call to DriverStation.getGameSpecificMessage().
  • 4 ball auto failed in QF4M1, so we swapped driver stations and added the delay back.
  • 4 ball auto worked in QF4M2 and failed again in QF4M3 with no code changes.

Based on the limited information we were able to gather, I think we have ruled out a few possibilities:

  • Loop overruns — we got rid of the print statements and file I/O causing loop overruns and still saw the issue.
  • Driver station app — problem resurfaced even while we were restarting the driver station before matches.
  • Custom routine framework — we were able to confirm with print statements that the drive routine began and finished as expected during auto failures in quarterfinals.

Possible causes:

  • Our CAN bus utilization is fairly high (around 80%), so perhaps some of the configuration commands are not going through. However, we did not see any CAN errors in any DS logs.
  • After Ventura, we went back to default control/status frame timings on the leader motors in an attempt to reduce CAN traffic. We also changed the control and status frame timings to 40 ms and 255 ms, respectively, on the follower motors (a rough sketch of this follows the list).
  • We were on WPILib 2022.3.1 instead of 2022.4.1 (but it functioned perfectly at Ventura).
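
For reference, the follower frame timing change looks roughly like this with the Phoenix 5 API. The specific frames being slowed and the CAN IDs here are simplified placeholders, not our exact configuration.

```java
// Rough sketch of the follower control/status frame timing change (Phoenix 5 API).
// The exact frames and IDs are placeholders, not our real configuration.
import com.ctre.phoenix.motorcontrol.ControlFrame;
import com.ctre.phoenix.motorcontrol.StatusFrameEnhanced;
import com.ctre.phoenix.motorcontrol.can.TalonFX;

public class DriveFollowerConfig {
  public static void configureFollower(TalonFX follower, TalonFX leader) {
    follower.follow(leader);

    // Send the follower's control frame less often (Phoenix default is 10 ms).
    follower.setControlFramePeriod(ControlFrame.Control_3_General, 40);

    // Slow the general and feedback status frames to 255 ms, since the robot
    // code does not read feedback from the followers.
    follower.setStatusFramePeriod(StatusFrameEnhanced.Status_1_General, 255, 100);
    follower.setStatusFramePeriod(StatusFrameEnhanced.Status_2_Feedback0, 255, 100);
  }
}
```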

Neither students nor mentors are sure why this issue occurs on the field but not at our lab or tethered in the pit. What is different about being connected to the FMS? We are also unsure how to replicate and debug the issue without an FMS, but we worry it could persist at off-season events or even into next season.

Any insight or advice on how to move forward is welcome. Our students can provide more info as needed or link to specific parts of the robot code. Thanks!

2022 codebase

A little off-topic, but we have around 80% CAN utilization as well; we use 2 NEOs and the rest of our motors are Falcons or 775s driven by Talon SRXs. Every time the robot boots up, we get a few CAN errors from both the SPARK MAXes and the CTRE motor controllers, which doesn't seem to be a problem since the controllers that report errors can still be configured properly. Do you see anything similar? We had an issue with a SPARK MAX not driving its motor under load, but that appears to have been solved when we replaced it with a new SPARK MAX. Also, are you able to see your Talon FX firmware version in the diagnostics tab of the FRC Driver Station? We see ours as "Inconsistent" even though all firmware versions are up to date; I suspect the SPARKs may be the reason for this.

Unfortunately I don't have access to our driver station at the moment, so I can't check the logs or the diagnostics tab, but I do remember some errors on boot. I don't think those are the issue, however, since we ran with nearly the same code, the same errors, and the same drivetrain at Ventura, and our drivetrain did run during teleop at Monterey, which leads me to believe it probably isn't a case of motors being unable to work under load.

Based on the video and your description, my gut says this is not a CAN issue. Since the robot operates open loop in teleop, the parameter sets for all of the gain parameters, or the feedback sensor config, would have had to fail for two drive motors but not for the other motors. You may not see CAN errors logged, though, based on my hypothesis in the 1690 thread you linked. Checking the status returns for the parameter sets and logging them yourself may be the better option.

If you do want to go the extra step and try to confirm this, after a match where this happens, don’t power off the robot. Instead hook up the Phoenix Tuner and dump all the parameters to confirm that the gains and output ranges are set correctly. You could also print out these parameters in disabledInit or similar just to confirm during each step (before the match, between auton and teleop, and after the match).
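
Something along these lines would cover both suggestions: log the ErrorCode returned by each gain config call, and read the gains back from the controller to confirm they stuck. The slot index, timeout, and reporting style are just examples, not anything specific to your code.

```java
// Sketch: check the status return of each config call and read the gains back.
// Slot, timeout, and the reporting style are examples only.
import com.ctre.phoenix.ErrorCode;
import com.ctre.phoenix.ParamEnum;
import com.ctre.phoenix.motorcontrol.can.TalonFX;
import edu.wpi.first.wpilibj.DriverStation;

public class TalonConfigChecker {
  private static final int kSlot = 0;
  private static final int kTimeoutMs = 100;

  /** Set velocity gains and report any config call that did not return OK. */
  public static void configureGains(TalonFX talon, double kP, double kI, double kD, double kF) {
    check(talon.config_kP(kSlot, kP, kTimeoutMs), "kP");
    check(talon.config_kI(kSlot, kI, kTimeoutMs), "kI");
    check(talon.config_kD(kSlot, kD, kTimeoutMs), "kD");
    check(talon.config_kF(kSlot, kF, kTimeoutMs), "kF");
  }

  private static void check(ErrorCode code, String name) {
    if (code != ErrorCode.OK) {
      DriverStation.reportError("Talon config " + name + " failed: " + code, false);
    }
  }

  /** Read the gains back from the controller, e.g. from disabledInit(). */
  public static void printGains(TalonFX talon) {
    double p = talon.configGetParameter(ParamEnum.eProfileParamSlot_P, kSlot, kTimeoutMs);
    double f = talon.configGetParameter(ParamEnum.eProfileParamSlot_F, kSlot, kTimeoutMs);
    System.out.println("Talon " + talon.getDeviceID() + " kP=" + p + " kF=" + f);
  }
}
```

Calling printGains from disabledInit gives you a snapshot before the match, between auto and teleop, and after the match.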

I also don't suspect the CAN frame timing, as I would expect something to happen instead of nothing, unless the drive motors were fighting each other or something. But if the motors were fighting, I'd expect that to occur all match. You could log motor current and applied output to debug. The new WPILib logging framework would be a good candidate for this logging if you don't already have something in place.
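
A minimal sketch of that logging with the WPILib data log (2022+), assuming DataLogManager.start() is called in robotInit(); the entry names and the signals chosen are just examples.

```java
// Example: per-loop logging of drive motor signals with the WPILib data log.
// Entry names and the signals recorded are examples only.
import com.ctre.phoenix.motorcontrol.can.TalonFX;
import edu.wpi.first.util.datalog.DataLog;
import edu.wpi.first.util.datalog.DoubleLogEntry;
import edu.wpi.first.wpilibj.DataLogManager;

public class DriveMotorLogger {
  private final TalonFX motor;
  private final DoubleLogEntry appliedOutput;
  private final DoubleLogEntry statorCurrent;
  private final DoubleLogEntry velocity;

  public DriveMotorLogger(TalonFX motor, String name) {
    this.motor = motor;
    DataLog log = DataLogManager.getLog();
    appliedOutput = new DoubleLogEntry(log, "/drive/" + name + "/appliedOutput");
    statorCurrent = new DoubleLogEntry(log, "/drive/" + name + "/statorCurrent");
    velocity = new DoubleLogEntry(log, "/drive/" + name + "/velocityTicksPer100ms");
  }

  /** Call every loop, e.g. from robotPeriodic(). */
  public void log() {
    appliedOutput.append(motor.getMotorOutputPercent());
    statorCurrent.append(motor.getStatorCurrent());
    velocity.append(motor.getSelectedSensorVelocity());
  }
}
```

By default the log is written to a USB stick if one is present, otherwise to the RIO's internal storage, and can be pulled off after the match.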

Any ideas on how to test this without an FMS? Team 8’s season is over, and we won’t have any FMS-connected matches until the fall offseason.

Next time we have lab, we’ll double check the status returns.

Honestly, log everything that you can. (I didn't look through all the code, so you may be doing some of this already.) Log the states of your drive control loops, including setpoints and actual output velocity/position. Log the faults returned by the motor controllers. Log the trajectory that is loaded and the auton selected. That way you can easily say whether the motor controller was even told to go or not. Logging locally on the robot to a thumb drive is a nice approach.
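
For the faults and auto-selection pieces, something like this would work, assuming DataLogManager.start() has been called in robotInit() as in the earlier sketch; the names here are illustrative only.

```java
// Example: one-shot log messages for the selected auto and any live drive faults.
// Method and message formats are illustrative only.
import com.ctre.phoenix.motorcontrol.Faults;
import com.ctre.phoenix.motorcontrol.can.TalonFX;
import edu.wpi.first.wpilibj.DataLogManager;

public class AutoDebugLogging {
  /** Call from autonomousInit() after the routine is chosen and its paths are loaded. */
  public static void logAutoSelection(String autoName, int trajectoryStates) {
    DataLogManager.log("Auto selected: " + autoName
        + " (" + trajectoryStates + " trajectory states loaded)");
  }

  /** Call periodically (or in disabledInit/teleopInit) to record any live drive faults. */
  public static void logFaults(TalonFX talon, String name) {
    Faults faults = new Faults();
    talon.getFaults(faults);
    if (faults.hasAnyFault()) {
      DataLogManager.log("Drive " + name + " faults: " + faults);
    }
  }
}
```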

OP mentioned being worried about this happening in future seasons too. The WPILib logging framework is going to get more tooling support, and some NT stuff may evolve as well, so this offseason could be a good time to check out those changes and implement it. Much easier to debug with more information.

One last thing I totally forgot: I mentioned logging motor current, but you can already grab this from the driver station logs. Pull up Q5 and see if there was any current draw on any of those motor controllers.

Hey Will, thanks a bunch for the additional suggestions. As mentioned, we will check the returned values from setting the PID gains. However, we have not experienced this issue anywhere besides on the field at Monterey, so without being connected to the FMS I'm not positive we'll see any error codes.

I recall looking at the PDP current in the DS logs; when the drivetrain didn't run, I don't think there was any current draw. We can certainly review the logs again, but I'm fairly certain it was not the motors fighting against one another.

Besides building out our logging infrastructure for the next competition, is there anything that could be different when connected to the FMS? Our understanding is that it just runs auto followed by teleop, more or less.

Some others can chime in here better than I can, but my understanding is similar to yours. The main differences are that the bandwidth limit and port blocking are enforced. The driver station networking is also generally different, with the addition of a VLAN for each robot, but this shouldn't impact anything. I believe the radio configuration tool can be set to simulate the bandwidth limit and the port blocking. I can't think of a reason that any of this would cause the issue you saw.

Can you point out which auton in the code is running?

The two main autos we were running at Monterey were the 4 ball auto and the 2 ball auto. Both showed symptoms of the same issue: all of the subsystems were running except for the drive.

One thing that jumps out at me with your auto routine is the use of a try/catch around the whole thing. Do you know why you added that?

You should definitely add some logging of the exception if you do hit it, so you know when it happens. Even a basic System.out.println("Hey, something bad happened"); would do. It's hard to say whether it hit the exception, since I don't know how your scheduler behaves if some of the actions are scheduled but not others, i.e. did some intake actions get scheduled while the drive routine threw an exception?
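
Something as small as this would do, assuming the try/catch wraps the routine construction; the class, field, and method names here are placeholders for wherever that code actually lives.

```java
// Example: make the swallowed exception loud. Names are placeholders.
import edu.wpi.first.wpilibj.DriverStation;

public class AutoSetup {
  private Runnable autoRoutine; // stand-in for the team's routine type

  public void buildAuto() {
    try {
      autoRoutine = buildFourBallRoutine(); // placeholder for routine construction + path loading
    } catch (Exception e) {
      // Shows up in the DS console and DS log with a stack trace.
      DriverStation.reportError("Auto routine construction failed: " + e, e.getStackTrace());
    }
  }

  private Runnable buildFourBallRoutine() throws Exception {
    return () -> {}; // placeholder
  }
}
```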

That's a good point, although I don't think the code was getting caught there, because the robot was shooting the first ball and moving the intake down to get a second ball while the drivetrain was not physically moving (so the routine was being returned properly, but something about the drive subsystem was not running).

The try/catch is just used to catch file I/O exceptions when constructing the auto routine and loading paths, but since the rest of the auto runs, I do not think an exception is thrown during construction.
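
For reference, the path loading looks roughly like the sketch below, with the exception logged instead of silently caught; the file name, directory layout, and fallback are simplified placeholders rather than our exact code.

```java
// Rough sketch of loading a PathWeaver JSON with the IOException logged.
// The file name, directory layout, and fallback are placeholders.
import java.io.IOException;
import java.nio.file.Path;

import edu.wpi.first.math.trajectory.Trajectory;
import edu.wpi.first.math.trajectory.TrajectoryUtil;
import edu.wpi.first.wpilibj.DriverStation;
import edu.wpi.first.wpilibj.Filesystem;

public class PathLoader {
  public static Trajectory load(String jsonName) {
    Path path = Filesystem.getDeployDirectory().toPath().resolve("paths").resolve(jsonName);
    try {
      return TrajectoryUtil.fromPathweaverJson(path);
    } catch (IOException e) {
      // Log loudly and fall back to an empty trajectory so the failure is obvious in the DS log.
      DriverStation.reportError("Failed to load path " + jsonName + ": " + e, e.getStackTrace());
      return new Trajectory();
    }
  }
}
```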

We will try checking the bandwidth limit option when reflashing the radio and see if that has any effect.

Any further suggestions you or others have on how to mimic the FMS setup as closely as possible are appreciated. Would the off-season FMS or Cheesy Arena be feasible to set up in a lab?

I don't have any experience with either of the FMS setups, but you could try it; the off-season one is probably the better call, since I would expect it to be closer to the official FMS.

However, I do warn that I've seen a number of cases where folks (myself included) blame being on the field connected to the FMS as the difference that causes the issue, only to find that this is not actually the case and it was just coincidence. (One recent example that comes to mind: Why you probably shouldn't use the second port on your OpenMesh OM5P Radio and embrace using an Ethernet Switch instead.) I only point this out because the effort to replicate the FMS may be more than expected and may not actually turn up any results.

You do have a lot of synchronization happening, and please forgive my ignorance here in Java land, but one of the primary motivators for moving from 2022.3.1 → 2022.4.1 was to use a different Java runtime on the RIO due to synchronization deadlocks. I don't remember all the details of that deadlock issue; we experienced it as not getting controller input, so it may be possible that you're seeing it in some other way here in autonomous. I don't know whether WPILib was using formalized locks or types, or what was going on under the hood, but I do know for certain that our code hasn't deadlocked again since we updated.

Secondly, did you have anything to double-check the coherence of what you were seeing at the competition?

Meaning, autoInit was adding routines to your mCommands instance, but did you ever verify that you saw the proper number of routines in the constructed auto sequence?

I believe we disable all the threaded services for competition, but we can certainly review that, and we will definitely upgrade to 2022.4.1. Students can probably provide more info here, as I am unfamiliar with that part of the code.

We log when sequential routines transition, and the sequence of events and timing looked correct in the DS logs. We also confirmed that the DrivePathRoutine started and ended as expected during one of the matches where auto failed.

We had a similar problem with our drivetrain at a competition a few weeks ago, and we discovered that it usually occurred when we let the robot sit for a long time between powering it up and enabling autonomous or teleop. In practice, we almost always enabled the robot immediately after booting, but field setup was taking a long time at the competition, and the robot usually sat for a while before the match started.

We haven't had a chance to determine the root cause, but we thought there may be some sort of timeout occurring. We found what seems to be an effective workaround: constantly updating our drive motors with 0 velocity from within disabledPeriodic(). We also removed a color sensor that was using the I2C bus, and that has eliminated a number of odd behaviors.
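
The workaround is only a few lines; here is a minimal sketch, assuming Talon FX drive motors and placeholder CAN IDs (your hardware and control mode may differ).

```java
// Minimal sketch of the disabledPeriodic() keep-alive workaround described above.
// Talon FX drive motors and the CAN IDs are assumptions/placeholders.
import com.ctre.phoenix.motorcontrol.ControlMode;
import com.ctre.phoenix.motorcontrol.can.TalonFX;
import edu.wpi.first.wpilibj.TimedRobot;

public class Robot extends TimedRobot {
  private final TalonFX leftLeader = new TalonFX(1);  // placeholder CAN IDs
  private final TalonFX rightLeader = new TalonFX(2);

  @Override
  public void disabledPeriodic() {
    // Output stays neutral while disabled; this just keeps fresh control frames
    // flowing to the drive controllers while the robot sits on the field.
    leftLeader.set(ControlMode.Velocity, 0.0);
    rightLeader.set(ControlMode.Velocity, 0.0);
  }
}
```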

It’s possible that the addition of the disabledPeriodic() code was coincident with whatever change really addressed the problem, but I thought I’d share in case it helps.

Joe, thanks for sharing; that could certainly be causing our problem as well. We will investigate leaving the robot on for longer periods of time and post an update.
