Robot Keeps Losing Comms and Code (Necro)

Since my original post has been locked, I’m necroing it here.

As far as I’m aware, a solution to this problem was never discovered. We have been able to work around the problem during our IR@H, but may not be able to for offseason comps.

Whoops, here’s another necro

Let me know if you find anyone willing to pick this up and run with it. I’ve reported it to NI as a bug because I’m unable to open a service request (and just learned the case was labeled as “created in error” and dropped, AFAIK). I’ve left comments in the FIRST blog (specifically, the one reporting a “new” control system is on the horizon), and I’ve messaged an individual here (Delphi) who I thought could help, but all to no avail.

We first saw this last season, in our only regional event, and while we were later convinced it had been happening during practice, it actually only bit us once while competing. When Covid shut everything down, the thread you mentioned dried up.
When our team was able to reboot, I remembered this issue and started looking for answers. That was late last year, and at this point I've about given up. We just competed in a local event (brand-new roboRIO, new bot, new code) and lost comms in four different matches. The only thing we've been able to pin down relates to the update rate on our navX: at the suggested 100 Hz rate we see sporadic "hangs," but lowering it to 30 Hz seems to make the problem go away.

I'm skeptical that this is the root of the issue, because the build we ran last year didn't use a navX at all (or anything else on SPI, I2C, USB, or even MXP). We DID use a Limelight, however, and I don't know whether that contributed. It's possible, I suppose, as I think this will eventually be identified as a conflict somewhere within the communication protocols. The necroed thread above has posts suggesting color sensors, cameras, and the aforementioned navX as all being suspect. Since we've used the navX for years without issue, I'm more inclined to think something changed in the NI update, as we had never experienced this before.
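For anyone who wants to try the same update-rate workaround, the rate is set in the navX constructor. A minimal Java sketch, assuming the Kauai Labs navx-frc library and an MXP SPI connection (the exact constructor overloads may differ between library versions):

```java
import com.kauailabs.navx.frc.AHRS;
import edu.wpi.first.wpilibj.SPI;
import edu.wpi.first.wpilibj.TimedRobot;

public class Robot extends TimedRobot {
  private AHRS navx;

  @Override
  public void robotInit() {
    // The second argument is the sensor update rate in Hz.
    // 100 Hz showed sporadic hangs for us; 30 Hz has not (so far).
    navx = new AHRS(SPI.Port.kMXP, (byte) 30);
  }
}
```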

I’ll check back now and then to see how it’s coming. Something tells me we haven’t seen the last of this problem.

Mark

Huh. That would be very strange if true.

From here, it sounds like the answer for now is, in part: don't use the non-MXP (onboard) I2C port?

Just checked the case again; the status now reads "customer comment", which makes sense after I questioned the "created in error" label yesterday. Point being, I have no idea whether NI is looking into this or not.

The post you referenced only points to a problem with I2C, while overlooking the same "lockups" that occur when the I2C protocol isn't in use. We don't use it, and never have, yet we are seeing the Rio hang. We've found that what we're experiencing is "affected by" the navX update rate, and those comms go through the MXP SPI port. (I say "affected by" because last year we witnessed the exact same situation and did not have a navX on board, just the Limelight.)

This issue goes beyond Lidar lockups, and posts suggesting it has been fixed are simply not 100% correct.

Mark


Mark,
I agree with you that the problem is probably not strictly I2C-related. Because the only reproducible test case so far uses I2C, there is a tendency to focus on that. And indeed, as I noted in the other threads, avoiding attempts to access I2C devices that are not present seems very prudent and for some of us has reduced (or eliminated) the lockups. But there are anecdotes in the other threads of lockups when I2C is not being used.
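One way to put that "don't touch devices that aren't there" guideline into code is to probe the address once at startup and skip the bus entirely if nothing answers. A rough sketch (the GuardedI2C wrapper is just an illustration, not part of WPILib):

```java
import edu.wpi.first.wpilibj.I2C;

/** Wraps an I2C device so we only poll it if it answered at startup. */
public class GuardedI2C {
  private final I2C device;
  private final boolean present;

  public GuardedI2C(I2C.Port port, int address) {
    device = new I2C(port, address);
    // addressOnly() returns true if the transfer was aborted,
    // i.e. nothing acknowledged at this address.
    present = !device.addressOnly();
  }

  /** Reads only if a device actually responded at init; returns true on success. */
  public boolean read(int register, byte[] buffer) {
    if (!present) {
      return false; // skip the bus entirely if the device is absent
    }
    return !device.read(register, buffer.length, buffer);
  }
}
```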

If, once you get access to your 'bot again, you can document in detail the devices and buses you are using, that might be helpful. And, as we found with I2C, it is also important to note devices that the code attempts to access but that may not actually be present.
Even better would be code that reproduces the lockup without using I2C. It would also be helpful to know whether you observe the lockups on more than one roboRIO, and whether they seem to occur with the same frequency on each. For us, running the I2C stress-test program, we have seen roboRIOs that lock up almost instantly (within a handful of seconds), ones that take minutes to lock up, and others that never lock up at all, all running the same code in the same hardware configuration.
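For anyone who wants to try reproducing this, a stress test along these lines is one way to exercise the onboard port continuously (a sketch only, not the exact program referenced above; 0x52 is used here just as an example address, the REV Color Sensor V3's, and whether a device is actually attached is part of what you'd vary):

```java
import edu.wpi.first.wpilibj.I2C;
import edu.wpi.first.wpilibj.TimedRobot;

public class Robot extends TimedRobot {
  // Onboard (non-MXP) I2C port; 0x52 is only an example address.
  // Point it at an address with no device attached to exercise the
  // "device not present" case.
  private final I2C device = new I2C(I2C.Port.kOnboard, 0x52);
  private final byte[] buffer = new byte[4];

  @Override
  public void robotPeriodic() {
    // Hammer the bus every loop iteration (~50 Hz in a TimedRobot).
    // read() returns true when the transfer was aborted.
    boolean aborted = device.read(0x00, buffer.length, buffer);
    if (aborted) {
      System.out.println("I2C transfer aborted");
    }
  }
}
```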

I will also forward a link to this thread to the engineer at NI with whom I was discussing this throughout 2020 and early this year. He’s a FRC mentor and yes, things did die down as we got into 2021 as both of our attentions turned to robots for this year, but hopefully interest will pick up again now that competition is over for this year.

My team has run into the same issue.
Once every few minutes, the Driver Station will lose comms and robot code for 3-5 seconds. During this time, our robot code has not crashed; Shuffleboard is still running and is being updated with new values.

Interestingly (maybe worryingly), when the Driver Station disconnects, the robot can still be driven for a little while.
The timeline of events goes:

  1. The Driver Station is connected, and we can drive the robot in teleop (enabled).
  2. The Driver Station loses comms and robot code, but we can still drive the robot (the robot code is still running teleopPeriodic and the Driver Station is still sending joystick data to the robot).
  3. After about 5 seconds (long enough that someone watching the Driver Station can say "You're about to be disabled" to the driver), we can no longer drive the robot; it switches to disabled.
  4. After another 3-10 seconds, the Driver Station regains comms and robot code and we can press enable again.
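To put a timestamp on when the robot side actually notices the drop and the reconnect, something like the following can be logged from robot code. A rough sketch only; on 2021-era WPILib the check is DriverStation.getInstance().isDSAttached(), while newer versions expose it as a static method:

```java
import edu.wpi.first.wpilibj.DriverStation;
import edu.wpi.first.wpilibj.TimedRobot;
import edu.wpi.first.wpilibj.Timer;

public class Robot extends TimedRobot {
  private boolean lastAttached = true;

  @Override
  public void robotPeriodic() {
    // Log a timestamped message every time the DS connection state flips.
    boolean attached = DriverStation.getInstance().isDSAttached();
    if (attached != lastAttached) {
      System.out.printf("[%.2f] Driver Station %s%n",
          Timer.getFPGATimestamp(),
          attached ? "reconnected" : "disconnected");
      lastAttached = attached;
    }
  }
}
```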

This is not the same behavior being discussed in this thread. Here, the roboRIO completely locks up until it is manually reset or power-cycled.

There are a lot of threads that discuss intermittent communication failures like yours, with good ideas on how to interpret the Driver Station logs, things to try, etc. Or, if you start your own thread about the problem you're seeing, I expect you'll get a lot of direct help and guidance.
