Robot Keeps Losing Comms and Code (Necro)

Since my original post has been locked, I’m necroing it here.

As far as I’m aware, a solution to this problem was never discovered. We have been able to work around the problem during our IR@H (Infinite Recharge at Home) events, but may not be able to for offseason comps.

Whoops, here’s another necro

Let me know if you find anyone willing to pick this up and run with it. I’ve reported it to NI as a bug because I’m unable to open a service request (and just learned the case was labeled as “created in error” and dropped, AFAIK). I’ve left comments in the FIRST blog (specifically, the one reporting a “new” control system is on the horizon), and I’ve messaged an individual here (Delphi) who I thought could help, but all to no avail.

We first saw this last season, in our only regional event, and while we were later convinced it had been happening during practice, it actually only bit us once while competing. When Covid shut everything down, the thread you mentioned dried up.
When our team was able to reboot, I remembered this issue and began looking for answers. That was late last year, and at this point I’ve about given up. We just competed in a local event (brand-new Rio, new bot, new code) and lost comms in four different matches. The only thing we’ve been able to figure out relates to the update rate on our navX: if we use the suggested 100 Hz rate we see sporadic “hangs,” but lowering it to 30 Hz seems to make the problem go away.

I’m skeptical that this is the root of the issue, since the build we ran last year didn’t even use a navX (or anything else on SPI, I2C, USB, or even MXP). We DID use a Limelight, however… I don’t know whether it caused the problem or not. It’s possible, I suppose, as I think this will eventually be identified as a conflict somewhere within the communication protocols. The necroed thread above has posts suggesting color sensors, cameras, and the aforementioned navX as all being suspect. Since we’ve used the navX for years without issue, I’m more inclined to think something changed in the NI update, as we have never experienced this before.
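For anyone who wants to try the same experiment in Java: as I understand it, the update rate is just an argument to the AHRS constructor in the Kauai Labs navx-frc library. We’re a LabVIEW team, so treat this as a sketch of the change, not our actual code.

```java
// Sketch only - illustrating the 100 Hz -> 30 Hz workaround in Java.
// (We run LabVIEW, so this is an illustration, not our code.)
import com.kauailabs.navx.frc.AHRS;
import edu.wpi.first.wpilibj.SPI;

public class GyroSetup {
    // The suggested 100 Hz rate gave us sporadic hangs; dropping the
    // second constructor argument (the update rate, in Hz) to 30 made
    // them go away for us.
    private final AHRS navx = new AHRS(SPI.Port.kMXP, (byte) 30);

    public double getHeadingDegrees() {
        return navx.getAngle();
    }
}
```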

I’ll check back now and then to see how it’s coming. Something tells me we haven’t seen the last of this problem.

Mark

Huh. That would be very strange if true.

From here - sounds like the answer for now is, in part, don’t use the non-MXP I2C port?

Just checked the case again; the status now reads “customer comment” - appropriate, since I questioned the “created in error” label yesterday… point being, I have no idea if NI is looking into this or not.

The post you referenced only points to a problem with I2C, while neglecting the same “lockups” when the I2C protocol isn’t used. We don’t use it, and never have, but we are seeing the Rio hang. We’ve discovered that what we’re experiencing is “affected by” the navX update rate, and comms there go through the MXP SPI port. (I say “affected by” because last year we witnessed the exact same situation and did not have a navX on board, just the Limelight.)

This issue goes beyond LiDAR lockups, and posts suggesting it’s been fixed are simply not 100% correct.

Mark


Mark,
I agree with you that the problem is probably not strictly I2C-related. Because the only reproducible test case so far uses I2C, there is a tendency to focus on that. And indeed, as I noted in the other threads, avoiding attempts to access I2C devices that are not present seems very prudent and, for some of us, has reduced (or eliminated) the occurrence of lockups. But there are anecdotes in the other threads of lockups when I2C is not being used.

If, once you get access to your ’bot again, you can document in detail the devices and buses you are using, that might be helpful. And, as we found with I2C, it is also important to note devices that the code is attempting to access that may not actually be present.
Even better would be code that reproduces the lockup without using I2C. It would also be helpful to know whether you observe the lockups on more than one roboRIO and whether they seem to occur with the same frequency on each. For us, some Rios running the I2C stress-test program lock up almost instantly (within a handful of seconds), some take minutes to lock up, and others never lock up at all - all running the same code in the same hardware configuration.
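For anyone who wants to put together a similar stress test, the general shape is just hammering reads at an I2C address in a tight loop. Here is a rough Java sketch (the address and register are placeholders I made up, not values from the actual test program):

```java
// Rough sketch of an I2C stress loop - NOT the actual test program from the
// other thread. The device address and register are placeholders; the point
// is simply to keep traffic on the bus, including to a device that may not
// actually be present.
import edu.wpi.first.wpilibj.I2C;
import edu.wpi.first.wpilibj.TimedRobot;

public class Robot extends TimedRobot {
    private static final int PLACEHOLDER_ADDRESS = 0x52; // hypothetical device
    private final I2C device = new I2C(I2C.Port.kOnboard, PLACEHOLDER_ADDRESS);
    private final byte[] buffer = new byte[6];
    private int aborts = 0;

    @Override
    public void robotPeriodic() {
        // WPILib's I2C.read() returns true if the transfer was aborted, which
        // is what you see when nothing answers at that address.
        if (device.read(0x00, buffer.length, buffer)) {
            aborts++;
        }
    }
}
```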

I will also forward a link to this thread to the engineer at NI with whom I was discussing this throughout 2020 and early this year. He’s an FRC mentor and, yes, things did die down as we got into 2021 as both of us turned our attention to robots for this year, but hopefully interest will pick up again now that competition is over.

My team has run into the same issue.
Once every few minutes, the Driver Station will lose comms and robot code for 3-5 seconds. During this time our robot code has not crashed; Shuffleboard is still running and is being updated with new values.

Interestingly (maybe worryingly), when the Driver Station disconnects, the robot can still be driven for a little bit.
The timeline of events goes:

  1. The Driver Station is connected | We can drive the robot in teleop, enabled.
  2. The Driver Station loses comms and robot code | We can still drive the robot (the robot code is running teleopPeriodic and the Driver Station is still sending joystick data to the robot).
  3. After about 5 seconds (long enough that someone watching the Driver Station can say “You’re about to be disabled” to the driver), we can no longer drive the robot (the robot switches to disabled).
  4. After another 3-10 seconds, the Driver Station regains comms and robot code and we can press enable.
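If it helps anyone line this up against the Driver Station log, a few lines in robotPeriodic can timestamp these transitions from the robot’s point of view. This is only a sketch, not our actual code:

```java
// Sketch: log DS attach/enable transitions from the robot side, to compare
// against the Driver Station log timeline. Not our actual code.
import edu.wpi.first.wpilibj.DriverStation;
import edu.wpi.first.wpilibj.TimedRobot;
import edu.wpi.first.wpilibj.Timer;

public class Robot extends TimedRobot {
    private boolean lastAttached = true;
    private boolean lastEnabled = false;

    @Override
    public void robotPeriodic() {
        // DriverStation.getInstance() is the 2021 WPILib form; newer WPILib
        // exposes these as static methods on DriverStation.
        boolean attached = DriverStation.getInstance().isDSAttached();
        boolean enabled = isEnabled();
        if (attached != lastAttached || enabled != lastEnabled) {
            System.out.printf("%.3f: DS attached=%b, enabled=%b%n",
                    Timer.getFPGATimestamp(), attached, enabled);
            lastAttached = attached;
            lastEnabled = enabled;
        }
    }
}
```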

This is not the same behavior that is being discussed here. The behavior here is that the roboRIO completely locks up until it is manually reset or power-cycled.

There are a lot of threads that discuss intermittent communication failures like yours, with good ideas on how to interpret the Driver Station logs, things to try, etc. Or, if you start your own thread about the problem you’re seeing, I expect you’ll get a lot of direct help and guidance.


Apologies for the length of time for this reply - now that the season has officially ended, I’ve discovered that there IS life beyond robotics! First and foremost, I want to reassure those reading this thread that I mean no disrespect to the good folks at NI. Their response to my inquiries suggested I pose the question to the FRC support forum, thus explaining the notation of my report being “filed in error”.

To clarify, if I’m able, a few details that may or may not be pertinent: We are strictly a LabVIEW team - past, present, and foreseeable future. We have only seen this problem when using the navX gyro via the MXP expansion port, over SPI. We’ve been able to replicate the problem on multiple (at least three) Rios, but I can’t speak to the frequency or timing on each. We, too, have found a way to avoid the issue, but the workaround could interfere with the functionality of the navX, so we’re hopeful the root cause can be discovered and a proper fix put in place.

The weeds - I can almost guarantee that the program(s) used were not calling “ghost” devices, as that’s something we would have recognized and corrected. That said, the Rios affected were running various other devices, some no doubt creating longer processing times than others, but all were affected by altering the refresh rate of the navX. We have learned that the navX was updated a few years back (now navX v2), using a chipset that allows faster processing and update rates, and we did not change the suggested rate until this year (from 50 Hz to 100 Hz). There are unanswered questions around the effect this faster update rate might have on the “older” chipsets. (One of our mentors has been in communication with Kauai Labs, developer of the navX, and he reports that they are interested in finding the cause as well.)

Our builds, for the Rios that exhibited lockups:

March 2020 - older Rio, with prior usage
a) 11 motor controllers (CAN-based, Talon SRX/Victor SPX combinations)
b) 2 Grayhill encoders, using the encoder port on the Talons
c) pneumatics control module (compressor, three solenoids - CAN-based)
d) proximity sensor (DIO)
e) Limelight (Ethernet, via a separate switch)
f) navX (MXP)

March 2021 - new, unused Rio
a) 7 motor controllers (6 CAN-based, 1 PWM - 3 Talons, 3 Victors, 1 Talon SR)
b) 2 E4P encoders (DIO)
c) 1 potentiometer (analog)
d) 2 limit switches, wired to a Talon SRX
e) navX (MXP)

April 2021 - older Rio
Configuration unknown (a mentor tested his own at home - same results)

So, in this long-winded post, these are the details I can provide at present. In attempting to replicate the issue, I would suggest a setup with only the Rio and the navX, perhaps starting at the maximum update rate (200 Hz). Hopefully this will produce the results we’ve experienced and further the search for a solution. Again, I don’t deny the discoveries related to the I2C protocol or the solutions therein; I merely hope to assist those who may be experiencing the same issue under similar circumstances. I would also note that we’ve used this configuration numerous times in the past (prior to 2020, never at a rate above 50 Hz) without issue, which initially led me to believe it could be related to the yearly update required by FIRST and distributed by NI.
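For Java teams wanting to try that minimal setup, I’d expect it to amount to nothing more than constructing the AHRS at the maximum rate and polling it. We’re a LabVIEW shop, so take this as a sketch of the idea rather than a verified reproduction:

```java
// Sketch of the suggested minimal repro: a bare Rio + navX on the MXP SPI
// port at the maximum update rate. We're a LabVIEW team, so this Java is an
// illustration of the idea, not a verified reproduction.
import com.kauailabs.navx.frc.AHRS;
import edu.wpi.first.wpilibj.SPI;
import edu.wpi.first.wpilibj.TimedRobot;
import edu.wpi.first.wpilibj.smartdashboard.SmartDashboard;

public class Robot extends TimedRobot {
    // 200 is the navX's maximum update rate; the constructor takes it as a byte.
    private final AHRS navx = new AHRS(SPI.Port.kMXP, (byte) 200);

    @Override
    public void robotPeriodic() {
        // Keep reading so the SPI traffic never stops, and watch for the Rio hanging.
        SmartDashboard.putBoolean("navX connected", navx.isConnected());
        SmartDashboard.putNumber("navX yaw", navx.getYaw());
    }
}
```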

I’ll keep an eye on this thread, and thanks to those who’ve offered suggestions. Keep plowing forward!

M~