Networking fault during competition

Hi folks, one of the mentors from 488 here

During our second quarterfinals match at the PNW DCMP, we were disabled for about 10 s due to a connectivity loss between the roboRIO and the DS. We’re trying to get to a full root cause of what happened so we can make sure it doesn’t happen again, but I’m a little stumped by the logs here.

The logs from our RIO pretty clearly indicate that for 10s during the match, the RIO lost connection with the DS, but still had a connection to the FMS.

[no DS, FMS, disabled, Elimination 7 1] ... | Disabled init (Red3, 63.0s, DS disconnected, FMS connected, Is disabled: true, Is enabled: false, Is auto: false, Is teleop: true, Is test: false, Is browned out: false, Is output enabled: false, Battery voltage: 11.800892333984375)

On the DS, we see agreement that the DS cannot connect to the RIO:

Warning 44004 FRC: The Driver Station has lost communication with the robot. Driver Station
Ping Results: link bad, DS radio(.4) GOOD, robot radio(.1) GOOD, roboRIO(.2) bad, FMS-GOOD...
Warning 44003 FRC: No robot code is currently running. Driver Station

After that match, we had a CSA go through the driver station logs with us, and he was very helpful in explaining to us the DS logs, though he didn’t have the RIO logs available to him at the time.

The CSA’s suspicion was that there was a network wiring fault between the RIO and the radio on the robot, which makes sense given the DS logs, but warning 44003 didn’t make sense: we’ve got one of the slowest robot code boot times in the league (if not the slowest), and 10 seconds of disable was far too short for the code to have restarted.

I will also note that the drive team observed after the match that the Ethernet cable fell out of our drive laptop when they picked it up, suggesting that connection could have been loose.

Now that we’ve also looked at the logs from the RIO itself, we’ve confirmed that the robot code was in fact running the whole time and didn’t restart.

Is it a known issue for the DS to erroneously report warning 44003 when a component loses network connectivity?

How could both the RIO and DS continuously report the ability to connect to the FMS but not each other? (I have some theories for this one, but I’m curious if other teams have seen similar situations.)


Could you upload the .dslog and .dsevents files from the log file directory on your DS machine? It’d be much easier to review the actual file than just go off your textual description.


Where is that “FMS Connected” message in your Rio logs coming from? Is that something you’ve implemented yourself?

My experience has been that 44003 means no robot code reachable, i.e. similar to a red “robot code” indicator in the DS. If you have confirmation that the code didn’t stop running, I wouldn’t give much weight to the message itself.

Your DS log snippet doesn’t include timestamps (which the log files would have), but if the “Ping results” message is close to the same time as the 44003 message, I suspect the “no robot code” message was caused by the same event that caused the “link bad”/“roboRIO bad” status updates.

Presuming you’re using the DriverStation class’s isFMSAttached() method: this doesn’t reflect the live state of the connection, but whether the DS is in FMS-connected mode (i.e., it’s latched). The roboRIO doesn’t communicate directly with the FMS; all communication goes through the DS.
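To make the latching behavior concrete, here’s a minimal, self-contained sketch in plain Java. This is an illustrative `LatchedFlag` class I made up to demonstrate the idea, not actual WPILib source:

```java
// Illustrative sketch of a latched flag, analogous to how a "FMS connected"
// indicator can keep reading true after the underlying link has dropped.
public class LatchedFlag {
    private boolean latched = false;

    // Called with each status update; once true, it stays true.
    public void update(boolean fmsMode) {
        if (fmsMode) {
            latched = true;
        }
    }

    // Keeps returning true even after updates stop reporting a connection.
    public boolean isAttached() {
        return latched;
    }

    public static void main(String[] args) {
        LatchedFlag fms = new LatchedFlag();
        fms.update(true);   // DS connects in FMS mode
        fms.update(false);  // link drops mid-match
        System.out.println(fms.isAttached()); // still prints "true"
    }
}
```

If what you actually want to log is live connectivity to the DS itself, `DriverStation.isDSAttached()` may be the better signal to record alongside the FMS flag.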


Do you have any devices between the roboRIO and the radio (like a network switch)? Is the Ethernet cable in the radio (for roboRIO comms) in the first port (immediately beside the barrel connector port)? These would probably be the most “obvious” physical things to check.

The robot logs come from the base library that we built up over the past decade, in particular:

All of the values effectively come from edu.wpi.first.wpilibj.DriverStation. We’ve then got log4j logging to file.

The robot log for that match is here: https://1drv.ms/t/s!Av5V1v8uFIL0gbgXK6sQNrOmyTfvgQ, 219 is where we get disabled. (Yes, I’m aware that most of that file is useless from a logging perspective :smiley: )

Apologies for the lack of timestamps and attachments here from the DS, I was typing based on some pictures of logs that we took at competition while in a rush to our next match. I’ll have access to the drive laptop again on Tuesday to pull DS log files, but right now it’s packed away in a trailer.

It’s concerning that your Ethernet cord would fall out. Many teams have found that the built-in Ethernet socket on drive laptops becomes unreliable (especially the “fold-down” type), and they switch to using USB Ethernet adaptors.

Too late now, but next time a cord falls out, they should check that the Ethernet cord supplied in the Driver Station is in good condition and has a functional locking tab. If not, the field staff should replace it.

Also, how is the roboRIO connected to the radio? A number of people have raised concerns about the second port being unreliable. The roboRIO should either be on the first port (perhaps with PoE), or on a switch connected to the first port.

Also, from the log you posted, I see multiple references to brownout. Reviewing the DriverStation logs will give much more information about that.

We do have a network switch between the RIO and the radio. In fact, we had to re-wire this during the event, after inspection noted that we were using the DC barrel jack and a PoE injector in parallel to power the radio, in a way which was illegal. (The nuance here is that the PoE injector was 18 V and the barrel jack is 12 V; apparently the two power supplies are connected internally in the radio only by a fuse, so mixing these voltages could be problematic. R616 appears to be the relevant rule.) Given that we did have to adjust the wiring here, it’s entirely plausible that the cable wasn’t securely plugged in.

So with a switch in the radio-RIO link, one of two issues could be popping up.

  1. If the switch loses power (like you described from a loose barrel connector) and takes around 10 s to reboot, that would manifest as losing the RIO for around 10 s. If it takes much longer, though, that wouldn’t be the issue.
  2. If (for some reason) the switch deprioritizes RIO-radio traffic in favor of other communications, that could also show the behavior you saw. That’s harder to check, and also less likely, especially with an unmanaged switch.

I’ve seen a few cases of the first issue popping up this season, so that’s now my go-to check, especially if the RIO comes back in under 20 seconds.
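If the switch is a suspect, one cheap diagnostic is to probe several devices behind it and log which ones drop at the same time. Here’s a hedged sketch in plain Java using `InetAddress.isReachable`; the `LinkProbe` class, host addresses, and timeouts are all illustrative assumptions, not anything from your setup:

```java
import java.net.InetAddress;

// Hypothetical link-health probe: periodically check devices behind the
// switch. If the radio and a camera on the same switch vanish together,
// the switch itself (power loss or reboot) is the likely culprit; if only
// one device drops, suspect that device's cable.
public class LinkProbe {
    // Returns true if the host answered within timeoutMs, false otherwise.
    public static boolean reachable(String host, int timeoutMs) {
        try {
            return InetAddress.getByName(host).isReachable(timeoutMs);
        } catch (Exception e) {
            return false; // resolution failure or I/O error counts as down
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder addresses: substitute your radio, camera, etc.
        String[] hosts = {"10.4.88.1", "10.4.88.11"};
        for (int round = 0; round < 3; round++) { // a few rounds for demo
            for (String host : hosts) {
                System.out.println(host + " reachable: " + reachable(host, 500));
            }
            Thread.sleep(1000); // probe roughly once per second
        }
    }
}
```

One caveat: `isReachable` may fall back to a TCP probe when the JVM isn’t permitted to send ICMP, so results can vary by platform; a timestamped log of drops is the useful part, not the probe mechanism.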


I’m not sure what setup you had that the RI objected to, but Q92 says that R616 does not preclude redundant power. Having said that, my understanding is that only one power supply is used at a time by the radio hardware, and switchover is far from instantaneous, so redundancy does not buy you much.

But let’s investigate the brownout first.

Redundant power using the REV Radio Power Module for PoE and a VRM for the barrel jack is explicitly against the rules. This is because the RPM outputs 18 V over PoE, while the VRM outputs 12 V. Internally in the radio, the PoE power and barrel jack are bridged, so you’d be backfeeding 18 V into the 12 V VRM, which could cause damage.

Using a PoE injector powered directly from the VRM, along with the barrel jack, is perfectly fine, and is what that Q&A allows. But that only applies to a passive PoE injector, not the REV radio module.


It says redundant power is ok when the VRM is providing both power sources. That’s not what the OP described, since there were multiple voltages mentioned.

Thanks folks. From the feedback here, it sure sounds like we should be focusing on the robot’s connection to the radio rather than on the DS itself, and that I misinterpreted the robot-side report suggesting it still had a connection to the FMS.

I wasn’t out on the field during quarters, or I would have come and taken a look directly. With a switch in the path and a roughly 10-second failure, I’m inclined to think the switch temporarily dropped out. Some switches take time on boot to rebuild their forwarding tables. Did you happen to lose connection to any cameras hooked up to the switch, or do you have any logs that would show whether you were getting updates from other devices on the switch?

We don’t have this today. It sounds like the latest WPILib can log all changes to NetworkTables now, which could have provided this in a roundabout way, but we didn’t have the resources to validate a new build of WPILib this late in the season (and for the sake of our drive team’s sanity, we’re code-frozen except for critical fixes for Houston).

Are you certain they’re bridged? I’ve heard from multiple people (at least one of whom I trust to know these sorts of things) that they’re separate internally. I haven’t looked myself though.

I can’t answer the question of whether the barrel jack and low-voltage PoE are bridged. But the two PoE connections are not bridged, since one expects 48 V and the other <24 V. Maybe that is where some of the confusion is coming from.

Yeah, I knew about those two, but I’ve definitely heard conflicting answers about the barrel jack and the passive PoE port, both from typically well-informed sources.