Severe (>100%?) packet loss

We’ve been facing a myriad of obnoxious robot gremlins, most notably severe packet loss between the robot and the DS.

Our “standard” networking setup involves the following devices:

  1. Radio
  2. RIO (v1)
  3. Limelight v3
  4. Ethernet panel mount
  5. TP-Link FIRST Choice network switch
  6. Monoprice Micro SlimRun Cat6 cables

The radio, RIO, and network switch are the exact same devices used on our 2022 competition robot, which did not experience any of these issues. Our switch and radio are powered by a CTRE VRM connected to the REV PDH; the radio uses a REV PoE injector.

We typically connect wirelessly; the panel mount is mainly used at competitions.

We began to notice during testing that our operator would press a joystick button to set the elevator setpoint, then nothing would happen for ~3 seconds, and then the elevator’s setpoint would change. But this doesn’t happen every time. We went into the DS logs and were greeted with this beautiful graph:

I have approximately 50 more logs of the exact same nonsense if anyone would like the raw files.

As you can see, we seem to have gotten 120% packet loss, which I'm not sure is even possible unless there are retries going on that are also getting dropped.

We are experiencing what we believe to be unrelated swerve module issues, where offsets are not being initialized correctly. We suspect CAN frames are being dropped (we are at roughly 70% utilization) and the message that sets the Spark's initial encoder value is not being received in time, or something. I am mentioning this in case it is somehow the same issue propagating elsewhere, but we think it's totally separate and can be fixed by lowering CAN utilization (which we will likely get to tomorrow).
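For the CAN side, lowering utilization would mostly mean slowing down the Spark MAX periodic status frames we don't read every loop. A rough sketch of what that looks like (assuming the 2023 REVLib API; the periods here are placeholders, not tuned values):

    import com.revrobotics.CANSparkMax;
    import com.revrobotics.CANSparkMaxLowLevel.PeriodicFrame;

    public final class SparkFramePeriods {
      private SparkFramePeriods() {}

      // Slows the periodic status frames on a Spark MAX whose data we only
      // poll occasionally. Periods are in milliseconds and purely illustrative;
      // frames you actually read every loop should stay fast.
      public static void reduceStatusTraffic(CANSparkMax spark) {
        spark.setPeriodicFramePeriod(PeriodicFrame.kStatus0, 100); // applied output, faults
        spark.setPeriodicFramePeriod(PeriodicFrame.kStatus1, 40);  // velocity, voltage, current
        spark.setPeriodicFramePeriod(PeriodicFrame.kStatus2, 40);  // position
      }
    }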

We use NT quite heavily due to AdvantageKit.
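For context, a typical 2023 AdvantageKit setup mirrors every logged field to NetworkTables in addition to writing the on-robot log file, which is why the NT traffic adds up. Ours looks roughly like this sketch (not our exact code; class names assume the 2023 AdvantageKit API):

    import org.littletonrobotics.junction.LoggedRobot;
    import org.littletonrobotics.junction.Logger;
    import org.littletonrobotics.junction.networktables.NT4Publisher;
    import org.littletonrobotics.junction.wpilog.WPILOGWriter;

    public class Robot extends LoggedRobot {
      @Override
      public void robotInit() {
        Logger logger = Logger.getInstance();
        logger.addDataReceiver(new WPILOGWriter("/media/sda1/")); // .wpilog file on a USB stick
        logger.addDataReceiver(new NT4Publisher());               // every field also goes out over NT4
        logger.start();
      }
    }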

We have taken a few steps to debug this.

  1. We connected directly to the switch via the panel mount. This lowered packet loss to mostly 0%, with occasional 5-10% spikes. I do not know if this is normal.
  2. We took the network switch, Limelight, and panel mount out of the equation by connecting the RIO directly to the radio's PoE injector. Packet loss still occurred, even while idling and disabled.
  3. We deployed an empty robot project (command-based template, advanced). Still packet loss.
  4. We connected the RIO to the radio's second port, so that it does not use the PoE injector for comms. Still packet loss.
  5. We swapped out to a different radio. Still packet loss.
  6. We swapped to a different driver station. Still packet loss.
  7. We moved to a new building. And then did it again. Still packet loss.
  8. We swapped to different Ethernet cables. Still packet loss.
  9. We connected from the DS straight to the RIO, without the radio involved at all; no packet loss.

We are running the latest firmware and versions of everything: latest WPILib, latest GradleRIO, latest AdvantageKit, latest DS, etc. This is not a laptop issue, as we are using the same DS laptop we did last year without any problems at all, and we've disabled the firewall and other likely culprits.

We are getting packet loss in the same room as another team’s robot that is not experiencing any of this.

We are completely out of ideas and are getting rather nervous about connectivity problems on a real field in week 1. Help would be appreciated.

Code, in case it's some dumb code issue: GitHub - FRC2713/Robot2023

4 Likes

I would use a tool like Wifi Analyzer to look at the channel the radio is on and other networks around that channel.

1 Like

I’ve been working with Justin. This seems unlikely given #7 but it’s where I went originally too.

1 Like

Can you post some of the logs (both .dslog and .dsevents)?

For #9, is this with an Ethernet cable or USB?

2 Likes

It depends on whether the different buildings are really independent. For example, if they were all at the same school with the same Wi-Fi mesh network setup, that wouldn't necessarily be a valid test.

1 Like

You're right. It was a different build space shared with another team that, as I understand it, was not seeing these issues.

ds logs.zip (3.3 MB)

Somewhat unorganized, sorry. Thursday's logs have a whole lot, but they're there in Monday's too. #9 is with Ethernet.

It's happened:

  1. In our shop at our school, which technically is in range of access points, but the signal is weak enough that nobody actually has a functional internet connection. It's a tiny concrete box.
  2. In a mentor's garage in a very residential area without that many APs around.
  3. In a different school, 10 ft away from another team's robot which was not having said issues.

We can try Wifi Analyzer tomorrow.

2 Likes

Maybe log the traffic with Wireshark?

3 Likes

Do you have any DS logs from when you ran with a blank project? A few observations from the logs you sent:

  1. You have a lot of text warnings/errors/logs being sent. I've read in a few places (e.g. 1, 2, 3) that string concatenation is slower this year in Java, or something along those lines. It may just be confirmation bias, but I definitely feel like it is worse this year for whatever reason.
  2. Your free memory on the RIO is quite low in all your data.
  3. The USB camera messages keep being printed; is this expected?
  4. 8:31:44 on 02_23_21_30_40 shows DS laptop CPU at 98%. This data is not at all granular enough for any kind of correlation, but it may be something to peek at?

No idea if any of the above are related to your issues, but I have seen funky things with network traffic and either CPU usage or lots of things being printed to the DS. It’s likely worth cleaning up the various errors either way, and it may reduce some of the issues you’re seeing as a happy side effect?

2 Likes

When you took the switch out of the equation, did you disconnect power to the switch? Wondering if you have too much on the VRM, or maybe the VRM is not functioning correctly.

2 Likes

I don’t have logs for an empty project but I can get them tomorrow and post.

  1. There are some stale CAN messages that we induced in the later logs; they're there because we lowered some CAN frame timings but are still logging more than we need, so accessing stale data (in order to log it) prints the error (or at least that's my understanding). We plan on cleaning that up soon, but the packet loss was happening before that anyhow. We do a decent amount of string concatenation, but... I don't know. I feel like concatenating Strings shouldn't be causing issues unless you're running 1950s hardware.

  2. I don't actually know where to see the RAM in the logs, or DS CPU. I see RIO CPU, but not DS.

  3. The USB camera thing is silly; I forgot we had that line in there. We can remove it, since we don't have a camera aside from the Limelight. Leftover code from last year, I guess.

We can try a new VRM on Monday. We left the power to the switch on.

It gets logged periodically to the event list like this:

[screenshot of DS event list entries]

But during a test, just look at Task Manager to get better data.

2 Likes

Yes, it's different. Java 17 removed the workaround we used to use to speed this up (-Djava.lang.invoke.stringConcat=BC_SB). We added a compiler option (-XDstringConcat=inline) to try to get back to baseline performance, but it may not behave exactly the same in all situations, or it may make some situations worse than last year in order to avoid the worst worst case.

Unfortunately, without the above option you can get hundreds of ms of delay on the first call to a long string concatenation (later calls don't have the slowdown).
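As a rough illustration (the method and variable names here are hypothetical, not from the linked code): a '+' concatenation compiles to an invokedynamic call site by default in Java 17, and the first hit on each such call site pays the bootstrap cost, whereas an explicit StringBuilder chain avoids it, which is roughly what the inline strategy generates for you.

    public final class ConcatExample {
      public static String slowFirstCall(double setpoint, long elapsedMs) {
        // Default Java 17 behavior: this '+' chain becomes an invokedynamic call site,
        // and its very first invocation can spend a long time bootstrapping.
        return "Elevator setpoint " + setpoint + " reached after " + elapsedMs + " ms";
      }

      public static String noBootstrap(double setpoint, long elapsedMs) {
        // Explicit StringBuilder, roughly what -XDstringConcat=inline emits:
        // no invokedynamic bootstrap, so no first-call spike.
        return new StringBuilder("Elevator setpoint ")
            .append(setpoint)
            .append(" reached after ")
            .append(elapsedMs)
            .append(" ms")
            .toString();
      }
    }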

5 Likes

Unfortunately, free memory is not what we (or the DS) should be looking at or logging; it should be looking at available memory. Free memory treats memory consumed for caching as "used", so it's a pretty useless metric, whereas available memory is the true measure of how much memory applications can still allocate (because the caches can be discarded). I forgot to ask NI to change it this year; hopefully we'll remember to change this for 2024.
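If someone wants to watch the more useful number from robot code in the meantime, the roboRIO's Linux kernel exposes it in /proc/meminfo. A minimal sketch (assuming a kernel new enough to report MemAvailable, which recent roboRIO images are):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public final class MemCheck {
      // Returns MemAvailable in kB from /proc/meminfo, or -1 if the field is missing.
      public static long availableKb() throws IOException {
        for (String line : Files.readAllLines(Path.of("/proc/meminfo"))) {
          if (line.startsWith("MemAvailable:")) {
            // Line looks like: "MemAvailable:     123456 kB"
            return Long.parseLong(line.replaceAll("\\D+", ""));
          }
        }
        return -1;
      }
    }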

3 Likes

I have seen DS laptop slowdown cause packet loss and control lag. Try quitting all other apps (including Shuffleboard) to see how that affects things.

1 Like

Something I don't see on your list of things tried: switching the radio to the other frequency band. Our school shop had terrible packet loss on 2.4 GHz but is not too bad on 5 GHz. The loss pattern was quite different from what you're seeing, though: the loss rate was constantly high rather than peaking so dramatically.
But it's one more fairly simple test you can do, and if it didn't make a difference it would probably point you away from radio interference as the most likely cause of the problem.

1 Like

I’m once again asking for a consolidated place to provide feedback on the DS that is taken seriously. The NI forums aren’t it.

7 Likes

This may be a long shot, but did you try swapping out the Rio?

2 Likes

Assuming you didn't make any modifications before deploying this code to the RIO, I think this eliminates your code as the major contributing factor.

I would treat this as a hardware/firmware issue as my number one troubleshooting priority. You said you swapped out cables, the radio, and the switch, but you didn't mention whether you re-flashed or swapped out the roboRIO.

The common intersection across all your tests, as I read them, is the same roboRIO.

We haven’t yet, but it’s on the to-do list for today.