Why you probably shouldn't use the second port on your OpenMesh OM5P Radio and should embrace using an Ethernet Switch instead

I noticed quite a few teams recently are running into an issue that I’m all too familiar with and I promised a few folks I would post this. I suspect this post will unintentionally ruffle someone’s feathers for some reason but I hope it doesn’t. I hope teams are able to read it and figure out some of the issues they have been having.

The problem:
So, you’re connecting your RoboRIO to your OpenMesh OM5P and you have a LimeLight or LemonLumen or PhotonCam or Jetson or Pi or whatever and you are connecting it to the second port when…

Sometimes, you boot up your 'bot and your co-processor just doesn’t connect. You check the radio and it’s up. You check the RoboRIO and it’s working. You then try running a ping command to your visionWidget and it doesn’t work. You verify it is powered but it just isn’t connected. Your next step is probably rebooting the robot, or unplugging a cable and replugging it, maybe replacing a cable. Sometimes that works… in fact, you probably blame the cable or blame a bad radio after you swap it to a new one.

2 or 3 matches later, it comes back. You dismiss it as gremlins and then swap another cable or something else and you keep scratching your head to the point of trepanning.

The Symptoms

Intermittently, something plugged into the second port on the OpenMesh OM5P radio just can’t be reached from the RoboRIO. Rebooting the robot might fix it but it eventually comes back. It’s not predictable, it’s intermittent. It only seems to happen when you are putting blue bumpers on your robot or when the safety captain programs the radios. It’s intermittent. You’ve probably correlated it to something because that’s what humans do.
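
If you want evidence instead of bumper-color superstition, a small reachability logger can help. This is only a minimal sketch: it opens a TCP connection rather than sending a real ICMP ping, and the address `10.9.0.11` and port `5801` are placeholders I made up for illustration, so substitute whatever your co-processor actually listens on.

```python
import socket
import time


def is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def status_line(host: str, port: int) -> str:
    """Build one timestamped log line describing whether the device answered."""
    state = "UP" if is_reachable(host, port) else "DOWN"
    return f"{time.strftime('%H:%M:%S')} {host}:{port} {state}"


if __name__ == "__main__":
    # Hypothetical address and port; substitute your co-processor's values.
    print(status_line("10.9.0.11", 5801))
```

Run it from a laptop on the robot network after each boot and keep the output. A log of UP/DOWN lines per power cycle makes it much easier to tell whether the dropouts really follow anything you changed.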

Why didn't this happen before now?

Well, it did. My first encounter with this bug was back in 2017. In fact, there was a Team Update in 2017 around Week 1 or Week 2 of competition season that enabled teams to use a switch connected to one port with the RoboRIO plugged into the switch instead of directly into the radio. Ever since then, I’ve been aware of it and I’ve tried to keep up with patches for it. As far as I know, all of the patches are in the current radio firmware we are using but it still seems to happen.

My best guess is that some update to your visionWidget is changing the timing of the port bring-up and the link speed negotiation, so the bug is suddenly showing up more often. Like a creepy clown hiding in a sewer, it’s been there since the beginning.

Executive Summary:
The Qualcomm QCA955x (QCA9557 and QCA9558) chipset that exists within both the OM5P-AN and OM5P-AC Radios has a bug where the link between the chipset and the ethernet interface sometimes does not come up as expected. It is triggered by a link state change (is a device connected or not) or by a link speed change (ethernet supported speed auto-negotiation). No one seems to know what causes this and there are quite a few patches to the OpenWRT project (https://openwrt.org), the Gluon project (see the Gluon 2023.1.1 documentation), and a few others to try and resolve it.

Longer detailed technical explanation:

This is going to get ugly and a bit hand-wavey for some folks. The QCA955x is a System on Chip (SoC) from Qualcomm. It does not include any method to connect to wireless systems or anything to allow it to directly connect to ethernet ports. For those we need a wireless chipset as well as physical ethernet interface chipsets. These physical ethernet chipsets are sometimes called PHYs and they come in quite a few “media independent” flavors (see the Wikipedia article “Media-independent interface”).

So now we get to the good part and I’m just going to quote from some development mailing list sources here:

This commit adds a workaround for the loss of the SGMII link observed on
the QCA955x generation of SoCs. The workaround originates part from the
U-Boot source code, part from the implementation from AVM found in the
GPL tarball for the AVM FRITZ!WLAN Repeater 450E.

The bug results in a stuck SGMII link between the PHY device and the SoC
side. This has only been observed with the Atheros AR8033 PHY and most
likely all devices using such combination are affected.

It is worked around by reading a hidden SGMII status register and
issuing a SGMII PHY reset until the link becomes useable again.

[OpenWrt-Devel,2/4] ath79: add QCA955x SGMII link loss workaround - Patchwork

and another:

The SGMII link of the QCA955x seems to be unstable when the PHY changes the
link speed. Reseting the SGMII and the PHY management control seems to
resolve this problem. This was observed with an AR8033 and QCA9558

[OpenWrt-Devel,RFC] ar71xx: Reset QCA955x SGMII link on speed change - Patchwork

and another:

The QCA955X is affected by a hardware bug which causes link-loss of the
SGMII link between SoC and PHY. This happens on change of link-state or
speed.

It is not really known what causes this bug. It definitely occurs when
using a AR8033 Gigabit Ethernet PHY.

[OpenWrt-Devel] [PATCH 1/2] ar71xx: fix QCA955X SGMII link loss

And there are similar devices on the market with similar issues because they are using the same chipsets:

ar71xx: enable QCA955x SGMII fixup on Rambutan

fixes intermittent loss of connectivity on 1Gbit port

git.openwrt.org Git - openwrt/openwrt.git/commit

Basically, it’s a bug with the link negotiation of the second port. I can’t explain why it happens any better than any of the developers on these mailing lists can. It’s intermittent and results in loss of the link state - so that’s why your visionWidget sometimes doesn’t connect.
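
For the curious, the workaround described in the first quoted patch (read a status register, reset until the link is usable again) boils down to a bounded retry loop. The sketch below is a toy simulation of that idea only. It is not the OpenWrt driver code and it touches no hardware; `FakeSgmiiLink`, `recover_link`, and `max_resets` are names I invented for illustration.

```python
class FakeSgmiiLink:
    """Toy stand-in for the SoC-to-PHY link; NOT real hardware access."""

    def __init__(self, resets_needed: int):
        self.resets_needed = resets_needed  # resets required before the link recovers
        self.resets_done = 0

    def status_ok(self) -> bool:
        """Pretend to read the status register."""
        return self.resets_done >= self.resets_needed

    def reset(self) -> None:
        """Pretend to issue one link reset."""
        self.resets_done += 1


def recover_link(link: FakeSgmiiLink, max_resets: int = 20) -> bool:
    """Reset the link until its status reads OK, giving up after max_resets."""
    for _ in range(max_resets):
        if link.status_ok():
            return True
        link.reset()
    return link.status_ok()
```

The cap on resets is the part worth noticing: a loop like this can only retry so many times, which fits with a software workaround narrowing the problem without making it disappear.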

The recommended fix:
Don’t use the second port on the OM5P Radios. Use an ethernet switch and plug the switch into the primary port. Plug everything else on your robot into the switch.

Is there a guide for how to do that?

Yes. See this paper: ZebraSwitch - Reliable Robot Ethernet that won't break your BOM

Anything else?

Yeah, I’d probably stick to using static IP addresses on the field and not rely on mDNS/Avahi/Zeroconf or DHCP. Entirely up to you though. Also, we continue to recommend this white paper on our “ZebraSwitch” methodology for how to make a reliable robot ethernet switch for an FRC robot. It’s proven to be stable for many teams, including us. You can also choose your own path to making a reliable ethernet switch but the important thing is, you might want to use a switch.
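
If you do go static, the usual FRC addressing pattern is 10.TE.AM.x, with the radio at .1 and the RoboRIO at .2. Here is a minimal sketch for working out the addresses from a team number. The helper name `frc_static_ip` and the `.11` host octet for the co-processor are my own assumptions for illustration, and the sketch only handles team numbers of up to four digits.

```python
def frc_static_ip(team: int, host: int) -> str:
    """Return 10.TE.AM.host for a team number of up to four digits."""
    if not 1 <= team <= 9999:
        raise ValueError("this sketch only handles team numbers 1-9999")
    if not 1 <= host <= 254:
        raise ValueError("host octet must be between 1 and 254")
    return f"10.{team // 100}.{team % 100}.{host}"


# Hypothetical address plan for team 900; only .1 and .2 follow the usual FRC convention.
ADDRESS_PLAN = {
    "radio": frc_static_ip(900, 1),
    "roborio": frc_static_ip(900, 2),
    "coprocessor": frc_static_ip(900, 11),  # assumed host octet
}
```

Whatever host octets you pick, write them down and keep them the same on the practice robot and the competition robot.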

Zebra switch setup will be on both 111 and 112 before Midwest. Had loads of issues at CIR and during practice. Thought it was the cable we made, but I guess not!

We have this kind of setup, and we’ve been specifically told by FTAs to plug the Rio into one port of the radio, and our switch into the other port of the radio.

So what I’m hearing from you is that we should NOT do that, and instead have our switch between the radio and RoboRio? If true, they should stop recommending that to teams.

We’ve tried the switch method in past years. Works great for us as a team. Sucks when we get to events though. First thing any FTA or CSA has us do if there is an issue is unplug our switch and immediately assume that’s what’s causing any issues related to comms. I’ve been told more times than I can count during inspection that the switch is an added failure point and that we should only be using our radio if we can (normally only have the Rio and Limelight, so it’s easy enough for them to say we don’t need a switch cause we only have two devices).

So I’ll defend the FTAs here. They are probably doing what they think is best and ensuring your RoboRIO is connected no matter what. That’s awesome. It probably means you’ll drive.

Some teams don’t take power regulation seriously for things like ethernet switches and co-processors but others do.

Sadly, many FTAs are swamped at events and don’t have the time to dig into a robot and figure out exactly what is broken and they are looking for a tactical solution to get a team running. This is a good tactical solution.

Unfortunately, it’s not a good strategic solution.

And per the blue box in R703, the switch is legal. Teams can keep it there without any issue but the FTAs might not like it if they are trying to troubleshoot a problem.

You have a third device at competition…your DS for tethering in the pits.

“Just use a USB cable”… or so I’m told :woman_shrugging:

Which do you consider the “First” ethernet port on the radio? The Left or Right one? The “18-24V POE” or the “802.3af POE” ? Or do you mean the “first” (and only in your recommendation) connection made?

Thanks Marshall. I believe we experienced this problem this past Saturday and resolved it by changing the architecture as you’ve recommended.

The 18-24V POE port is the “first” port (FIRST port?).

We are using a Brainboxes SW-005 switch this year, and highly recommend it. 5 ports, convenient mounting flanges, and 5-30V regulated power using screw terminals.

Edit: Here is a CAD model for the switch, including a 3D-printable bracket that attaches the radio to the switch using the mounting flanges.

This is very concerning, and good advice. But it might not be the best choice for everyone. In our case:

  • The Rio is connected to the “First” (left) radio port
  • The switch is connected to the “Second” (right) radio port
  • The Limelight is connected to the switch.
  • If the Rio doesn’t get a target from the Limelight, the shooter defaults to a Fender shot.

So in this architecture, a failure of the Radio Port 2 causes us to have to make Fender shots, but otherwise we can keep playing. But if we were to reorganize the radio to feed through the switch, we introduce a couple of new failure modes (the switch, a second wire, and two connections) between the Rio and the Radio. If any of those fail, we can’t play at all. With that in mind, I’m tempted to take our chances with the switch on the 2nd radio port.

We have the RoboRIO on the first port and a Limelight on the 2nd port. On Saturday we powered up our robot during practice one time and the Limelight didn’t connect or show up in the dashboard. Powering off the Limelight at the MPM fuse and also restarting the dashboard at the same time fixed it (not sure which one it was). It certainly sounds like this problem. I think we had this in competition in 2020 with our Raspberry Pi too, and we installed a switch, but never got to compete again. Thanks for reminding us of ZebraSwitch.

We have two of the SW-005’s… we will try installing one of those if we get time before the next competition.

It would be nice if there was an update to the FTAs about this.

I think this is a fair point but not everyone can shoot from the fender and not every game enables designs for that. Teams are getting more and more reliant on vision systems and co-processors and reliable ethernet is important.

It’s fair to say that there are additional hardware failure points. It’s also fair to say that there is a reduction in software/firmware failure points.

We’ve been using the ZebraSwitch setup as described in their paper since 2019, so 3 robots so far.

Since 2019 we’ve had 0 problems with our limelight or network connections. Nothing at home, nothing at competition.

Throughout build season I get lots of questions from within the team:

  • “why can’t we just use the second port on the radio”
  • “why do we need a switch”
  • “why do you want to spend extra money on a switch and increase wiring”
  • “why do you need the voltage step up/down converter?”
  • “why do you want me to design a 3D printed case for it, we can just velcro it?”

Long story short:
You’ll probably come up with many excuses why not to do this, and though it might seem like a big effort, you’ll 1000% thank 900 for their efforts in making a robust solution that JUST WORKS.

We have a spare 3D printed box with switch and voltage regulator; we’re bringing it to Houston plus our practice robot set. If we end up playing play-offs I’ll highly recommend a team that’s not using a switch to take our spares. Our alliance partner 1884 had network radio issues in the semi finals this year and I’ve heard many other teams talking about/shouting through a match: “WE LOST OUR LIMELIGHT, WE’RE GOING TO PLAY DEFENSE”

A simple and robust solution will save you lots of headache at competition. It’s just one thing less to worry about on your robot!


If you want to adopt our design, here are our CAD files: 4481 ZebraSwitch CAD - Google Drive

EDIT:

We run our radio and Limelight through the switch (the USB webcam goes through the Limelight), and on the practice field we connect our ethernet cable to the switch as well. It’s mounted on the side of our shooter tower, which makes it easy to plug in your ethernet cable for maintenance/testing.

You can see it down below our radio on the side of the robot.

This case is the best thing ever. Absolutely love it.

I’ve heard many other teams talking about/shouting through a match: “WE LOST OUR LIMELIGHT, WE’RE GOING TO PLAY DEFENSE”

You can literally hear our driver say this in our mic’d up match.

We just labeled ours.

Another critical reason to use a switch: when you tether, your Limelight actually WORKS and you can troubleshoot it. No switch = no vision when tethered. That issue chewed a chunk off of us in 2020… after that I said: there shall be a switch. All the time.

The alternate solution here is using the WPILib port-forwarding functionality, which can forward traffic from the RIO’s ethernet-over-USB interface to devices on its actual ethernet port.

However, you’ll still be exposed to the issue Marshall mentioned above.

So your punchline remains true: Just use a switch. All the cool kids are doing it.
