Losing Comms Often

We are experiencing an odd, intermittent problem. We can drive our robot for quite a long time without issue, then out of nowhere we’ll start having massive problems with comms. Looking through the log file, we see the following:
Voltage never going below 10 volts
CAN bus utilization stable between 43-47%
CPU percentage never higher than 70%
We have current limiting and ramps, so never drawing more than 80 amps
Packet trip time is somewhere between 4-7 ms

Ergo, none of the usual suspects for dropped comms. The odd thing is that it’s not a full radio reboot: we lose comms, and the driver station reconnects after ~5 seconds. We are just getting spurts of massive network issues.

What we have tried:
Multiple driver stations
Multiple radios
Limelight disconnected (no camera streams at all)
Eliminated most of our NetworkTables traffic (we are down to about 8 booleans total).
Code reviewed to make sure there are no blocking while loops or anything else that would starve the control loop (see the sketch below)
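
For anyone curious what that review was looking for, here’s a minimal sketch (device names and ports are made up for illustration) of the kind of blocking loop we were hunting for, next to the non-blocking version we made sure we had instead:

```java
import edu.wpi.first.wpilibj.DigitalInput;
import edu.wpi.first.wpilibj.TimedRobot;
import edu.wpi.first.wpilibj.motorcontrol.PWMSparkMax;

public class Robot extends TimedRobot {
  // Hypothetical intake hardware, purely for illustration.
  private final PWMSparkMax intakeMotor = new PWMSparkMax(0);
  private final DigitalInput intakeSwitch = new DigitalInput(0);

  @Override
  public void teleopPeriodic() {
    // Anti-pattern (left commented out): busy-waiting here blocks the 20 ms
    // control loop until the switch trips, so nothing else in robot code runs.
    //
    // while (!intakeSwitch.get()) {
    //   intakeMotor.set(0.5);
    // }

    // Non-blocking version: check the condition once per loop iteration instead.
    intakeMotor.set(intakeSwitch.get() ? 0.0 : 0.5);
  }
}
```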

I’ve gone through everything I can think of, so I’m looking for other ideas we may be missing. It doesn’t seem like an electrical problem since the radio recovers so quickly, but I’m happy to hear why I’m wrong about that. There are zero brownout events in the log file, which also makes me think we’re not having electrical issues. We are not running with the bandwidth limitation; the radio is just flashed as a normal radio, and it is connected over PoE only.

Any other troubleshooting steps to try to track it down?

Are you able to post the .dslog file from the Driver Station?

Could also be your computer utilization. I’ve seen our team’s laptop have lots of applications open along with a flaky (for lack of a better word) DS connection. Closing out those applications appeared to improve it, and a full laptop reboot seemed to fix it entirely.

We had a similar situation and corrected it by removing our NavX. We had tried multiple roboRIOs and the same loss of comms kept happening. Our Limelight is still attached.

I can get access to the logs tonight, forgot to put them on a USB stick when we were done testing yesterday. I’ll post them here when I have them.

Interesting thought on the NavX. We aren’t really using it this year, so aside from the fact that it’s physically difficult to remove with our intake design, it should be an easy test. Will try that tonight.

We weren’t rebooting every time, but the fact that it happened with multiple driver stations (at various stages of reboot/usage) makes me think the problem isn’t there. We compete this weekend, though, so we’re at DEFCON 1 as far as getting this working goes; the problem is keeping our drivers from being able to practice. Will try this as a differential when it starts happening again.

Make sure all firewalls are turned off and there’s no software running in the background. A firewall could reasonably see unknown packets being sent/received and decide it’s malware that needs to be blocked. Or an updater running in the background could decide it needs to check for an update right now and hold up all network traffic for a few seconds.

We have the same issue where it’s fine most of the time, but there will be periods of tons of dropped packets and lag, and then it clears itself up. Not sure what’s happening, but it’s been a problem across several robots, routers, and computers.

We looked down this path as well. The weird thing is that once it starts happening, I can get it to trigger with 100% reliability within 3-5 seconds. Then we swap out the battery, and it will work for 15 minutes just fine. Swap the battery, starts breaking again immediately. It’s crazy hard to pin down, but it seems too predictable to be software updates.

Firewalls are already disabled.

I appreciate any and all ideas that we may have missed though. Stress levels are high.

Have you tried tethering, and does that have any impact?

We’ve experienced issues with WiFi spectrum congestion at our build site that can lead to similar issues. We’re building in a site with an office upstairs broadcasting a strong guest WiFi signal. In our space, we also have our own team access point broadcasting a WiFi signal for our students. Then of course every one of our robots has its own radio.

When I first set up our team access point, the office WiFi was occupying channel 6, so I set ours to channel 11, and all the team radios are on channel 1. This seemed to work fine until one day we noticed our WiFi connection speed was terrible, and students were experiencing intermittent connection errors with the robots. When I investigated, I found that the office WiFi had changed to channel 11, and I assume the access points and all our devices were engaged in a massive 2.4 GHz “shouting match” that was causing interference for everyone. Switching our team access point back to channel 6 made all the problems go away.

Try removing the PoE cable; PoE cables sometimes cause problems. Also, although you said there were no shorts, I would check whether the main circuit breaker is loose, along with the 10 A fuse that powers the roboRIO. Those seem to be the most common places where people run into comms problems.

If you can make the problem occur on command, have someone watch the roboRIO status lights, particularly the COMM light. Is it going from green to red, and then back to green ~5 seconds later? If so, that is probably a code crash.
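
If it does look like a crash rather than a real network drop, one quick check (just a sketch; runTeleopLogic stands in for your actual periodic code) is to temporarily wrap the periodic code and report any exception to the Driver Station, so the error shows up in the DS console instead of the whole program restarting:

```java
import edu.wpi.first.wpilibj.DriverStation;
import edu.wpi.first.wpilibj.TimedRobot;

public class Robot extends TimedRobot {
  @Override
  public void teleopPeriodic() {
    try {
      runTeleopLogic(); // your normal teleop code
    } catch (Exception e) {
      // Surface the exception in the Driver Station console/log so a code
      // crash is obvious, instead of the program silently dying and restarting.
      DriverStation.reportError("teleopPeriodic threw: " + e, true);
    }
  }

  private void runTeleopLogic() {
    // Placeholder for the team's actual teleop code.
  }
}
```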

Check the wiring from the roboRIO to the PDP, including the dedicated fuse in the PDP. Press it in; if your finger doesn’t hurt when doing so, you aren’t pressing hard enough. Make sure the power connector on the roboRIO is tight and that the wires are inserted fully into the connector with no whiskers that could short across. Do the same on the Weidmuller connection at the PDP, making sure enough wire was stripped, that it is fully inserted, and that there are no whiskers.

As far as programming goes, try deploying the basic drive template or something similar with no extra stuff and see if that stays connected.
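
Something along these lines is about as bare-bones as it gets (Java shown; the PWM ports and controller mapping are placeholders, so adjust them to your drivetrain). If comms stay solid with this deployed, the problem is likely somewhere in the full code:

```java
import edu.wpi.first.wpilibj.TimedRobot;
import edu.wpi.first.wpilibj.XboxController;
import edu.wpi.first.wpilibj.drive.DifferentialDrive;
import edu.wpi.first.wpilibj.motorcontrol.PWMSparkMax;

public class Robot extends TimedRobot {
  // Placeholder ports; match these to your actual drivetrain wiring.
  private final PWMSparkMax leftMotor = new PWMSparkMax(0);
  private final PWMSparkMax rightMotor = new PWMSparkMax(1);
  private final DifferentialDrive drive = new DifferentialDrive(leftMotor, rightMotor);
  private final XboxController controller = new XboxController(0);

  @Override
  public void robotInit() {
    // One side of a differential drivetrain is typically inverted.
    rightMotor.setInverted(true);
  }

  @Override
  public void teleopPeriodic() {
    // Simple arcade drive: left stick Y for speed, right stick X for rotation.
    drive.arcadeDrive(-controller.getLeftY(), -controller.getRightX());
  }
}
```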

You didn’t say what language you are using. If C++ or Java, make sure you are not flooding the console with printf/cout or System.out statements for debugging. These can cause the output buffer to the driver station to fill up and stall comms.
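
If you do need debug output, throttling it is much kinder to that buffer than printing every 20 ms iteration. A rough Java sketch (getArmPosition is just a stand-in for whatever you’re debugging):

```java
import edu.wpi.first.wpilibj.TimedRobot;

public class Robot extends TimedRobot {
  private int loopCount = 0;

  @Override
  public void teleopPeriodic() {
    loopCount++;
    // Print roughly once per second (50 iterations at 20 ms) instead of every
    // loop, so the console stream to the Driver Station doesn't back up.
    if (loopCount % 50 == 0) {
      System.out.println("arm position: " + getArmPosition());
    }
  }

  private double getArmPosition() {
    return 0.0; // stand-in for a real sensor read
  }
}
```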

You switched laptops, but one of the CAD programs is notorious for blocking comms on the driver station. Remove the CAD programs from your main driver station.

Posting a DS log would help.

Make a very basic robot program, deploy it to the robot, and see if the behaviour stops. If it does, make a copy of your main robot code and edit it to remove half of the functionality. If the comms problem persists, remove another half of the functionality; if the problem stops, add half of the functionality back in. Re-test and repeat until you narrow down the code causing the issue.

Report back what your current status is and if you fixed the problem.

Thanks for the replies. I haven’t been on CD for a bit and should have updated sooner. Fortunately we fixed it; unfortunately, we didn’t go about it scientifically enough to know exactly which change was the fix.

We ended up swapping out the PoE cable, and I had them re-wire the power to the VRM as well. Interestingly enough, the power fluctuations were never enough to make the radio reboot or fully lose comms (as reported by the Driver Station). They were just enough that our control packets didn’t make it to the robot for brief periods.

Thanks all for the advice.
