It seems the data-packet-dropping problem has claimed some victims yet again in Houston…
Am I right in thinking that? During the Auto Run, a number of robots exhibited strange behaviors, like dropping cubes in the opposing alliance’s switch or scale. The problem was well described in earlier threads on the CD forum.
From what I know of radio communications and Wi-Fi, it appears to me that the frequency band the Open-Mesh radio operates in is getting saturated, probably from the many smartphones and tablets people forget to turn off and that keep chattering in the 2.4 GHz band.
Also, there is a fundamental weakness in how the control data is carried: the DS-to-robot control packets travel over UDP, which is “fire and forget”, i.e. it gives the sender no feedback on whether a packet has been received or not. (TCP does acknowledge delivery, but all it can do is retransmit, which over a congested wireless link just adds latency.)
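As an illustration of the UDP half of that, here is a minimal plain-Java sketch (not robot code; the loopback address and port number are just placeholders) showing that a UDP send completes with no indication of whether anything actually received the packet:

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

public class UdpNoAckDemo {
    public static void main(String[] args) throws Exception {
        byte[] payload = "control packet".getBytes();
        DatagramSocket socket = new DatagramSocket();
        // Send to loopback on a port where nothing is listening.
        DatagramPacket packet = new DatagramPacket(
                payload, payload.length, InetAddress.getByName("127.0.0.1"), 1110);
        socket.send(packet);  // returns immediately; no ACK, no error if the packet is dropped
        System.out.println("send() completed - the sender cannot tell whether the packet arrived");
        socket.close();
    }
}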
If this is the problem, it must be corrected PDQ for next year (I consider this year a write-off since it is unlikely a change will be pushed out in time to do any good).
Next year, if the problem is not addressed, teams may have to add a network packet logger on their robot, with time tags, to show which packets were actually received from the DS. Spectrum congestion may be a harsh reality of the 21st century, but it is not fair that teams be exposed to a problem they have so little control over, IMHO.
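For what it’s worth, such a logger doesn’t have to be elaborate. Here is a rough sketch of the idea for a Java/WPILib robot (written against the 2018-era API; the file path and class name are made up for illustration): each loop, record a timestamp plus whether the DS and FMS links are up, so after the match you can show when control data stopped arriving.

import java.io.FileWriter;
import java.io.IOException;
import edu.wpi.first.wpilibj.DriverStation;
import edu.wpi.first.wpilibj.Timer;

// Hypothetical helper: construct once, then call log() from robotPeriodic().
public class CommsLogger {
    private FileWriter writer;

    public CommsLogger() {
        try {
            // /home/lvuser is writable on the roboRIO; the file name is arbitrary.
            writer = new FileWriter("/home/lvuser/comms-log.csv", true);
            writer.write("fpgaTime,dsAttached,fmsAttached\n");
        } catch (IOException e) {
            DriverStation.reportError("CommsLogger init failed: " + e.getMessage(), false);
        }
    }

    public void log() {
        if (writer == null) {
            return;
        }
        try {
            DriverStation ds = DriverStation.getInstance();
            writer.write(Timer.getFPGATimestamp() + ","
                    + ds.isDSAttached() + ","
                    + ds.isFMSAttached() + "\n");
            writer.flush();  // flush each sample so the data survives a rio reboot
        } catch (IOException e) {
            DriverStation.reportError("CommsLogger write failed: " + e.getMessage(), false);
        }
    }
}

Calling log() from robotPeriodic() samples at roughly the same 20 ms cadence as the DS control packets.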
Assuming this is the same issue that’s come up numerous times this year (if it isn’t, please let me know), then this is a good read:
I’m glad 254 got a replay out of it, but I’m less glad that it continues to happen, and that we were very adamantly told it was our fault (looks like it’s not).
We also had issues with FMS connection on Carver. We’ve played 50+ matches in NC with no comms issues whatsoever, running a redundant barrel jack and PoE, with the VRM wires strain-relieved.
We had a match where we were disabled for ~40 seconds, with no radio reboot (we think FMS just dropped us). We had GoPro footage of the whole match and the radio lights are on in every single frame.
Our driver also says the robot felt laggy for a few seconds after reconnection.
Team 3512 has experienced packet dropping where our robot stops moving for 2-3 seconds in 3 out of 4 matches on Turing so far, and it happens multiple times a match. We’ve taken steps to reduce the bandwidth we are using in an effort to solve this issue, but nothing we’ve tried has had an effect. After talking to other teams on Turing, this seems to be a widespread issue, at least on our field.
It’s really frustrating to be plagued by issues we can’t control
Post your driver station logs. Maybe we can find a smoking gun.
I can sympathize though. We’ve had packet loss problems all year. We’ve followed every best practice, and we’ve swapped Ethernet cables, the radio, and the roboRIO. Several FTAs have looked at the logs, but because it hasn’t been alliance-wide, it’s been deemed a problem on our end. I’ll add that some of these FTAs are former Pi - so if they saw something they would tell us.
Note that in the second file (the CPU-related one) we saw a clear correlation between high roboRIO CPU usage and packet loss. That is when we are running the LabVIEW code from our laptop. Normal CPU usage with deployed code rarely strays north of 70% (50% now that we’ve removed a bunch of unneeded stuff).
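If anyone wants to capture that correlation on the field rather than only while debugging, one low-effort option (a sketch, assuming a Java robot; the roboRIO runs Linux, so /proc/loadavg is readable) is to publish the load average as telemetry so it sits next to the packet-loss trace in your logs. Note this is only a rough proxy for the CPU % shown in the DS log.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import edu.wpi.first.wpilibj.smartdashboard.SmartDashboard;

// Hypothetical helper: call publish() from robotPeriodic().
public class CpuLoadMonitor {
    public void publish() {
        try {
            // /proc/loadavg begins with the 1-, 5-, and 15-minute load averages.
            String contents = new String(Files.readAllBytes(Paths.get("/proc/loadavg")));
            String[] fields = contents.trim().split("\\s+");
            SmartDashboard.putNumber("roboRIO load (1 min)", Double.parseDouble(fields[0]));
        } catch (IOException | RuntimeException e) {
            // Telemetry only - ignore failures.
        }
    }
}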
The packet loss at states was bad enough that our elevator has dropped onto the scale because of a loss of comms.
We’ve also taken to writing the field configuration we receive from the field to the dashboard, just in case something like this happens, so that we can point to the value and show exactly what we received and whether it was in line with the actual field. Hopefully this’ll be a bit of evidence on our side if we ever have to fight for a replay.
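For anyone wanting to do the same on a Java/WPILib robot (just a sketch; we’re on LabVIEW, and the dashboard key name here is arbitrary), it’s only a couple of lines:

import edu.wpi.first.wpilibj.DriverStation;
import edu.wpi.first.wpilibj.smartdashboard.SmartDashboard;

// Call from autonomousInit() (and optionally robotPeriodic()) so the value the
// robot actually received from FMS is captured on the dashboard.
public void recordFieldConfiguration() {
    // For 2018 this is the plate-assignment string, e.g. "LRL".
    String gameData = DriverStation.getInstance().getGameSpecificMessage();
    SmartDashboard.putString("FMS game data as received", gameData);
}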
I don’t see any obvious “smoking gun” but it does seem like packet loss has a few spikes, as well as latency. Again, our radio never lost power during the match…
Yeah - me neither. The only thing that might have a slight correlation is the jump in CPU to 85% or so during auto. It coincides with a longer trip time and maybe some packet loss. I haven’t quantified the CPU % crossover point where it starts affecting your communications, though. And it’s hardly a valid data point on its own - I see longer trip times elsewhere with no CPU rise. I’m going to go watch the video of your match. I have a hunch.
Ok - I didn’t see what I expected, but I think I’ve got a smoking gun. Check your event log to see if anything jumps out at you. The RSL (signal light) on your robot gets power from the roboRIO. If you watch the video, when you died your RSL dies for 2-3 seconds. That isn’t indicative of a loss of communication from the robot; that’s a loss of power to your rio and a subsequent reboot. Check the battery’s leads, check all your battery-to-PDP connections, check the power plug on the rio by pulling on the wires, and check that your main breaker isn’t “touchy” - lightly tap the red button on it and see if you lose power. Something in your power path killed your rio, especially since your GoPro shows your radio never losing power. Your event log should show a record of whether you lost the rio, the radio, or both.
In fact, I’d suspect the power between the PDP and the rio. If the fault were upstream in the PDP power path, your radio probably would have died too, although I don’t know whether the VRM would ride through a very short loss of power.
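One way to help sort a brownout from a hard power loss in your own logs (a sketch assuming Java/WPILib; the 2018 API exposes this through RobotController): a brownout leaves the rio running with outputs disabled, so code can report it, while a full power loss that reboots the rio can’t be logged by the rio itself but will show up as a reboot in the DS event log.

import edu.wpi.first.wpilibj.DriverStation;
import edu.wpi.first.wpilibj.RobotController;

// Hypothetical check: call from robotPeriodic().
public void checkPower() {
    if (RobotController.isBrownedOut()) {
        DriverStation.reportWarning(
                "Brownout at input voltage " + RobotController.getInputVoltage() + " V", false);
    }
}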
In the first log you posted, there seems to be a correlation between bursts of what look like print events (the green circles near the top) and each burst of packet loss. What do those correspond to in your source code?
What version of the roboRIO firmware/driver station are you using? There was an issue in the v16 roboRIO firmware that reportedly caused the print functionality to be much more disruptive than it normally is (and its normal impact is already fairly high).
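If prints do turn out to be the culprit, a cheap mitigation is to rate-limit them rather than rip them all out. A hypothetical Java helper along these lines (names are made up; a LabVIEW team would do the equivalent with an elapsed-time check before writing to the console):

import edu.wpi.first.wpilibj.Timer;

// Rate-limits console output so diagnostic prints can't flood the console
// (and the network link carrying it) during a match.
public class ThrottledPrinter {
    private final double minPeriodSeconds;
    private double lastPrint = Double.NEGATIVE_INFINITY;

    public ThrottledPrinter(double minPeriodSeconds) {
        this.minPeriodSeconds = minPeriodSeconds;
    }

    public void print(String message) {
        double now = Timer.getFPGATimestamp();
        if (now - lastPrint >= minPeriodSeconds) {
            System.out.println(message);
            lastPrint = now;
        }
    }
}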
In the second log you can see consistently large trip times (the green line on the chart). This is probably because both your rio and your laptop are running at high CPU utilization to support LabVIEW debugging (as you mentioned), and it likely correlates with the packet loss you see there as well.
Between the timing and the CAN bus utilization being low when the connection comes back, this looks like your roboRIO lost power or shorted and restarted. I’d double-check your rio connections and maybe use a ferrule crimper to put ferrule connectors on your rio power wires if you just have twisted bare wires right now.
What does your event list look like around the disconnect event?
LabVIEW doesn’t utilize prints - it utilizes front panel items that don’t get used when you actually deploy code permanently. Everything is updated to the newest version as well. It’s a mystery we have yet to solve. The only thing we haven’t done that I’m going to try Saturday is to deploy default code and see if it’s still there just to rule that out. I haven’t tried disconnecting our camera, but it’s an Axis IP camera running at around 3 megabits according to the dashboard, so I don’t expect that is hurting anything.
You can still print to console from labview, but regardless, if you hover over the green circles near the top of the packet loss events you should see what the console event was in the box in the bottom left of the log file viewer.
Yeah, that line at 4:42:25.383 is a dead giveaway - the driver station and field both stayed connected to your robot radio throughout the disconnect event, but you see that it goes to “crio - bad” momentarily; that’s your roboRIO falling off the network while its power goes down.
Just to note, even the tiniest interruption in power can cause the rio to reboot immediately, whereas the VRM does tend to keep the radio alive.
We had an issue that ended up being a slightly loose battery-to-breaker connection, and the momentary loss of metal-to-metal contact caused a rio dropout, but our bridge stayed online the whole time. It was very confusing to troubleshoot, partly for that reason, but a loose-contact power loss may not cause anything on your VRM to drop.