Lost Comms and Code Crash 291: A Full Report

Hey CD,

291, of all teams, has had more than its fair share of comms problems in the past few years. In 2015 we had a bad driver station Ethernet port, and in 2016, some loose connections to the VRM and a host of other wiring problems. This year, we made sure these things never happened, and they didn’t. Yet we still died on the field during playoffs. :frowning:

This year at Pittsburgh, however, it was different.
I will give a full account as best I can, including all of the details, possible causes, and symptoms. Bear with me. Keep in mind, we use LabVIEW.

Practice day on Thursday was uneventful (except for the usual rebuilding of major subsystems :p) with no comms loss in any of our matches. I can say, we took some very hard hits. The electrical system was as solid as a rock.

On Friday, in qualification match 31, we lost connection with about 18 seconds left in the match. As expected, comms did not come back before the end of the match. I talked to the FTA, and he told me that he never lost communication with our radio, the roboRIO went out for around a second, and our code was rebooting until the end of the match. (I never saw comms or code come back on the driver station; both stayed red.) Here is a video of this happening. We had not bumped into anything for a very long time, and we thoroughly tug tested all of the wires and fuses, and checked for shorts, but nothing seemed out of place. So, we switched the Ethernet wire from the radio to the roboRIO, and hoped that was the problem. It seemed to fix it as we went through the rest of our qualification matches on Friday and Saturday with no problems.

I should also mention, that we switched to using the 2016 radio before any matches on Thursday because of problems with the 2017 radio with the Axis camera. Our 2016 radio has some minor damage to the plastic casing, but we used it for 5 weeks of rigorous driver practice on our practice robot, and it had no problems (that doesn’t mean it didn’t, I’m just explaining why we thought it was a good idea to use a damaged radio).

On Saturday, during our first quarterfinal match, we had no comms problems, but there was severe lag in the controls for all 6 robots, so the match was replayed.

In the replay match, around 30 seconds into teleop, we collided with 144, and around 1 second later we lost comms. (Note that it wasn’t immediate. The RSL continued to flash, and the robot was able to turn under its own power after the collision.) Comms did not come back for the entire match. I talked to the FTA, and he told me he never lost connection to the radio. He lost the roboRIO, then it came back (I am blanking on the important detail of how long this took). The code then came back for around a half second, then went away and didn’t come back. Here is a video.

Before our next match, we switched the Ethernet cable from the radio to the roboRIO, I unplugged the Axis camera from the radio, and we tug tested everything. I also used a USB-to-Ethernet adapter on the driver station just in case. Our next match we did not lose comms (our alliance partner, 217, did, but that was a wiring problem). We also made it through the last quarters match with no comms problems.

In our first semis match, we lost comms after several mini-collisions around 30 seconds into teleop, and they did not come back. The FTA told me the same thing as the previous time. He also told me that the controls were lagging for everyone until he lost us, then they went back to normal, which is very strange. Here is a video.

4027 and 217 then made the very smart decision to call in a backup robot for us, so after that our day was over.

As soon as we got home, I tried to reproduce the problem, but could not. I did, however, realize that when our code first comes on (after restarting), Robot Code turns green, then flashes red for a split second, then goes green and stays green. It had been doing this all year, but I hadn’t thought much of it. I tried putting on the default LabVIEW robot code (with no CAN code). It came up solid green, and did not flash red at all. I copied the initialization of a Talon SRX (what we use for driving) over to the default code. Robot Code again flashed red on startup. I tried this several times. The Talon code caused the red flash. I also tried putting on example code for the Talon SRXs. It also flashed red.

The next practice, by chance, we disconnected from our test board (the Ethernet came out). We plugged back in, but when comms came back, robot code flashed green once, then went red and stayed red until we rebooted!!! This is exactly what happened on the field. I again tried the default code, (no srx) and disconnected and reconnected the ethernet. The code came back. I put in srx code. The code did not come back. I put in the example srx code and did the test, code came back.

We now had it narrowed down to two factors: unplugging the Ethernet connection to the driver station triggered it, and the Talon code was correlated with it (at least my Talon code).

We then tried connecting through the radio over Ethernet. When we would unplug the Ethernet from the radio to the rio, the robot code would not come back.

I now have the test board at home, and I cannot for the life of me reproduce this problem. I have the same laptop, rio, radios, and Ethernet cables. But robot code comes back every time even with talon code in there.
I don’t know what to do! I know I haven’t fixed the problem (whatever it was), but it has gone away and I can’t make it come back to see what it actually was. I am posting this here to see what you all have to say.

I know there are some very smart people here on CD, let’s see what we can come up with. Sorry about the book of a post, I wanted to put all relevant info into this post because I feel all of it is important and needed to come up with a solution. We will be going to the Buckeye regional next weekend, and I have to solve this before then.

Thanks,
Julia Cecchetti

Just a random guess here, but what is your roboRIO CPU% when the code is running? I had a weird problem with one team where they had a few loops running in periodic without waits. Their CPU% was constantly >95% and that caused weird issues with roboRIO startup and connecting to FMS. Adding waits in the loops lowered their CPU% dramatically and fixed the problem.
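To illustrate the point about loops without waits, here is a rough sketch in Python rather than LabVIEW. The function name `run_loop` and its parameters are made up for this example, and the loop body is a placeholder. A loop with no wait spins as fast as the CPU allows, while a loop with a wait gives the processor back between iterations.

```python
import time

def run_loop(duration_s, period_s=0.0):
    """Run a placeholder loop for duration_s seconds and return the iteration count.

    period_s is the wait added to each iteration; 0.0 means no wait at all.
    """
    iterations = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        iterations += 1              # stand-in for the loop's real work
        if period_s > 0.0:
            time.sleep(period_s)     # yield the CPU until the next iteration
    return iterations

if __name__ == "__main__":
    busy = run_loop(0.2)             # no wait: spins flat out
    paced = run_loop(0.2, 0.02)      # 20 ms wait per iteration
    print("iterations without wait:", busy)
    print("iterations with wait:   ", paced)
```

The unpaced loop should report far more iterations for the same wall-clock time, which is the same effect that shows up as a pegged CPU% on the Driver Station charts.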

You’ll also probably get better help if you actually post either your code in a zip file or pictures of all of the VIs (zip file is easier). If we can agree it’s a user code problem, with enough people looking through your code someone should be able to spot the bug.

Here is my LabVIEW code.

Also, This thread isn’t showing up in the portal, why is that?

IIRC some are defaulted not to show up, like anything in the chit-chat section.

Some have theirs set up to show these though (done by going to user cp -> recent posts config) and choosing to add the ones that aren’t shown by default.

How can I get this thread to show up? Should I delete and repost somewhere else?

It only doesn’t show up for you, probably due to your CD settings as MikLast noted.
Go to *user cp* then *Recent Posts Config* to see if Technical -> Control System is checked.
I see it in the Portal.

Here is a screenshot of the ds log of the semifinal match

Can you upload the log file itself?

Could you zip and post the two actual DS log files (.dsevents/.dslog) associated with that match?
There is a lot more information there that would be useful to look at.

The files are in C://Users/Public/Public Documents/FRC/Log Files
Each log is made up of two files:

  • 2017_03_18 14_23_40 Sat.dsevents
  • 2017_03_18 14_23_40 Sat.dslog
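If there are a lot of matches to send, a small script can pick out the complete pairs. This is only a sketch in Python, assuming nothing beyond the naming shown above (a shared timestamp stem with `.dsevents` and `.dslog` extensions); `pair_logs` is a made-up helper name.

```python
from pathlib import Path

def pair_logs(filenames):
    """Return the sorted stems that have both a .dsevents and a .dslog file."""
    wanted = {".dsevents", ".dslog"}
    by_stem = {}
    for name in filenames:
        path = Path(name)
        if path.suffix in wanted:
            # Collect which of the two extensions exist for this stem
            by_stem.setdefault(path.stem, set()).add(path.suffix)
    return sorted(stem for stem, exts in by_stem.items() if exts == wanted)

if __name__ == "__main__":
    names = [
        "2017_03_18 14_23_40 Sat.dsevents",
        "2017_03_18 14_23_40 Sat.dslog",
        "notes.txt",
    ]
    for stem in pair_logs(names):
        print(stem)
```

To run it against a real folder, pass in the names from `Path(folder).iterdir()` and zip up both files for each stem it returns.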

Here are the Log files for quals 31, and the two playoff matches on Saturday

I’m seeing this in all three logs, and then approx. 2-3 minutes thereafter the robot loses comms:

11:32:47.923 AM
ERROR -44003 FRC: Operation failed due to a communication failure with the camera. WPI_CameraIssue HTTP Request with Authentication.vi

Any of my CSA buddies wanna help out and take a look? My LabVIEW knowledge is somewhat lacking.

The camera message isn’t going to be related to this particular problem. It’s probably that the IP camera declared in the code isn’t connected when the code first goes to look for it.

I suspect that the CAN wiring on the robot itself is contributing to the problem, for example if your electronics board didn’t take some of the CAN wiring that was on the robot, or you changed it when you set up for operation off the robot.
Possibly a poor connection along the CAN wiring chain that’s causing CAN packet delays or errors.
On your next practice day also look at the *Power & CAN* tab on the Driver Station and watch the CAN Metrics for more information.

One issue from the logs may be the radio placement in the robot and what it is close to. Ideally it has a clear line of sight 360 degrees around the robot (within the constraints of the robot frame) and is not mounted on or close to metal or noise sources like the battery/VRM.
This may also be contributed to by the conflicting bandwidth problem you said the teams were having.

I just unplugged one of the can wires, and the code still comes back. To be clear, I have an “identical” electrical board to what is on our competition robot minus the NavX: 6 talon srx’s, the same radios, and USB camera. I reproduced the problem with this exact electronics board yesterday at practice, but after I got home (around a half hour later) I couldn’t make it happen again. I have the electronics board in front of me right now, so I can try everything as you all suggest things.

So throw out ideas, experiences, or even hunches! I’m willing to try anything

A completely disconnected CAN wire shouldn’t be a problem.
I’d look for noisy or insufficiently terminated CAN wiring that still works (obviously since you can drive), but might be consistently requiring retries.

Look for things that are different from when you were at practice and got it to fail, e.g., a particular battery, the encoders providing input, the motor loads on the practice bot, that kind of thing.

I can’t look at all your code since it spreads outside your project files (see attached), but from what you did provide I can’t see anything unusual that might contribute to this problem.
It would be nice to see your CAN Teleop code.

P.S.
It may also be a problem that only manifests on a brief power dip. You can try purposely rigging up a way to interrupt the roboRIO power only momentarily and try doing that over and over again to see if the problem only occurs a small percentage of the time.
If it happened on a particular Ethernet cable jarring loose, then try it with that cable.
I’ve had roboRIOs that don’t always initialize the network adapter properly on power up and need to be power cycled to get it to work again. If you get it to happen again, take a look at the two status lights on the roboRIO Ethernet connector to see that you have both a green and a yellow status light on.

3Motorvi.png



I’d second the battery suggestion. I found out one of our batteries used in a match had a dead cell. I can imagine a situation where getting into a collision draws a large current spike causing the voltage to drop below brown-out levels on the roborio. I’m not sure exactly how much current you’d be able to draw with a dead cell and if it would be enough to bring it to blackout levels. Here is the analysis of our battery with a dead cell: http://imgur.com/1gLa0YL
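For a very rough feel for the numbers, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration, not a measurement: about 2.1 V per healthy cell with no load, a dead cell contributing 0 V, a total internal resistance of 0.015 ohm, and a 6.8 V brownout threshold. It uses the simple model terminal voltage = open-circuit voltage - current x internal resistance.

```python
def max_current_before_brownout(cells_ok, volts_per_cell=2.1,
                                internal_ohms=0.015, brownout_v=6.8):
    """Estimate the current (amps) at which the terminal voltage hits brownout_v.

    All default values are illustrative assumptions, not measured data.
    """
    open_circuit_v = cells_ok * volts_per_cell   # no-load battery voltage
    headroom_v = open_circuit_v - brownout_v     # voltage we can afford to lose
    return max(headroom_v, 0.0) / internal_ohms  # I = V / R

if __name__ == "__main__":
    print("6 good cells:", round(max_current_before_brownout(6)), "A")
    print("5 good cells:", round(max_current_before_brownout(5)), "A")
```

A real dead cell also tends to raise the internal resistance, so the actual margin would likely be smaller than this simple model suggests.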

Your CPU usage is on the high side, especially during Auto (it’s nearly 100%). Are you doing something particularly memory intensive or could it be a code inefficiency somewhere? (I would look at your code for myself but I forgot that my LabVIEW expired and I haven’t gotten around to updating it) I was working with a team at ISR3 whose code wouldn’t start because they had their CPU% too high. Lowering that allowed the code to start. I wonder if you aren’t having the same problem.

Something to try is to hit the Restart Robot Code button on the DS after the robot starts up and see if the code starts or not. For the team I was working with, their code wouldn’t start with the roboRIO, but when they hit Restart Robot Code it would start. I also wonder whether hitting Restart Robot Code when this problem occurs would fix it.

Another thing I noticed is that your battery voltage drops pretty low during the matches. Have you checked the tightness of all the connections between the battery and PDP? A loose connection there (if you can move it it’s loose) can cause your voltage to fall and roboRIO to reset. That wouldn’t cause your code to not restart, but if your roboRIO never restarts then your code won’t have to either. I also see from your logs that you have a 6 CIM drive and your driver drives very aggressively. Are you doing any voltage ramping to avoid brownouts and blackouts?

Mark, what makes you think that it’s a CAN issue? Their CAN bus usage is only ~30%. The only way I can think of CAN making the code not start is if in Begin the robot is looking for CAN devices that don’t exist and it’s waiting to find them. If that were true, the code would never start up, not just after a big hit. If the wire were loose and only connecting sometimes, I would think they would see driving problems or at least some lag. I’m guessing you’re thinking something else that didn’t occur to me.

The reason CPU usage is 100% in auto (and before auto, while disabled) is that we were doing vision tracking. In teleop, the camera is not processing images, so the CPU is lower. I will take screenshots of all of my VIs so those of you that don’t have LabVIEW can see it. The part of the code that extends outside the project is the 3rd party stuff: the Talon SRX code and the NavX code. I’m not sure how to include it in the file so the project finds them.

About the 6 CIM aggressive driving… yeah that’s me :wink: . We are geared for 18.5 fps. We tried voltage ramping, but things happen so fast on this robot, that the slight lag that it caused made the robot near uncontrollable at high speed. When I accelerate, I ramp it myself by taking around a half second to push the joystick to 100%. During build season, we were having major issues with brownouts, but like you said, it was because of loose connections which we fixed. Even in an extended brownout (like 3 seconds) we never had the roborio restart. Everything is now as solid as a rock. If you look at the logs, I come close, but usually do not brown out.
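For anyone following along, the kind of software ramp being discussed can be sketched like this. It is Python rather than LabVIEW, and the names `ramp_step` and `max_step` plus the 20 ms loop period are assumptions for illustration; the half-second ramp time is borrowed from the manual ramp described above.

```python
def ramp_step(current, target, max_step):
    """Move current toward target by at most max_step per call."""
    delta = target - current
    if delta > max_step:
        delta = max_step        # limit how fast the output may rise
    elif delta < -max_step:
        delta = -max_step       # limit how fast the output may fall
    return current + delta

if __name__ == "__main__":
    loop_period_s = 0.02                  # assumed control loop period
    ramp_time_s = 0.5                     # time allowed to go from 0 to full
    max_step = loop_period_s / ramp_time_s
    output = 0.0
    for _ in range(30):                   # joystick slammed to full forward
        output = ramp_step(output, 1.0, max_step)
    print("output after 30 loops:", output)
```

A smaller `ramp_time_s` makes the robot more responsive at the cost of larger current spikes, which is the tradeoff described in the post.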

About the battery. Most of our batteries are new in 2017, and when I am driving I can actually feel the difference when we use a 2016 battery because of reduced performance/brownouts. I would have known about a dead cell as soon as I touched the joysticks.

I also don’t believe our roboRIO ever rebooted. It came back too fast (way less than 40 seconds). And when we reproduced the problem in the shop, the only thing we did was put CAN Talon code on the rio, and disconnect and reconnect to the radio. After around 15-20 seconds, comms and code would come back, but a split second later the code would go red. Without the CAN Talon code, and with the exact same setup, the code would always come back and stay back.

All of this makes me believe that, even if the can code isn’t the problem, it has something to do with it. So, here is my solution unless I can reproduce the problem again and fix it with can code in there:

-We ordered Sparks from REV robotics and will be replacing the 6 talon srx’s. If anyone has any experience with sparks on a drivetrain like ours, please do share.
-I am taking everything unnecessary out of the code. This includes vision tracking with the axis camera, and some feedback to the custom LabVIEW dashboard. Along with removing the talons, It will also get rid of the current feedback, and current limiting on the talons (although I could never really tell the difference between current limiting on the drivetrain and not).

Let me know whether you think these are good ideas, short of a full solution. If I can get it to happen again, I will systematically search for the cause, write down everything I do, and post it here for anyone with the same problem to see.

It’d be easier and quicker to switch the Talon SRX’s to PWM control and leave them in place rather than switch to a different motor controller.

Just wire one of the Talon CAN wire pairs to signal and ground of a PWM output (green=PWM ground & yellow=PWM signal), and leave the other CAN pair disconnected. The Talon auto senses which control method to use. See page 12 of the Talon SRX User’s Guide.pdf.

Wow. I did not know you could do that… In the code do I just initialize it like a normal PWM controller?

Yes, for instance in Begin the motor VIs have Talon SRX as one of the choices; that’s for PWM only.

Talon PWM.png

