PSA - Windows 10, Windows Defender Updates potentially causing disconnects or disabling robots

At our week 3 event (FNC - UNC Asheville District Event), we suffered disconnects in both of our Semi-final matches.

We conducted extensive root cause investigations over the past few days, and I wanted to share the most likely root cause we have identified so far so that other teams may benefit from this finding.

As part of troubleshooting, we checked the Driver Station log file and the Windows Event Viewer. The connection for the first semi-final match was lost at 3:07:12.183 pm, very shortly after transitioning from auton to teleop. In the Event Viewer, there was a Security-SPP event for “Offline downlevel migration succeeded” at 3:05:05 pm, another Security-SPP event for successfully scheduling the Software Protection service for re-start at 3:05:36 pm, and another Security-SPP event for “Offline downlevel migration succeeded” at 3:07:20 pm. It has been reported that the “Offline downlevel migration succeeded” Security-SPP event causes Windows 10 to freeze up for 10-15 seconds (What can cause a Windows PC to freeze up for 10-15 seconds? - Ars Technica OpenForum) and that Windows Defender is the culprit. I think the ‘successful’ log entry occurs at the end of the 10-15 second freeze window, so Windows froze up about 10 to 15 seconds before that event appears in the log. This coincides with the loss of connection in the first SF match.

The loss of connection for the second semi-final match happened at 3:26:13.140 pm, immediately after auton ended. In the Event Viewer, there was a SecurityCenter event for successfully updating the Windows Defender status to SECURITY_PRODUCT_STATE_ON at 3:24:55 pm. This event has also been shown to cause Windows 10 freezes (Windows 10 taskbar sometimes freezes for a few seconds - Windows 10 Support). The timing on this one is a little less obvious, as it occurred over a minute before our disconnect, but it would have occurred while the driver station was connected to the FMS prior to the start of the match.

Here are a couple of resources with some insight into FMS/networking/etc.:

A key excerpt:

The FMS server does not communicate directly with the robots. All communication is between the FMS server and the Driver Station, including states (Enabled, Disabled, Teleop, Auto, etc), Emergency Stop status, Match Time, and more. If the Driver Station loses connection with the FMS software, the watchdog function onboard the robot halts control and disables the robot.

This last sentence highlights that a 10-15 second Windows freeze on the laptop could be highly disruptive to this communication and could cause the robot to disable itself. This may explain why our robot DC’d over a minute after one of these freeze windows.
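If it helps anyone correlate these events on their own robot, here is a minimal sketch (assuming WPILib Java and the 2022+ static DriverStation API; this is an illustration, not our actual robot code) that writes a timestamped on-RIO log entry whenever the Driver Station connection drops or comes back, so a laptop freeze can be lined up against the Event Viewer times:

```java
import edu.wpi.first.wpilibj.DataLogManager;
import edu.wpi.first.wpilibj.DriverStation;
import edu.wpi.first.wpilibj.TimedRobot;

public class Robot extends TimedRobot {
  private boolean lastDsAttached = true;

  @Override
  public void robotInit() {
    // Start the on-roboRIO data log so the record survives a Driver Station freeze.
    DataLogManager.start();
    DriverStation.startDataLog(DataLogManager.getLog());
  }

  @Override
  public void robotPeriodic() {
    boolean dsAttached = DriverStation.isDSAttached();
    if (dsAttached != lastDsAttached) {
      // Timestamped message; compare against the Windows Event Viewer entries.
      DataLogManager.log("DS attached changed to " + dsAttached
          + ", FMS attached: " + DriverStation.isFMSAttached());
      lastDsAttached = dsAttached;
    }
  }
}
```

Because the log lives on the roboRIO, it keeps recording even while the laptop is frozen, which is exactly the window you cannot see from the Driver Station side.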

Since these Windows Defender updates were occurring throughout the day, they could have affected driver stations throughout the day. The curious thing is the trigger for the update. Did it come as a push or trigger while connected to the FMS? The FMS is connected live to FIRST HQ so that they can receive data and diagnose system problems. So, perhaps when we connected to the FMS prior to the match, Windows started making the updates. Or, it may have simply been a scheduled day/time as a result of an update earlier that day, weekend, or week. Sunday afternoon (precisely Sunday noon PST - Microsoft time) would seem to be a logical non-productive time to schedule the launch of an update if you are focused on minimizing impacts in an office-type environment.

We happened to notice that at least one other event last weekend also had robots disconnecting the same day. It might be that they were caused by this same issue.

We believe that this is a likely root cause for at least one if not both of our DC events.

Before our event this weekend, we are taking the following actions: in the Windows firewall, whitelist the Driver Station, dashboard, and anything else that needs access (Windows Firewall Configuration — FIRST Robotics Competition documentation), AND disable the firewall and Windows updates.

We wanted to share this info with the CD community. If we are correct that this was the root cause of our DCs, then hopefully this information can benefit other teams by helping to avoid this same problem.

15 Likes

A few questions:

Have you run the Windows 10 decrapifier?
Did you check for updates immediately prior to the event?
Did you disable/postpone updates during the event?

The VLANs used for driver stations do not have external internet access, so I would be surprised if this had anything to do with being connected to the FMS.

Maybe this trigger happens any time the network state changes on the laptop (regardless of whether the network has an external internet connection or not), but that’s as much as I can think of.

4 Likes

Thank you for this thread.

This could be another reason to purchase business-class laptops for a driver station. I think Windows 10 Pro allows you to disable automatic updates.

This issue can happen offline. Windows 10 caches the update while you are connected to the Internet and then attempts to install it later. Sometimes it warns you first, but sometimes it just does it, even when you are in airplane mode. When that happens, it can lock up your computer and run through multiple reboot/install cycles.

Also of note: Windows 10 updates periodically break OEM hardware drivers. We have most recently had it happen to HDMI ports, but my personal laptop has lost USB before too.

One last annoyance: most updates seem to re-enable the Windows firewall too, so that is another thing we have dealt with.

Our driver station is running Windows Home, but we force updates before an event and then block them until after.

2 Likes

The updates also re-enable real-time scanning, and Windows Defender ignores exclusion rules for directories.

I have many choice words for Microsoft and their mentality that they own (and control) your operating system even after you purchase it.

We’ve also been bitten by the update bug. That’s one of the reasons why we have two driver stations on hand.

2 Likes

Glad you figured this out. Thanks for making this PSA, we’ll definitely double check our Windows Defender settings and ask alliance partners to do the same.

1 Like

At the PCH District Columbus Event, our robot was dead at the start of tele-op in 4 of the 5 matches we played on the first day (auto ran smoothly, no issues or DCs). At the end of the first day/beginning of the second day, we spoke to all of the RIs, FTAs, FCAs, etc. about what could be causing this. We share some of the same findings as wgorgen (OP): 1. the firewall was on, 2. the Wi-Fi was left on on the drive station computer, and 3. the radio was connected with a CAT-6 cable (I don’t know the significance of this, but apparently we needed CAT-5e). Once we fixed those, we ran smoothly for the remainder of the matches over the weekend. We are all new to competition this year since the pandemic halted competitions for two years, so that could also have had an impact on our knowledge of good drive-station setup and procedures.

Wat? Was this a CSA or FTA that told you this?

5 Likes

You know – too many CATs. :joy_cat:

2 Likes

I thought Cat6 could handle all Cat5 traffic and then some. I ran Cat6 throughout my house when I rewired because it was the same price and I never wanted anyone to have to do that run again. I do not think there has been any problem with the cable (dropouts have happened, but we are pretty certain they are dog-induced (she likes to knock the router over), not the cat, I mean wiring. Sorry, I could not resist).

1 Like

Both, plus the LRI as well. I didn’t understand it either. It did fix the problem, though, so I’m not complaining…

1 Like

I’m certainly not a certified expert in networking or cable tech, but from what I do know, I don’t see any reasonable justification for believing CAT6 vs CAT5e would make a difference in any way. Heck, CAT5 (no gigabit support) is likely fine given the minimal traffic going over most robot networks.

I suspect it was a bad cable to begin with, and switching to the cat5 cable fixed the issue because it’s a new cable. Bad cables are everywhere, especially when used in a rough environment. I’ve been tempted to buy a high quality Ethernet tester and leave it in my bag for when I CSA. It’d rule out bad cables real quick.

5 Likes

CAT-6 cable is a bit different than CAT-5 in terms of the inner cable (it has an extra piece of plastic to keep the twisted pairs more separated), and in my experience it’s harder to crimp into “normal” RJ-45 connectors (I suspect the jacket or the conductor itself is slightly thicker). If you made your own cable, this may be the issue. If it’s a COTS pre-made cable… never mind. :slight_smile:

1 Like

“Disable” is used poorly there given it has other meanings in FRC.

It just means the motor controllers don’t get values. That way, a PC freeze while you’re going full speed ahead doesn’t leave you unable to stop going full speed ahead.
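For anyone unfamiliar with that fail-safe layer, here is a minimal sketch (assuming WPILib Java; the drivetrain objects and PWM ports are placeholders, not anyone’s actual code) of the motor safety watchdog that zeroes outputs when they stop being fed:

```java
import edu.wpi.first.wpilibj.TimedRobot;
import edu.wpi.first.wpilibj.drive.DifferentialDrive;
import edu.wpi.first.wpilibj.motorcontrol.PWMSparkMax;

public class Robot extends TimedRobot {
  // Hypothetical drivetrain; the PWM ports are placeholders.
  private final PWMSparkMax left = new PWMSparkMax(0);
  private final PWMSparkMax right = new PWMSparkMax(1);
  private final DifferentialDrive drive = new DifferentialDrive(left, right);

  @Override
  public void robotInit() {
    // DifferentialDrive has motor safety on by default; this just makes the
    // timeout explicit. If the watchdog is not fed for 0.1 s (for example,
    // because fresh driver inputs stop arriving), the outputs are zeroed
    // instead of holding the last commanded speed.
    drive.setSafetyEnabled(true);
    drive.setExpiration(0.1);
  }

  @Override
  public void teleopPeriodic() {
    // arcadeDrive() feeds the safety watchdog every loop it is called.
    drive.arcadeDrive(0.0, 0.0); // joystick inputs would go here
  }
}
```

On top of this user-code watchdog, the roboRIO itself disables all outputs when Driver Station packets stop arriving, which is the system-level behavior the quoted excerpt describes.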

Yes, a freeze for 10 to 15 seconds is going to be highly disruptive. It wouldn’t explain over a minute, though. A delay of around a minute occurs when your radio loses power; it wouldn’t take that long for the driver station to reconnect if the radio didn’t have to go through its entire boot.

I agree that the second event seems to not fit very well with the root cause.

The security event occurred about 45 seconds prior to the start of the match (while the driver station was hooked up to the FMS) during the pre-match team introductions.

The robot operated normally during autonomous period and performed the routine that we had selected. However, as soon as the op-mode switched to teleoperated, the robot was unresponsive to driver commands.

So, the question is whether a 10-15 second freeze in Windows starting about 45 seconds before the start of the match could have caused the non-responsiveness at the start of teleop.

I do not understand the communication between the driver station and the FMS during that time period. I do not know what the FMS does if the driver station is unresponsive for 10 or 15 seconds during that window. I do not know if the FMS will just shut down that connection with the driver station, if it will send a command to the robot to disable once it completes the tasks that have already been queued, etc.

It’s possible that the driver station completely recovered and was communicating normally again with the FMS, but that the freeze actually just disconnected the USB controllers. This is another theory (or a wrinkle to this theory) that has been suggested as a possible outcome of these updates. We were receiving the video feed from the Limelight during the entire match (auton and teleop). We could see things moving in the background, so we know the feed was live. That means the signal from the radio was getting to the FMS and then to our driver station.

We may never get a complete understanding of what caused the DCs in those two matches. Everything worked flawlessly this past weekend after we implemented these fixes. It might just be a coincidence that everything seems to be fixed. We’re hoping that we have licked this problem. We hope this PSA has helped some other teams to avoid a similar fate.

I want to point out the Driver Station Best Practices document that’s now in the WPILib docs: Driver Station Best Practices — FIRST Robotics Competition documentation (wpilib.org)

Nearly every single driver station I’ve looked at since the start of the competition season (maybe 70 at this point) has Windows Defender installed on it, so I’m not sure that you can point at that by itself as your root cause. The issues I’ve seen with freezes have been related to non-essential software installed on the DS laptop.

I would point to these two things to say there’s a contradiction. If you’re getting limelight feeds, you’re not disconnected and trying to troubleshoot this as a DC is going to lead you down unproductive paths.

This is the most interesting part to me. If I read ALL of what you’ve said, here’s what I’m looking at and why:

You weren’t disconnected. Your code was running fine. You were sending commands from driver station to the robot. But, the robot wasn’t reading those commands because the joysticks got into a bad state in the code.

I saw this with a couple teams this past weekend. There were two fixes they looked at. One was a brute force “Reboot Robot Code” during the match. This re-started their code and got their joysticks running correctly. Another team added a periodic check for joystick values in their code. If the value was NULL, they ran the initialization code. They never saw an unresponsive robot again after that.
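As an illustration of that second fix, here is a rough sketch (assuming WPILib Java; the port number and the idea of recreating the controller object are my own placeholders, not the code of the teams mentioned) of a periodic check that flags a dropped USB controller and re-runs setup when it comes back:

```java
import edu.wpi.first.wpilibj.DriverStation;
import edu.wpi.first.wpilibj.TimedRobot;
import edu.wpi.first.wpilibj.XboxController;

public class Robot extends TimedRobot {
  private static final int DRIVER_PORT = 0; // placeholder joystick port
  private XboxController driver = new XboxController(DRIVER_PORT);
  private boolean wasConnected = true;

  @Override
  public void robotPeriodic() {
    boolean connected = DriverStation.isJoystickConnected(DRIVER_PORT);
    if (!connected && wasConnected) {
      DriverStation.reportWarning("Driver controller lost on port " + DRIVER_PORT, false);
    } else if (connected && !wasConnected) {
      // Re-run the controller setup once the USB device comes back,
      // mirroring the "run the initialization code again" fix described above.
      driver = new XboxController(DRIVER_PORT);
      DriverStation.reportWarning("Driver controller reconnected; reinitialized", false);
    }
    wasConnected = connected;
  }
}
```

In Java the re-construction may not even be necessary (joystick data is read by port on each packet), but the logged warning alone makes an unplugged or frozen controller show up clearly in the Driver Station log.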

I’m less convinced this is the answer if you regained control of the robot during the match at some point. I’m also helping troubleshoot a bit blind given the nature of this thread.

Generally speaking, the FMS is funneling communication from the Driver Station to the robot and back. You should have the safety enabled on your motors in your code so it’s a product of your code shutting down the robot rather than the FMS doing it (and I wouldn’t expect the FMS to do it).

1 Like

As the drive coach for FRC 5190, the OP’s alliance partner during the event, I can provide some more insight.

The Driver Station’s robot communication and robot code lights were red, meaning it’s not just joysticks getting into a bad state. Furthermore, without communication to the robot, you can’t reset code either.

The FMS ping showed good communication to the robot radio and driver station, but the link between the radio and roboRIO was bad. The most plausible explanation to me is that the cable between the robot radio and roboRIO was damaged in some way.

1533 had their Limelight connected via the second radio port, probably with a good cable, which explains why they were able to see their Limelight feed in SmartDashboard. All other data in SmartDashboard (e.g. swerve module positions, etc.) appeared frozen which aligns with the theory that the roboRIO was disconnected from the radio.

The only suspicious thing to me was that the disconnects happened shortly after teleop began in both SF matches; however, there is nothing conclusive that would suggest that this wasn’t a coincidence.

Looking back, I am also not sure that Windows Defender was the cause of the issues, but regardless it is probably a good idea to disable it. 1533 competed in another event this weekend (after replacing the cable between the RIO and radio) and did not have any issues and won the event (congratulations!).

2 Likes

We assumed that it was the cable between the RIO and the Radio as well. However, we tested that cable using a professional tester and the cable was good. It also seems odd that a bad cable would not manifest itself during the preceding 1.5 days of competition matches, practice driving, etc. but would manifest itself during the time period where these Windows updates were occurring. I’m not saying that it could not have been a bad cable. But given our root cause investigation data, we were not able to prove that it was the cable so we kept looking for other issues.

We were also told that it could have been the SD card in the RIO that “wiggled” and caused the RIO to fault. We were not able to reproduce that issue. And again, it seemed odd that this would suddenly manifest itself during 2 time periods where the robot was not getting jostled around, but did not manifest during the many hard hits the robot experienced prior to that. Again, this does not completely rule out the SD card theory, but we could not definitively prove it, so we kept looking.

We also were concerned that there may have been some metal chips in either the radio ethernet receptacle or the RIO ethernet receptacle. Both are mounted on the robot with those receptacles facing downward, so it is difficult to imagine how a metal shaving could have gotten in there. We did do some work on the robot at various times throughout the event but care was taken to protect the electronics during that work. It’s possible that there was a metal shaving in there and that it subsequently fell out in the trailer on the trip back home from the event. We will never be able to prove one way or another whether there was a metal shaving in there.

I personally was concerned going into the Week 4 event that we had not found the cause of the issue and that it would re-appear at the worst possible time. Thankfully, it did not.

1 Like