Bizarre cRIO Server Errors (FDIO)

Short version: We just got these messages in our cRIO NetConsole after our robot mysteriously stopped accepting inputs. This is a recurring problem.

write error: : read error: : S_errno_EPIPE
S_errno_ETIMEDOUT
[NT] IOException message: Could not write all bytes to fd stream
[NT] IOException message: Error on FDIO read
[NT] 0x28e8d18 entered connection state: SERVER_ERROR
[NT] 0x28e8d18 entered connection state: SERVER_ERROR
[NT] Close: 0x28e8d18

Does anyone know why this is or what could cause it?


Long version:

So, at our week 0 event yesterday our robot started bricking in the middle of the field. It still had communications and code, but it just stopped accepting inputs and stopped driving. The only clue we had as to why this was happening was this message, every 100ms:

A timeout has been exceeded: RobotDrive... Output not updated often enough. ...in Check() in C:/WindRiver/workspace/WPILib/MotorSafetyHelper.cpp at line 117

We were able to reproduce the error today, and after waiting for a long time (about 8 minutes) we got this mysterious error output and everything started working again.

write error: : read error: : S_errno_EPIPE
S_errno_ETIMEDOUT
[NNTT]]  IIOOEExxcceeppttiioonn  mmeessssaaggee::  CEorurlodr  noont  FwDrIiOt er eaaldl
 [bNyTt]e s0 xt2o0 ef8dd 1s8t reenatme
r[eNdT ]c o0nxn2e8cet8ido1n8  setnatteer:e dS EcRoVnEnRe_cEtRiRoOnR
state: SERVER_ERROR
[NT] Close: 0x28e8d18

This was confusing, but I eventually figured out that it was two error messages outputting at the exact same instant, with every other letter belonging to a different message. When decoded, this was the result:

write error: : read error: : S_errno_EPIPE
S_errno_ETIMEDOUT
[NT] IOException message: Could not write all bytes to fd stream
[NT] IOException message: Error on FDIO read
[NT] 0x28e8d18 entered connection state: SERVER_ERROR
[NT] 0x28e8d18 entered connection state: SERVER_ERROR
[NT] Close: 0x28e8d18
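
The decoding described above can be sketched in a few lines of standard C++. The function name `Deinterleave` is made up for illustration. It only works on fragments where the two messages alternate strictly character by character; once the messages differ in length or one writer gets ahead (as happens later in the garbled output), the alignment shifts and the split has to be restarted by hand.

```cpp
#include <string>
#include <utility>

// Split a string into its even-indexed characters (first) and its
// odd-indexed characters (second). Hypothetical helper, not part of WPILib.
std::pair<std::string, std::string> Deinterleave(const std::string& mixed) {
    std::pair<std::string, std::string> out;
    for (std::string::size_type i = 0; i < mixed.size(); ++i) {
        (i % 2 == 0 ? out.first : out.second) += mixed[i];
    }
    return out;
}
```

Feeding it the fragment `CEorurlodr  noont` from the garbled log yields `Could not` and `Error on`, the openings of the two decoded IOException messages.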

What do these errors mean? What is FDIO? I have no idea where to begin with this error.

FDIO is File Descriptor Input/Output. A fd stream is a bunch of data being written to or read from a file, or a network socket, or some other I/O channel.

Are you using camera data in your robot program? The last time I heard of something like this, several years ago, one team said that it happened whenever their camera came unplugged.

We had the same problem at our scrimmage. Sad to say, but we didn’t figure out the problem when we were there.

Try reimaging. We’ve seen some really weird errors pop up that were fixed after a reimage.

The fact that the error message came out with every other letter belonging to a different error is really weird.

Could you be running into this problem? http://www.chiefdelphi.com/forums/showthread.php?t=126102

A team alumnus of ours referred us to that same bug report yesterday. We tried reproducing the error by temporarily disconnecting the robot, but weren’t able to trigger the issue. It seems to strike at random, usually when I’m not paying attention. We’re going to try cutting down on our NetworkTables usage and see if that helps.

Also, we’ll be switching to an 8-slot soon. I’m always somewhat optimistic that the 8-slot instantly solves all these mysterious problems. :wink:

I’m pretty confident that the problem you’re having is the problem I reported in the bug report. In particular, the fact that the robot came back after 8 minutes indicates that it was hanging while waiting for the network write to time out.

Reproducing the error is easy if you replicate the exact circumstances in which the bug occurs. Two things have to happen:

  • Your robot has to write to NetworkTables via PutNumber, etc., for it to freeze. If no write attempt occurs, no freeze occurs.
  • A NetworkTables client (like SmartDashboard) must be actively receiving data from the robot, and then you must disconnect the client without sending a reset back to the robot. You cannot close the program as that will cause a reset to occur. Instead, you must disconnect the wireless/ethernet cable. Sometimes after reconnecting the cable, the robot will start working again.

I’ve found that some wireless cards can induce this behavior more easily than others. For example, our driver station laptop causes this to happen very easily, whereas I find it difficult to randomly get this to happen on my work laptop. I theorize that wireless interference can cause the connection to drop randomly, but vxWorks doesn’t always pick up on it and it hangs.

The freeze is caused by the network connection's send buffer filling up on vxWorks. So, if you send data less often, you will be less likely to run into this problem – but the potential for a freeze is still there.
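
To make the send-buffer explanation concrete, here is a minimal sketch that can be run on a desktop POSIX system (not on the cRIO, and not using the vxWorks network stack). It assumes a local `socketpair` as a stand-in for the robot-to-dashboard connection, and the function name is made up. The peer never reads, so the kernel accepts only a finite number of bytes. The writer is set non-blocking so that a full buffer shows up as `EAGAIN`; a blocking writer would simply stall at that point, which is the kind of hang described above.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Writes to a connected stream socket whose peer never reads and returns
// how many bytes the kernel accepted before the send buffer was full.
// Returns -1 if the sockets could not be set up or a write failed.
long BytesAcceptedBeforeBufferFull() {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -1;

    // Non-blocking, so a full buffer yields EAGAIN instead of a hang.
    int flags = fcntl(fds[0], F_GETFL, 0);
    bool ok = flags >= 0 && fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) == 0;

    const long kSafetyCap = 64L * 1024L * 1024L;  // stop eventually no matter what
    char chunk[1024] = {0};
    long total = 0;
    while (ok && total < kSafetyCap) {
        ssize_t n = write(fds[0], chunk, sizeof(chunk));
        if (n > 0) {
            total += n;                 // kernel buffered this chunk
        } else if (n < 0 && errno == EINTR) {
            continue;                   // interrupted, try again
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            break;                      // buffer is full: a blocking write would stall here
        } else {
            ok = false;                 // unexpected error
        }
    }
    close(fds[0]);
    close(fds[1]);
    return ok ? total : -1;
}
```

The exact byte count depends on the operating system's buffer sizes. The point is only that it is finite, so a sender that keeps writing to a silent peer will eventually block.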

Thanks for the tips. Oddly, I was never able to reproduce the problem when I wanted to. I tried disconnecting the ethernet cable from our driver station for varying amounts of time but it seemed to have no effect. Then it would just be sitting there, perfectly tethered in, no one touching the robot or pc, and it would freeze.

I could try to make sure that a PutNumber is constantly trying to write, but I think our robot already does that - we have sensors actively writing to the dashboard whenever teleop is running.

The last few days we have been seeing problems again, seeing the RobotDrive timeouts. They typically (and somewhat consistently) begin about a minute into the robot being enabled in teleop (regardless of whether auto ran), and we are required to reboot the cRio because it is completely unresponsive. The SmartDashboard doesn’t get any updates from switch states and the robot does not respond to any inputs. NetConsole shows a new RobotDrive safety timeout (the same as the one posted in the first post) roughly every 80ms. This makes sense for an output safety timeout, and suggests that some threads are still in operation while some other ones, particularly the ones that update outputs and the SmartDashboard, are hanging. We have not been seeing any more of the FDIO errors, only RobotDrive timeouts.
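
For readers unfamiliar with the safety timeout, the general watchdog pattern behind a message like "Output not updated often enough" can be sketched as below. This is a simplified illustration, not the WPILib MotorSafetyHelper source; the class name, method names, and the use of plain seconds as timestamps are all assumptions.

```cpp
// Simplified output watchdog: the drive code "feeds" it every time it
// updates the motors, and a separate periodic check reports a timeout
// when the last feed is older than the expiration. Hypothetical sketch.
class SafetyWatchdog {
public:
    explicit SafetyWatchdog(double expiration_s)
        : expiration_s_(expiration_s), deadline_s_(0.0) {}

    // Called whenever the motor output is updated.
    void Feed(double now_s) { deadline_s_ = now_s + expiration_s_; }

    // Called periodically from another task; true means the output is stale.
    bool Expired(double now_s) const { return now_s > deadline_s_; }

private:
    double expiration_s_;  // allowed gap between updates, in seconds
    double deadline_s_;    // time after which the output counts as stale
};
```

If the loop that calls `Feed` is stuck in a blocking network write, the periodic check keeps running and keeps reporting the timeout, which would match seeing the message repeat while the robot ignores inputs.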

We have had other things to test and can still do so by rebooting the cRio, but we haven’t a great idea where to start debugging this problem. We are in the process of rewriting our most basic functionality in other frameworks (both LabVIEW and a C++ IterativeRobot Project) to get away from the Command Based model. It seems likely that our set of commands (all of which seem pretty standard) might be finding some edge case that causes the hang. Does anyone have any more insight into this?

If it is NetworkTables writes, as has been suggested, why do these time out? Do we really need to be guaranteeing packet delivery on our SmartDashboard updates? It seems something like UDP would be fine in this case, and would avoid timeout issues.
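
As a rough illustration of the UDP idea, here is a minimal fire-and-forget sender using POSIX sockets. It is a desktop sketch under stated assumptions: the function name, the payload format, and any address or port passed in are placeholders, and the socket calls available on the cRIO may differ in detail. A datagram is handed to the network with no connection and no delivery guarantee, so there is no peer acknowledgement for the sender to wait on.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstring>
#include <string>

// Sends one UDP datagram to host_ip:port. Returns true if the whole payload
// was handed to the kernel. Delivery to the receiver is NOT guaranteed.
bool SendTelemetry(const std::string& host_ip, unsigned short port,
                   const std::string& payload) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) return false;

    sockaddr_in dest;
    std::memset(&dest, 0, sizeof(dest));
    dest.sin_family = AF_INET;
    dest.sin_port = htons(port);
    if (inet_pton(AF_INET, host_ip.c_str(), &dest.sin_addr) != 1) {
        close(fd);
        return false;  // not a valid dotted-quad IPv4 address
    }

    ssize_t sent = sendto(fd, payload.data(), payload.size(), 0,
                          reinterpret_cast<const sockaddr*>(&dest), sizeof(dest));
    close(fd);
    return sent == static_cast<ssize_t>(payload.size());
}
```

The trade-off is that lost or reordered updates are silently dropped, which is usually acceptable for dashboard values that are resent every loop anyway.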

So I actually read all of that other thread now… it sounds like we are probably seeing the same bug with Network Tables. It looks like the bug is being worked through and a patch is in the works. In the meantime, I think we will also try establishing a separate thread dedicated to the SmartDashboard updates, so that the output threads can still keep up.
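
One way to structure the separate-thread idea is a small bounded queue between the control loop and a worker that does the (possibly blocking) write. The sketch below uses standard C++11 threads, which the cRIO toolchain may not support, so treat it as a desktop illustration of the pattern; on the robot the worker would have to be built on whatever task facility WPILib provides. The class name and the `Sink` callback are hypothetical; in real code the sink would be the call that publishes to the dashboard.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>

// Hands dashboard updates to a worker thread so the control loop never
// waits on network I/O. Updates are dropped when the queue is full.
class TelemetryPublisher {
public:
    typedef std::function<void(const std::string&, double)> Sink;

    TelemetryPublisher(Sink sink, std::size_t capacity)
        : sink_(std::move(sink)), capacity_(capacity), stopping_(false),
          worker_(&TelemetryPublisher::Run, this) {}

    // Drains what is already queued, then stops the worker.
    // Note: this waits forever if the sink itself never returns.
    ~TelemetryPublisher() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    // Called from the control loop. Only takes a short lock; returns false
    // if the update was dropped because the queue is full.
    bool Offer(const std::string& key, double value) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.size() >= capacity_) return false;
        queue_.push_back(std::make_pair(key, value));
        cv_.notify_one();
        return true;
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            cv_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
            if (queue_.empty()) return;  // only reachable once stopping_ is set
            std::pair<std::string, double> item = queue_.front();
            queue_.pop_front();
            lock.unlock();
            sink_(item.first, item.second);  // may block; the lock is not held
            lock.lock();
        }
    }

    Sink sink_;
    std::size_t capacity_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<std::pair<std::string, double> > queue_;
    bool stopping_;
    std::thread worker_;  // declared last so it starts after the other members
};
```

With this shape, a stalled connection only stalls the worker: the queue fills, `Offer` starts returning false, and the loop that drives the motors keeps running.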

There is a complete working patch that should solve the errors you’re seeing. It’s attached to the bug report. If you’re having this much trouble with it, I’d recommend either (a) dropping NetworkTables completely and rolling your own UDP solution or (b) patching and compiling your own WPILib.

If you want, I can compile the WPILib binary tomorrow morning and send it your way.

I’m fine installing the patch as it is now, but I’m not sure what the process for that is. Would I have to download the source and compile it myself for that to work? If so, I would appreciate if you compiled the binary for us…

Steps:

I’d also recommend changing the constant at RobotBase.cpp line 22 to indicate that you’ve changed the code. It currently reads “C++ 2014 Update 0”. You can view this string in your driver station diagnostics, so if you change it then you can know you’re running your version of WPILib.

I’d give better instructions, but I don’t have my Windows computer home with me.

Note that the full patch is attached to the issue, and the patch in the gist link has been deleted since it’s incomplete.

Sorry for the delay, today was busier than expected. I’ve posted a compiled binary to http://firstforge.wpi.edu/sf/go/artf1719 . Unfortunately, I haven’t been able to verify it on a cRio as I don’t have access to one at the moment. However, the original WPILib binary is 13.0 MB, and this one is 13.1 MB, so I expect that it should work.

Let me know if this helps your issue!

We applied this patch, and rebuilt and redeployed our code. It appeared to improve the problem at first in that it gave the same FDIO errors as we saw previously, about as frequently as we had been getting the timeout errors, and we did not appear to be hanging. However (and we don’t really understand what happened here), after a very short time, it started timing out immediately when enabled, giving the same error at line 117 of MotorSafetyHelper.cpp. One interesting thing to note is that when disabled all these errors stopped, so it seems to still be making it through the control loop in IterativeRobot.cpp (StartCompetition()). We have most of these error messages logged, so as soon as someone has access to those text files we will post those.

We have ported all of our code to Java and LabVIEW, and neither of these implementations seems to show the same problem, though we hope to run the bot into the ground a little more over the next few days to tease any more errors out of it. Some C++ code that doesn’t use Commands or the SmartDashboard/NetworkTables is in the works, so testing that may also give some insight. We will probably stick with the Java implementation for the rest of the season.

Try this one https://www.dropbox.com/s/dbpv32qii7i7jet/JamesDustinFix.zip
I’ve spent quite a bit of time testing this, and it’s an interesting twist of fate that Dustin’s patch and mine complement each other for best results while the cRIO is booting. In a one-client stress test it works perfectly; in a three-client stress test it did show one error but recovered just fine. This patch splits out a new critical section that restricts the main robot’s get/put calls to a localized scope that is easier to manage. Basically, I wanted to minimize the work so the robot can just put the variables and get out without interfering with the I/O critical section operations. There is also a sleepless thread fix in there as well. This is still unofficial, but I think it is good enough to at least get some testing.
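
The general idea of keeping get/put inside its own small critical section, separate from the I/O, can be illustrated as follows. This is only a guess at the pattern, written in standard C++11; it is not the contents of the patch, and the class and method names are invented for the example.

```cpp
#include <map>
#include <mutex>
#include <string>

// Table whose Put/Get hold a lock just long enough to touch the data.
// The I/O side takes the pending changes out under the same short lock
// and does its (possibly slow) sending with no lock held.
class TableStore {
public:
    void Put(const std::string& key, double value) {
        std::lock_guard<std::mutex> lock(data_mutex_);
        values_[key] = value;
        pending_[key] = value;  // remember that this key needs sending
    }

    bool Get(const std::string& key, double* out) const {
        std::lock_guard<std::mutex> lock(data_mutex_);
        std::map<std::string, double>::const_iterator it = values_.find(key);
        if (it == values_.end()) return false;
        *out = it->second;
        return true;
    }

    // Called by the I/O thread: grab and clear the pending changes quickly,
    // then transmit the returned map outside the lock.
    std::map<std::string, double> TakePending() {
        std::map<std::string, double> taken;
        std::lock_guard<std::mutex> lock(data_mutex_);
        taken.swap(pending_);
        return taken;
    }

private:
    mutable std::mutex data_mutex_;
    std::map<std::string, double> values_;
    std::map<std::string, double> pending_;
};
```

Because the robot code only ever contends for the short data lock, a slow or stalled network write cannot hold up a put or get.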