Chief Delphi


bvisness 16-02-2014 17:05

Bizarre cRIO Server Errors (FDIO)
 
Short version: We just got these messages in our cRIO NetConsole after our robot mysteriously stopped accepting inputs. This is a recurring problem.

Code:

write error: : read error: : S_errno_EPIPE
S_errno_ETIMEDOUT
[NT] IOException message: Could not write all bytes to fd stream
[NT] IOException message: Error on FDIO read
[NT] 0x28e8d18 entered connection state: SERVER_ERROR
[NT] 0x28e8d18 entered connection state: SERVER_ERROR
[NT] Close: 0x28e8d18

Does anyone know why this is or what could cause it?

---------------------------------------------
Long version:

So, at our week 0 event yesterday our robot started bricking in the middle of the field. Still had communications and code, but it just stopped accepting inputs and stopped driving. The only clue we had to why this was happening was this message, every 100ms:

Code:

A timeout has been exceeded: RobotDrive... Output not updated often enough. ...in Check() in C:/WindRiver/workspace/WPILib/MotorSafetyHelper.cpp at line 117

We were able to reproduce the error today, and after waiting for a long time (about 8 minutes) we got this mysterious error output and everything started working again.

Code:

write error: : read error: : S_errno_EPIPE
S_errno_ETIMEDOUT
[[NNTT]]  IIOOEExxcceeppttiioonn  mmeessssaaggee::  CEorurlodr  noont  FwDrIiOt er eaaldl
 [[bNyTt]e s0 xt2o0 ef8dd 1s8t reenatme
r[eNdT ]c o0nxn2e8cet8ido1n8  setnatteer:e dS EcRoVnEnRe_cEtRiRoOnR
state: SERVER_ERROR
[NT] Close: 0x28e8d18

This was confusing, but I eventually figured out that it was two error messages outputting at the exact same instant, with every other letter belonging to a different message. When decoded, this was the result:

Code:

write error: : read error: : S_errno_EPIPE
S_errno_ETIMEDOUT
[NT] IOException message: Could not write all bytes to fd stream
[NT] IOException message: Error on FDIO read
[NT] 0x28e8d18 entered connection state: SERVER_ERROR
[NT] 0x28e8d18 entered connection state: SERVER_ERROR
[NT] Close: 0x28e8d18
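
The decoding step described above can be sketched in C++. This is only an illustration: it assumes the two messages were interleaved strictly one character at a time, which holds only while both messages are still printing (once the shorter one ends, the pattern shifts, as the raw log shows). `deinterleave` is a hypothetical helper name, not part of WPILib.

```cpp
#include <string>
#include <utility>

// Split a string whose characters alternate between two sources.
// Characters at even positions go to the first result, odd positions
// to the second.
std::pair<std::string, std::string> deinterleave(const std::string& mixed) {
    std::pair<std::string, std::string> out;
    for (std::string::size_type i = 0; i < mixed.size(); ++i) {
        (i % 2 == 0 ? out.first : out.second) += mixed[i];
    }
    return out;
}
```

Applied to the fragment `CEorurlodr` from the raw log, this yields `Could` and `Error`, the first differing words of the two messages.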

What do these errors mean? What is FDIO? I have no idea where to begin with this error.

Alan Anderson 16-02-2014 19:57

Re: Bizarre cRIO Server Errors (FDIO)
 
FDIO is File Descriptor Input/Output. An fd stream is a stream of data being written to or read from a file, a network socket, or some other I/O channel.

Are you using camera data in your robot program? The last time I heard of something like this, several years ago, one team said that it happened whenever their camera came unplugged.

rwkling1 16-02-2014 20:11

Re: Bizarre cRIO Server Errors (FDIO)
 
We had the same problem at our scrimmage. Sad to say, but we didn't figure out the problem when we were there.

magnets 16-02-2014 20:18

Re: Bizarre cRIO Server Errors (FDIO)
 
Try reimaging. We've seen some really weird errors pop up that were fixed after a reimage.

The fact that the error message came out with every other letter belonging to a different error is really weird.

Joe Ross 17-02-2014 02:34

Re: Bizarre cRIO Server Errors (FDIO)
 
Could you be running into this problem? http://www.chiefdelphi.com/forums/sh...d.php?t=126102

bvisness 17-02-2014 11:23

Re: Bizarre cRIO Server Errors (FDIO)
 
Quote:

Originally Posted by Joe Ross (Post 1344330)
Could you be running into this problem? http://www.chiefdelphi.com/forums/sh...d.php?t=126102

A team alumnus of ours referred us to that same bug report yesterday. We tried reproducing the error by temporarily disconnecting the robot, but weren't able to trigger the issue. It seems to strike at random, usually when I'm not paying attention. We're going to try cutting down on our NetworkTables usage and see if that helps.

Also, we'll be switching to an 8-slot soon. I'm always somewhat optimistic that the 8-slot instantly solves all these mysterious problems. ;)

virtuald 23-02-2014 16:26

Re: Bizarre cRIO Server Errors (FDIO)
 
Quote:

Originally Posted by bvisness (Post 1344452)
A team alumnus of ours referred us to that same bug report yesterday. We tried reproducing the error by temporarily disconnecting the robot, but weren't able to trigger the issue. It seems to strike at random, usually when I'm not paying attention. We're going to try cutting down on our NetworkTables usage and see if that helps.

I'm pretty confident that the problem you're having is the problem I reported in the bug report. In particular, the fact that the robot came back after 8 minutes indicates that it was hanging while waiting for the network write to time out.

Reproducing the error is easy if you replicate the exact circumstances in which the bug occurs. Two things have to happen:
  • Your robot has to write to NetworkTables via PutNumber, etc. for it to freeze. If no write attempt occurs, no freeze occurs.
  • A NetworkTables client (like SmartDashboard) must be actively receiving data from the robot, and then you must disconnect the client *without* sending a reset back to the robot. You cannot close the program as that will cause a reset to occur. Instead, you must disconnect the wireless/ethernet cable. Sometimes after reconnecting the cable, the robot will start working again.

I've found that some wireless cards can induce this behavior more easily than others. For example, our driver station laptop causes this to happen very easily, whereas I find it difficult to get this to happen on my work laptop. I theorize that wireless interference can cause the connection to drop randomly, but VxWorks doesn't always detect the drop, so the write hangs.

The freeze is caused by the network connection's send buffer filling up on VxWorks. So, if you send data less often, you will be less likely to run into this problem -- but the potential for a freeze is still there.
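
The send-buffer behavior can be observed on an ordinary POSIX desktop system with a small sketch. This is not WPILib or VxWorks code: it uses a local Unix-domain socket pair as a stand-in for the TCP connection to the dashboard, and `bytesUntilSendBufferFull` is a hypothetical name. The peer never reads, so the kernel buffer eventually fills. The socket is set non-blocking here so that the full buffer shows up as an error return; with a blocking socket, the same `write` call would simply not return until space became available or the connection timed out.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Write to a connected stream socket whose peer never reads, and return how
// many bytes were accepted before the send buffer filled (or -1 on setup
// failure).
long bytesUntilSendBufferFull() {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return -1;

    // Non-blocking mode turns "buffer full" into EAGAIN/EWOULDBLOCK
    // instead of an indefinite wait.
    int flags = fcntl(fds[0], F_GETFL, 0);
    fcntl(fds[0], F_SETFL, flags | O_NONBLOCK);

    char chunk[1024] = {0};
    long total = 0;
    for (;;) {
        ssize_t n = write(fds[0], chunk, sizeof chunk);
        if (n > 0) { total += n; continue; }
        if (n < 0 && errno == EINTR) continue;  // interrupted, retry
        break;  // buffer is full (or another error occurred)
    }
    close(fds[0]);
    close(fds[1]);
    return total;
}
```

The exact number of bytes depends on the operating system's buffer sizes; the point is only that it is finite, so a writer that keeps sending to a silent peer will eventually stall.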

bvisness 23-02-2014 17:06

Re: Bizarre cRIO Server Errors (FDIO)
 
Thanks for the tips. Oddly, I was never able to reproduce the problem when I wanted to. I tried disconnecting the ethernet cable from our driver station for varying amounts of time but it seemed to have no effect. Then it would just be sitting there, perfectly tethered in, no one touching the robot or pc, and it would freeze.

I could try to make sure that a PutNumber is constantly trying to write, but I think our robot already does that - we have sensors actively writing to the dashboard whenever teleop is running.

Aren Siekmeier 27-02-2014 23:11

Re: Bizarre cRIO Server Errors (FDIO)
 
Over the last few days we have been seeing problems again, specifically the RobotDrive timeouts. They typically (and somewhat consistently) begin about a minute into the robot being enabled in teleop (regardless of whether auto ran), and we are required to reboot the cRIO because it is completely unresponsive. The SmartDashboard doesn't get any updates from switch states and the robot does not respond to any inputs. NetConsole shows a new RobotDrive safety timeout (the same as the one posted in the first post) roughly every 80ms. This makes sense for an output safety timeout, and suggests that some threads are still in operation while others, particularly the ones that update outputs and the SmartDashboard, are hanging. We have not been seeing any more of the FDIO errors, only RobotDrive timeouts.
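
The reasoning about the safety timeout can be illustrated with a minimal deadline watchdog. This is a sketch of the general technique, not the actual `MotorSafetyHelper` implementation; the class name, method names, and timeout value are all hypothetical. One task feeds the watchdog whenever it updates the outputs, and a separate task checks it periodically. If the feeding task hangs, the checking task keeps running and keeps reporting the timeout.

```cpp
// Minimal deadline watchdog: outputs are considered stale once more than
// `timeoutSeconds` pass without a call to feed().
class OutputWatchdog {
public:
    explicit OutputWatchdog(double timeoutSeconds)
        : timeout_(timeoutSeconds), lastFeed_(0.0) {}

    // Called by the control loop each time it updates the motor outputs.
    void feed(double nowSeconds) { lastFeed_ = nowSeconds; }

    // Called periodically from a separate task; true means the outputs
    // have not been updated recently enough and should be stopped.
    bool expired(double nowSeconds) const {
        return nowSeconds - lastFeed_ > timeout_;
    }

private:
    double timeout_;
    double lastFeed_;
};
```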

We have had other things to test and can still do so by rebooting the cRIO, but we don't have a good idea of where to start debugging this problem. We are in the process of rewriting our most basic functionality in other frameworks (both LabVIEW and a C++ IterativeRobot Project) to get away from the Command Based model. It seems likely that our set of commands (all of which seem pretty standard) might be finding some edge case that causes the hang. Does anyone have any more insight into this?

If the cause is NetworkTables writes, as has been suggested, why do they time out? Do we really need to be guaranteeing packet delivery on our SmartDashboard updates? It seems something like UDP would be fine in this case, and would avoid timeout issues.
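
A fire-and-forget UDP sender along those lines might look like the following sketch. It uses plain POSIX sockets on a desktop system rather than anything from WPILib, and the `key=value` text format and the names `sendTelemetry` and `loopbackDemo` are assumptions made up for illustration. Each datagram is sent with `MSG_DONTWAIT`; if it cannot be queued, it is dropped and the caller moves on.

```cpp
#include <arpa/inet.h>
#include <cstdio>
#include <netinet/in.h>
#include <string>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

// Send one "key=value" datagram without waiting. Returns false if the
// datagram could not be queued; the caller simply drops it.
bool sendTelemetry(int sock, const sockaddr_in& dest,
                   const std::string& key, double value) {
    char buf[128];
    int len = std::snprintf(buf, sizeof buf, "%s=%.3f", key.c_str(), value);
    if (len < 0 || len >= static_cast<int>(sizeof buf)) return false;
    ssize_t sent = sendto(sock, buf, static_cast<size_t>(len), MSG_DONTWAIT,
                          reinterpret_cast<const sockaddr*>(&dest), sizeof dest);
    return sent == len;
}

// Loopback demonstration: bind a receiver to an ephemeral port on 127.0.0.1,
// send one datagram to it, and return what arrived ("" on failure).
std::string loopbackDemo(const std::string& key, double value) {
    int rx = socket(AF_INET, SOCK_DGRAM, 0);
    int tx = socket(AF_INET, SOCK_DGRAM, 0);
    std::string received;
    if (rx >= 0 && tx >= 0) {
        sockaddr_in addr = sockaddr_in();
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = 0;  // let the OS pick a free port
        socklen_t addrLen = sizeof addr;
        timeval tv;
        tv.tv_sec = 1;  // do not wait forever if the datagram is lost
        tv.tv_usec = 0;
        setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);
        if (bind(rx, reinterpret_cast<sockaddr*>(&addr), sizeof addr) == 0 &&
            getsockname(rx, reinterpret_cast<sockaddr*>(&addr), &addrLen) == 0 &&
            sendTelemetry(tx, addr, key, value)) {
            char buf[128];
            ssize_t n = recv(rx, buf, sizeof buf, 0);
            if (n > 0) received.assign(buf, static_cast<size_t>(n));
        }
    }
    if (rx >= 0) close(rx);
    if (tx >= 0) close(tx);
    return received;
}
```

The trade-off is that lost or reordered datagrams are never retransmitted, which is acceptable for values that are refreshed continuously but not for one-time settings.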

Aren Siekmeier 27-02-2014 23:53

Re: Bizarre cRIO Server Errors (FDIO)
 
So I actually read all of that other thread now... it sounds like we are probably seeing the same bug with Network Tables. It looks like the bug is being worked through and a patch is in the works. In the meantime, I think we will also try establishing a separate thread dedicated to the SmartDashboard updates, so that the output threads can still keep up.
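
One way to structure such a dedicated update thread is sketched below. It is written in standard C++11, which may not be available with the cRIO toolchain, so treat it as a design illustration rather than drop-in code; `TelemetryPublisher` and its members are hypothetical names. The control loop only stores the latest value under a short lock, and a worker thread makes the potentially blocking call (represented here by a caller-supplied `sink`, which is where a dashboard write would go).

```cpp
#include <condition_variable>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <thread>

class TelemetryPublisher {
public:
    typedef std::function<void(const std::string&, double)> Sink;

    explicit TelemetryPublisher(Sink sink)
        : sink_(sink), stop_(false), worker_(&TelemetryPublisher::run, this) {}

    // Flushes any pending values, then stops the worker.
    ~TelemetryPublisher() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    // Called from the control loop: records the latest value and returns
    // without performing any I/O.
    void put(const std::string& key, double value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_[key] = value;
        }
        cv_.notify_one();
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            cv_.wait(lock, [this] { return stop_ || !pending_.empty(); });
            if (pending_.empty()) return;  // only reachable once stop_ is set
            std::map<std::string, double> batch;
            batch.swap(pending_);
            lock.unlock();  // never hold the lock during the slow call
            for (const auto& kv : batch) sink_(kv.first, kv.second);
            lock.lock();
        }
    }

    Sink sink_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::map<std::string, double> pending_;
    bool stop_;
    std::thread worker_;  // declared last so it starts after the other members
};
```

If the sink stalls, `put` keeps overwriting the pending values and the control loop is unaffected. Note that the destructor still waits for the worker, so a sink that never returns would block shutdown.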

virtuald 28-02-2014 00:25

Re: Bizarre cRIO Server Errors (FDIO)
 
Quote:

Originally Posted by compwiztobe (Post 1350781)
It looks like the bug is being worked through and a patch is in the works.

There is a complete working patch that should solve the errors you're seeing. It's attached to the bug report. If you're having this much trouble with it, I'd recommend either (a) dropping NetworkTables completely and rolling your own UDP solution or (b) patching and compiling your own WPILib.

If you want, I can compile the WPILib binary tomorrow morning and send it your way.

bvisness 28-02-2014 00:28

Re: Bizarre cRIO Server Errors (FDIO)
 
I'm fine installing the patch as it is now, but I'm not sure what the process for that is. Would I have to download the source and compile it myself for that to work? If so, I would appreciate it if you compiled the binary for us...

virtuald 28-02-2014 00:38

Re: Bizarre cRIO Server Errors (FDIO)
 
Steps:

I'd also recommend changing the constant at RobotBase.cpp line 22 to indicate that you've changed the code. It currently reads "C++ 2014 Update 0". You can view this string in your driver station diagnostics, so if you change it then you can know you're running your version of WPILib.

I'd give better instructions, but I don't have my Windows computer home with me.

virtuald 28-02-2014 00:42

Re: Bizarre cRIO Server Errors (FDIO)
 
Note that the full patch is attached to the issue, and the patch in the gist link has been deleted since it's incomplete.

virtuald 28-02-2014 23:32

Re: Bizarre cRIO Server Errors (FDIO)
 
Sorry for the delay, today was busier than expected. I've posted a compiled binary to http://firstforge.wpi.edu/sf/go/artf1719 . Unfortunately, I haven't been able to verify it on a cRIO as I don't have access to one at the moment. However, the original WPILib binary is 13.0 MB, and this one is 13.1 MB, so I expect that it should work.

Let me know if this helps your issue!


All times are GMT -5.