Serious bug identified in SmartDashboard/NetworkTables -- robot hangs

A bug report has been filed on the WPILib tracker here: http://firstforge.wpi.edu/sf/go/artf1719

The TL;DR is basically this: if you are connected via SmartDashboard, and you disconnect the connected computer’s wireless while the robot is writing a value to the SmartDashboard, the robot may hang until the write times out… which can be a few minutes.

I’m working on identifying a good fix, but I fear the best way to fix it is to use non-blocking I/O… which would be a rather large rewrite.

I think we’ve probably seen this. In the pits, we often have the driver station connected to the robot, with a LabVIEW dashboard using SmartDashboard/NetworkTables. When there is a programming change, the programmers plug in a second computer and run SmartDashboard. As we heavily use the preferences class and other uploaded files, we aren’t often rebooting the robot. Sometimes when we disconnect the programming computer ethernet cable, we see a hang in NetworkTables on the driver station.

We use Java on the robot, but I assume the implementation is similar to C++.

Good to know. Does your robot stop responding also, or is it just SmartDashboard that stops?

In the last few years, we’ve definitely had a robot exhibit odd behaviors where it isn’t responding to controls, but we’ve never directly been able to associate it with NetworkTables until now. We’ve always heavily used NetworkTables, particularly last year.

It was just SmartDashboard that stopped, except that we needed data from SmartDashboard to function properly. We’ve been trying to remove those constraints from our code this year, wherever possible.

Interesting. I’ve been moving more towards putting robot control in the SmartDashboard (well, a custom UI using NetworkTables), because there’s a lot you can do with a touchscreen – in particular, implementing toggle buttons using a UI is much easier than wiring up toggle buttons to attach to the DS I/O. Once the bugs in NetworkTables get ironed out, it should be a pretty good solution, and worked pretty well for us last year despite the bugs.

I’ve had something similar happen at competition before. Very often, our team does a test of all the robot functions while it’s on the cart before we put it on the field. One time, I disconnected the robot as it was sending our shooter rpm to the smartdashboard, and didn’t restart the robot before the match. The robot connected, but no smartdashboard happened, and somehow, all the indicators and the auto mode chooser disappeared, which was a problem for choosing our auto mode.

We punted the network tables stuff early in the season and went with straight up UDP sockets. There are a lot of misconceptions in the WPILib code regarding network communications. E.g., if you have a source of UDP packets talking to the cRio, you need to have a consumer running in your teleop disabled code to toss the UDP data away or the bot will hang.

This is because the implementation on WPILib tries to buffer all network traffic and deliver it regardless of whether it should or not. UDP traffic without a listener should just be tossed on the floor according to the specification. But, that’s not what WPILib does. In fact, WPILib apparently keeps allocating RAM for the network comms until the bot runs out of memory. Thanks goodness this isn’t a safety critical application.

So, if it’s possible for you, drop back to good old UDP sockets (not TCP as they require a connection be maintained). Just remember to create a thread on the cRio to run and read/throw away the packets if you’re not in an operational mode).

HTH,

Mike

I’m fairly certain that is an issue with the VxWorks socket implementation and not caused by WPILib. WPILib does not contain a UDP socket class.

Hmm… Perhaps, but I’ve been using VxWorks for 25 years and never had a problem with UDP traffic before. Of course, I wouldn’t rule out that something that NI added has changed the network implementation. It’s a moot point at this juncture as next year’s control system is embedded Linux with the PREEMPT_RT patch in place. It will be a completely different beast.

Mike

Well, the first problem is that NetworkTables uses TCP communications, not UDP communications. The second problem is regardless of whether its model of how network communications should work is good or not, the C++ implementation of NetworkTables is a truly horrible and complex piece of code – it’s clear that it was the author’s first experience with writing C++ code (and he admitted this himself when I ran into him at WPI Battle Cry last year). The library needs a rewrite.

However, despite all those problems, I do really like the idea of being able to use SmartDashboard, and I really like the simple API that is exposed on the robot. I’m loathe to reimplement SmartDashboard and NetworkTables itself, and my hope is that they’ll fix up the implementations for next year – so until then, I’ll keep patching it for the python interpreter :slight_smile:

PS: In case you’re interested, I found another obscure bug in NetworkTables tonight, that causes buffer overflows on my linux box. If you’ve ever wondered why you see gibberish in Netconsole when a NetworkTables client disconnects, I found out why.

For those interested, I’ve posted an updated patch to the bug report. Without completely rewriting NetworkTables, I think the best solution (in addition to the previous fix) is to make the sockets non-blocking, and use select with a 1-second timeout on writes.

My thought is that anything that blocks a write for more than a second is going to be useless anyways, and NetworkTables has provisions for reconnecting when the connection dies. Better than hanging permanently.

If anyone has feedback on the patch, I’d welcome it.

Our team successfully used the first part of the patch without issues in a week zero event, but we don’t have competition until mid-March, so I won’t have any hard testing of the patch until then. However, I’ve tested it extensively on Linux/Windows, and on a cRio-II that was disconnected from actual robot hardware.

No way it is a problem with the VxWorks network stack - the stack used by Wind River is a derivative of BSD 4.4, as was Linux. So VxWorks is using some of the most heavily exercised/hardened code on the planet. I’ve been using it for 20+ years without any issues.

See http://digital.ni.com/public.nsf/allkb/45FEF8E4D56D39AF862579FE0053DB93 for an explanation of why it is a VxWorks network issue.

Hmm… OK, I can understand this in light of the way cluster_buf management is handled in VxWorks. The MUX layer has to allocate space for inbound packets if the port is open. It assumes that sooner or later you’re going to have to read the data. I’m surprised that it affects UDP traffic through. Nonetheless, the behavior should be to drop the packets rather than infinitely queuing them. Maybe this behavior changed when WRS switched from the BSD stack to the Interpeak stack so they never when back to fix it in the release that we use for FIRST. I’ve been pretty disappointed in WRS’s involvment in FIRST over the past couple of years. It’s too bad they’re not really supporting the community any longer.

In any case, our approach of putting UDP read code in the teleop_disabled routine making this problem go away then makes sense. I would have expected the SO_RCVBUF buffers to fill up and then start dropping data rather than causing the stack to hang though. That’s what happens in the Linux case. Since the layer 3 (IP) is considered unreiable, Linux has no problems dropping the packets when the stack gets sufficient backpressure. Again, this shouldn’t be a problem with next year’s control system thankfully.

The subject of this thread, NetworkTables, does not use UDP. It uses TCP. :slight_smile:

Not true. This particular NetworkTables bug can be duplicated on Linux also – though it is a lot harder, because the default send buffers are a lot bigger on a PC.

The real ultimate solution is rewriting NetworkTables for C++.

Why would a port open for unicast UDP accumulate multicast packets? (the most likely source of the unknown inbound data described by NI). Or are the unexpected packets coming from SD? I wish I had time to look into this - the NI explanation seems shallow. Why is the VxWorks community at-large not reporting this anomaly?

And if this is such a huge problem, why does the SD code not implement one of the workarounds? The NI bug report and workaround recommendation is nearly 2 years old.

Hey Mike, what year did Wind River switch stacks?

Interesting - this makes the NI explanation less likely since the bug report says UDP multicast traffic is a likely source for the unexpected data. But it runs counter to Mike’s (one of the most knowledgeable VxWorks experts on the planet) successful use of a UDP-focused patch to fix the problem.

Agree with the need for a rewrite - that is some nasty looking spaghetti code!

Realize that there are two separate issues being interwoven in this thread. Mike Anderson said they are not using NT and are instead using UDP, which RufflesRidge responded to and warned about the vxworks issue. No one implied that the vxworks issue is related to the NT issue that Dustin is reporting.

Indeed, I heartily agree. (I know C++)

[ancedote]
Before last season I was helping the author clean it up and finish it so it could ship for the 2013 season. twards the end of november I looked at the code and send him a laundry list of suggestions, and among the responses were “Yes I created this in java. I did it in c++ to mirror the java api” and “Yah I know as I said I wasn’t the right person to do this”. As it was already the end of november, re-writing it was out of the question so I attempted to clean it up a bit. I managed to clean a few things up, like removing the custom UTF16 string class among other things. Then I moved over to making SFX, so never got a change to clean it up. I was hoping to with the C++11 project, but SFX took over my time, and there are not many good C++ devs. Sigh…
[/ancedote]

Anyway, I will attempt to move these patches along, though no guarentees.

I believe Mike was discussing an internal solution that they created/used to communicate instead of using NeworkTables. Their solution apparently used UDP.