Chief Delphi


virtuald 08-02-2014 23:39

Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
A bug report has been filed on the WPILib tracker here: http://firstforge.wpi.edu/sf/go/artf1719

The TL;DR is basically this: if you are connected via SmartDashboard, and you disconnect the connected computer's wireless while the robot is writing a value to the SmartDashboard, the robot may hang until the write times out, which can take a few minutes.

I'm working on identifying a good fix, but I fear the best way to fix it is to use non-blocking I/O... which would be a rather large rewrite.

Joe Ross 11-02-2014 13:38

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
I think we've probably seen this. In the pits, we often have the driver station connected to the robot, with a LabVIEW dashboard using SmartDashboard/NetworkTables. When there is a programming change, the programmers plug in a second computer and run SmartDashboard. As we heavily use the preferences class and other uploaded files, we aren't often rebooting the robot. Sometimes when we disconnect the programming computer ethernet cable, we see a hang in NetworkTables on the driver station.

We use Java on the robot, but I assume the implementation is similar to C++.

virtuald 11-02-2014 14:49

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Good to know. Does your robot stop responding also, or is it just SmartDashboard that stops?

In the last few years, we've definitely had a robot exhibit odd behaviors where it isn't responding to controls, but we've never directly been able to associate it with NetworkTables until now. We've always heavily used NetworkTables, particularly last year.

Joe Ross 11-02-2014 15:02

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
It was just SmartDashboard that stopped, except that we needed data from SmartDashboard to function properly. We've been trying to remove those constraints from our code this year, wherever possible.

virtuald 13-02-2014 14:48

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by Joe Ross (Post 1341291)
It was just SmartDashboard that stopped, except that we needed data from SmartDashboard to function properly. We've been trying to remove those constraints from our code this year, wherever possible.

Interesting. I've been moving more towards putting robot control in the SmartDashboard (well, a custom UI using NetworkTables), because there's a lot you can do with a touchscreen -- in particular, implementing toggle buttons using a UI is much easier than wiring up toggle buttons to attach to the DS I/O. Once the bugs in NetworkTables get ironed out, it should be a pretty good solution, and worked pretty well for us last year despite the bugs.

Jared 17-02-2014 07:20

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by Joe Ross (Post 1341264)
I think we've probably seen this. In the pits, we often have the driver station connected to the robot, with a LabVIEW dashboard using SmartDashboard/NetworkTables. When there is a programming change, the programmers plug in a second computer and run SmartDashboard. As we heavily use the preferences class and other uploaded files, we aren't often rebooting the robot. Sometimes when we disconnect the programming computer ethernet cable, we see a hang in NetworkTables on the driver station.

We use Java on the robot, but I assume the implementation is similar to C++.

I've had something similar happen at competition before. Very often, our team does a test of all the robot functions while it's on the cart before we put it on the field. One time, I disconnected the robot as it was sending our shooter rpm to the smartdashboard, and didn't restart the robot before the match. The robot connected, but no smartdashboard happened, and somehow, all the indicators and the auto mode chooser disappeared, which was a problem for choosing our auto mode.

taichichuan 22-02-2014 09:29

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
We punted the network tables stuff early in the season and went with straight up UDP sockets. There are a lot of misconceptions in the WPILib code regarding network communications. E.g., if you have a source of UDP packets talking to the cRio, you need to have a consumer running in your teleop disabled code to toss the UDP data away or the bot will hang.

This is because the implementation on WPILib tries to buffer all network traffic and deliver it regardless of whether it should or not. UDP traffic without a listener should just be tossed on the floor according to the specification. But that's not what WPILib does. In fact, WPILib apparently keeps allocating RAM for the network comms until the bot runs out of memory. Thank goodness this isn't a safety-critical application.

So, if it's possible for you, drop back to good old UDP sockets (not TCP, as that requires a connection to be maintained). Just remember to create a thread on the cRio to run and read/throw away the packets if you're not in an operational mode.

HTH,

Mike

RufflesRidge 22-02-2014 18:56

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by taichichuan (Post 1347935)
There are a lot of misconceptions in the WPILib code regarding network communications. E.g., if you have a source of UDP packets talking to the cRio, you need to have a consumer running in your teleop disabled code to toss the UDP data away or the bot will hang.

I'm fairly certain that is an issue with the VxWorks socket implementation and not caused by WPILib. WPILib does not contain a UDP socket class.

taichichuan 23-02-2014 00:34

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Hmm... Perhaps, but I've been using VxWorks for 25 years and never had a problem with UDP traffic before. Of course, I wouldn't rule out that something that NI added has changed the network implementation. It's a moot point at this juncture as next year's control system is embedded Linux with the PREEMPT_RT patch in place. It will be a completely different beast.

Mike

virtuald 23-02-2014 02:43

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by taichichuan (Post 1348286)
Hmm... Perhaps, but I've been using VxWorks for 25 years and never had a problem with UDP traffic before. Of course, I wouldn't rule out that something that NI added has changed the network implementation. It's a moot point at this juncture as next year's control system is embedded Linux with the PREEMPT_RT patch in place. It will be a completely different beast.

Mike

Well, the first problem is that NetworkTables uses TCP communications, not UDP communications. The second problem is regardless of whether its model of how network communications should work is good or not, the C++ implementation of NetworkTables is a *truly* horrible and complex piece of code -- it's clear that it was the author's first experience with writing C++ code (and he admitted this himself when I ran into him at WPI Battle Cry last year). The library needs a rewrite.

However, despite all those problems, I *do* really like the idea of being able to use SmartDashboard, and I really like the simple API that is exposed on the robot. I'm loath to reimplement SmartDashboard and NetworkTables itself, and my hope is that they'll fix up the implementations for next year -- so until then, I'll keep patching it for the python interpreter :)

PS: In case you're interested, I found another obscure bug in NetworkTables tonight, that causes buffer overflows on my linux box. If you've ever wondered why you see gibberish in Netconsole when a NetworkTables client disconnects, I found out why.

virtuald 24-02-2014 00:38

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
For those interested, I've posted an updated patch to the bug report. Without completely rewriting NetworkTables, I think the best solution (in addition to the previous fix) is to make the sockets non-blocking, and use select with a 1-second timeout on writes.

My thought is that anything that blocks a write for more than a second is going to be useless anyways, and NetworkTables has provisions for reconnecting when the connection dies. Better than hanging permanently.
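For anyone who wants to see the shape of the idea, here is a minimal sketch of a non-blocking write guarded by select with a timeout. It is not the patch attached to the bug report: the names SetNonBlocking, WriteWithTimeout, and WriteResult are hypothetical, the code assumes a POSIX-style socket API as found on Linux, and it has not been tried on a cRIO.

```cpp
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

// Hypothetical result type for this sketch (not a WPILib type).
enum WriteResult { kWriteOk, kWriteTimedOut, kWriteError };

// Put the descriptor into non-blocking mode so send() never stalls the caller.
bool SetNonBlocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) return false;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == 0;
}

// Write len bytes, waiting at most timeout_sec for the socket to become
// writable each time the send buffer is full. The timeout applies to each
// wait, not to the whole call.
WriteResult WriteWithTimeout(int fd, const char* data, size_t len, int timeout_sec) {
    size_t sent = 0;
    while (sent < len) {
        fd_set write_fds;
        FD_ZERO(&write_fds);
        FD_SET(fd, &write_fds);
        timeval tv;
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;

        int ready = select(fd + 1, NULL, &write_fds, NULL, &tv);
        if (ready == 0) return kWriteTimedOut;  // peer is not draining data
        if (ready < 0) {
            if (errno == EINTR) continue;       // interrupted, wait again
            return kWriteError;
        }

        // MSG_NOSIGNAL (assumed available, as on Linux) avoids SIGPIPE
        // when the peer has already gone away.
        ssize_t n = send(fd, data + sent, len - sent, MSG_NOSIGNAL);
        if (n < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR) continue;
            return kWriteError;
        }
        sent += static_cast<size_t>(n);
    }
    return kWriteOk;
}
```

A caller would treat kWriteTimedOut the same as a dropped connection: close the socket and let the reconnect logic take over.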

If anyone has feedback on the patch, I'd welcome it.

Our team successfully used the first part of the patch without issues in a week zero event, but we don't have competition until mid-March, so I won't have any hard testing of the patch until then. However, I've tested it extensively on Linux/Windows, and on a cRio-II that was disconnected from actual robot hardware.

wireties 25-02-2014 14:58

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by RufflesRidge (Post 1348120)
I'm fairly certain that is an issue with the VxWorks socket implementation and not caused by WPILib. WPILib does not contain a UDP socket class.

No way it is a problem with the VxWorks network stack - the stack used by Wind River is a derivative of BSD 4.4, as was Linux. So VxWorks is using some of the most heavily exercised/hardened code on the planet. I've been using it for 20+ years without any issues.

Alan Anderson 25-02-2014 15:10

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by wireties (Post 1349567)
No way it is a problem with the VxWorks network stack - the stack used by Wind River is a derivative of BSD 4.4, as was Linux. So VxWorks is using some of the most heavily exercised/hardened code on the planet. I've been using it for 20+ years without any issues.

See http://digital.ni.com/public.nsf/all...2579FE0053DB93 for an explanation of why it is a VxWorks network issue.

taichichuan 25-02-2014 16:02

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Hmm... OK, I can understand this in light of the way cluster_buf management is handled in VxWorks. The MUX layer has to allocate space for inbound packets if the port is open. It assumes that sooner or later you're going to have to read the data. I'm surprised that it affects UDP traffic though. Nonetheless, the behavior should be to drop the packets rather than infinitely queuing them. Maybe this behavior changed when WRS switched from the BSD stack to the Interpeak stack, so they never went back to fix it in the release that we use for FIRST. I've been pretty disappointed in WRS's involvement in FIRST over the past couple of years. It's too bad they're not really supporting the community any longer.

In any case, it then makes sense that our approach of putting UDP read code in the teleop_disabled routine made this problem go away. I would have expected the SO_RCVBUF buffers to fill up and then start dropping data rather than causing the stack to hang though. That's what happens in the Linux case. Since layer 3 (IP) is considered unreliable, Linux has no problems dropping the packets when the stack gets sufficient backpressure. Again, this shouldn't be a problem with next year's control system thankfully.

virtuald 25-02-2014 16:31

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by taichichuan (Post 1349583)
I'm surprised that it affects UDP traffic though.

The subject of this thread, NetworkTables, does not use UDP. It uses TCP. :)

Quote:

Originally Posted by taichichuan (Post 1349583)
Again, this shouldn't be a problem with next year's control system thankfully.

Not true. This particular NetworkTables bug can be duplicated on Linux also -- though it is a lot harder, because the default send buffers are a lot bigger on a PC.
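One way to make a desktop reproduction less painful might be to shrink the socket's send buffer so it fills quickly. The sketch below only shows the setsockopt call involved. ShrinkSendBuffer is a hypothetical helper, the requested size is arbitrary, and whether this actually triggers the hang in NetworkTables is an assumption that would need to be tested.

```cpp
#include <sys/socket.h>
#include <sys/types.h>

// Ask the kernel for a smaller send buffer and report what it actually
// applied. The kernel may round the value up (Linux doubles it and enforces
// a minimum), so the return value can differ from requested_bytes.
// Returns -1 on error.
int ShrinkSendBuffer(int fd, int requested_bytes) {
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                   &requested_bytes, sizeof(requested_bytes)) != 0) {
        return -1;
    }
    int actual = 0;
    socklen_t len = sizeof(actual);
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &actual, &len) != 0) {
        return -1;
    }
    return actual;
}
```

With a small buffer, a client that stops reading should stall the writer after far less data than with the desktop defaults.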

The real ultimate solution is rewriting NetworkTables for C++.

wireties 25-02-2014 16:35

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by Alan Anderson (Post 1349573)
See http://digital.ni.com/public.nsf/all...2579FE0053DB93 for an explanation of why it is a VxWorks network issue.

Why would a port open for unicast UDP accumulate multicast packets? (the most likely source of the unknown inbound data described by NI). Or are the unexpected packets coming from SD? I wish I had time to look into this - the NI explanation seems shallow. Why is the VxWorks community at-large not reporting this anomaly?

And if this is such a huge problem, why does the SD code not implement one of the workarounds? The NI bug report and workaround recommendation is nearly 2 years old.

Hey Mike, what year did Wind River switch stacks?

wireties 25-02-2014 16:42

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1349597)
The subject of this thread, NetworkTables, does not use UDP. It uses TCP. :)

Interesting - this makes the NI explanation less likely since the bug report says UDP multicast traffic is a likely source for the unexpected data. But it runs counter to Mike's (one of the most knowledgeable VxWorks experts on the planet) successful use of a UDP-focused patch to fix the problem.

Agree with the need for a rewrite - that is some nasty looking spaghetti code!

Joe Ross 25-02-2014 17:28

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by wireties (Post 1349606)
Interesting - this makes the NI explanation less likely since the bug report says UDP multicast traffic is a likely source for the unexpected data. But it runs counter to Mike's (one of the most knowledgeable VxWorks experts on the planet) successful use of a UDP-focused patch to fix the problem.

Realize that there are two separate issues being interwoven in this thread. Mike Anderson said they are not using NT and are instead using UDP, which RufflesRidge responded to and warned about the vxworks issue. No one implied that the vxworks issue is related to the NT issue that Dustin is reporting.

byteit101 25-02-2014 17:28

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1348325)
the C++ implementation of NetworkTables is a *truly* horrible and complex piece of code -- it's clear that it was the author's first experience with writing C++ code (and he admitted this himself when I ran into him at WPI Battle Cry last year). The library needs a rewrite.

Indeed, I heartily agree. (I know C++)

[anecdote]
Before last season I was helping the author clean it up and finish it so it could ship for the 2013 season. Towards the end of November I looked at the code and sent him a laundry list of suggestions, and among the responses were "Yes I created this in java. I did it in c++ to mirror the java api" and "Yah I know as I said I wasn't the right person to do this". As it was already the end of November, rewriting it was out of the question, so I attempted to clean it up a bit. I managed to clean a few things up, like removing the custom UTF16 string class among other things. Then I moved over to making SFX, so I never got a chance to clean it up. I was hoping to with the C++11 project, but SFX took over my time, and there are not many good C++ devs. Sigh...
[/anecdote]

Anyway, I will attempt to move these patches along, though no guarantees.

virtuald 25-02-2014 17:29

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by wireties (Post 1349606)
But it runs counter to Mike's (one of the most knowledgeable VxWorks experts on the planet) successful use of a UDP-focused patch to fix the problem.

I believe Mike was discussing an internal solution that they created/used to communicate instead of using NetworkTables. Their solution apparently used UDP.

wireties 25-02-2014 17:40

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1349628)
I believe Mike was discussing an internal solution that they created/used to communicate instead of using NetworkTables. Their solution apparently used UDP.

My apologies - I missed this association.

But the fact that the NT problem is TCP-related is not directly associated with the NI bug report, correct? Am I missing something? Is all this NT trouble simply that the TCP socket buffers are not serviced in a timely manner? This could happen in VxWorks or Linux (next year), correct?

I am tempted to go back a few years and build a custom dashboard in LabView.

taichichuan 25-02-2014 18:06

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Hi Keith et al,

I went back and checked when WRS purchased Interpeak. That was in the 2007 time frame and the Interpeak stack was integrated into VxWorks 6.5 (our FIRST code is based on VxWorks 6.3). So, our code appears to be the BSD stack. Also, digging back into my notes from the MUXLib code, there is an opportunity to exhaust the cluster buffers if the traffic is never read. This would affect any networking code regardless of the protocol being used. However, simply having a reader to gobble up the packets when you're not actually reading/using it would be one of the work-arounds. It certainly works for us.

Now, on to the NT implementation... I've heard from many sources that the NT code was terribly flawed and needs to be rewritten. This has been confirmed on this thread. There was also a hack that came out last season that marked the underlying sockets as non-blocking to address horrible latency problems with the FMS that particularly affected C/C++-based 'bots. So, I think we can agree that this code needs to be fixed or tossed.

I'm curious as to why sockets, a technology that's been working for over 30 years and is the basis for nearly all networking code around the world, needed the network tables abstraction in the first place. The socket API is relatively trivial in comparison to the NT implementation. Did someone believe that the students couldn't handle network programming, so they needed to hide it for some reason? It sounds like another case of trying to abstract details away by adding more complexity. I've found that the students can be remarkably resourceful when confronted with such problems, especially one that is so easy to solve with a little Google-fu.

Sigh, let's hope that they don't bollux up the Linux implementation as well.

Thanks for all of the input on this one. It certainly helps me decide what to focus the students on for next year's preseason.

virtuald 25-02-2014 18:25

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by taichichuan (Post 1349640)
There was also a hack that came out last season that marked the underlying sockets as non-blocking to address horrible latency problems with the FMS that particularly affected C/C++-based 'bots.

Ha, I hadn't seen that last year. My patch also sets the socket to be non-blocking -- BUT I also fix the underlying lock contention which is causing the performance problem. One should generally never write to the network while holding a lock someone else might want to obtain. :)
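To illustrate the locking point in general terms (this is not the code from the patch), here is a small sketch where the lock is held only long enough to grab the pending messages, and the potentially slow network write happens after the lock is released. OutgoingQueue and its methods are hypothetical names, and std::mutex assumes a C++11 compiler.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical outgoing message queue. Producers call Enqueue from any
// thread; one writer thread calls Flush.
class OutgoingQueue {
public:
    void Enqueue(std::string msg) {
        std::lock_guard<std::mutex> guard(mutex_);
        pending_.push_back(std::move(msg));
    }

    // Takes everything queued so far while holding the lock, then releases
    // the lock BEFORE calling send_fn. A slow or stalled send therefore
    // never blocks threads that only want to Enqueue.
    template <typename SendFn>
    std::size_t Flush(SendFn send_fn) {
        std::vector<std::string> batch;
        {
            std::lock_guard<std::mutex> guard(mutex_);
            batch.swap(pending_);
        }  // lock released here
        for (std::size_t i = 0; i < batch.size(); ++i) {
            send_fn(batch[i]);  // may take a long time; no lock is held
        }
        return batch.size();
    }

private:
    std::mutex mutex_;
    std::vector<std::string> pending_;
};
```

The design choice is simply to copy (or swap) under the lock and do I/O outside it, so robot code that updates a value is never stuck behind a stalled socket.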

Quote:

I'm curious as to why sockets, a technology that's been working for over 30 years and is the basis for nearly all networking code around the world, needed the network tables abstraction in the first place. The socket API is relatively trivial in comparison to the NT implementation. Did someone believe that the students couldn't handle network programming, so they needed to hide it for some reason? It sounds like another case of trying to abstract details away by adding more complexity.

I personally really like the idea of transmitting information by key-value pairs; it's a much better abstraction to deal with than having to roll your own byte buffers. I think it's easier to explain to students too.

Additionally, there is value to everyone using the same protocol instead of rolling your own. Being able to use the same tools from multiple languages seems like a huge win. I particularly like the *idea* of SmartDashboard -- I just call PutNumber and the value magically shows up on the remote side. Makes debugging and tuning so much easier!
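As a toy illustration of why the key-value style is pleasant to work with, here is a purely local sketch. ToyTable is a hypothetical class, not WPILib's SmartDashboard or NetworkTables, and it does no networking at all; the real library also synchronizes the values with remote clients.

```cpp
#include <map>
#include <string>

// Hypothetical in-memory stand-in for a key-value table. It only shows the
// shape of the API: named values instead of hand-packed byte buffers.
class ToyTable {
public:
    void PutNumber(const std::string& key, double value) {
        numbers_[key] = value;
    }

    // Returns the stored value, or default_value if the key was never set.
    double GetNumber(const std::string& key, double default_value) const {
        std::map<std::string, double>::const_iterator it = numbers_.find(key);
        return it == numbers_.end() ? default_value : it->second;
    }

private:
    std::map<std::string, double> numbers_;
};
```

The caller never has to agree on byte offsets or field order with the other side, only on key names.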

Joe Ross 25-02-2014 18:28

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by taichichuan (Post 1349640)
There was also a hack that came out last season that marked the underlying sockets as non-blocking to address horrible latency problems with the FMS that particularly affected C/C++-based 'bots. So, I think we can agree that this code needs to be fixed or tossed.

I don't remember that. Are you sure you aren't thinking of the patch that came out and enabled the Nagle algorithm?

taichichuan 25-02-2014 18:46

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Ahh, yes... That was it. Pardon my poor memory. It was the Nagle algorithm issue they addressed last season. As I recall, it added ~100ms of latency to the link. Again, this was related to NT. The thread is here:

http://www.chiefdelphi.com/forums/sh...socket+latency

virtuald 28-02-2014 23:38

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
A mentor on the team I worked with in Massachusetts was at a district competition today, and noted that a number of teams seem to have freezing/stuttering problems on the field. While he didn't verify that the teams were having other problems (which may very well be the case), it sounds similar to the problems I was experiencing before the patch.

Due to FIRST not releasing at least an optional official fix for this, I've decided to post an unofficial fixed binary to the original bug report (direct download link), for those who may not know how to fix the problem themselves. For anyone looking at the Driver Station versions, it will show up in diagnostics in 'Lib:' as 'C++ 2014 NT Fix'

Unfortunately, I haven't been able to verify it on a cRio as I don't have access to one at the moment -- if any of you can do this, that would be great. However, the original WPILib binary is 13.0 MB, and this one is 13.1 MB, so I expect that it should work, as it worked for RobotPy's version of WPILib when we used it.

Greg McKaskle 02-03-2014 13:28

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
I didn't write the article on ni.com, but perhaps I can help decode it.

My Take-away: If you open a TCP or UDP port on vxworks and someone writes to it, your program should read from it or it will eventually fill the communications buffer and interfere with communication even on other ports and protocols. This is not true on other OSes LV runs on and not true of other VxWorks versions, but is true of the current version of VxWorks that NI supports on the cRIO.

It sounds like the original bug report may have involved an unexpected arrival of datagrams. The author opened the port to use for outgoing datagrams. The author did not expect it to receive datagrams, never read from the port, and discovered that this would eventually lead to the symptoms listed. The suggested workaround in the article is to read from the port or close and reopen to flush unexpected datagrams even on ports you assume to be write-only.
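The "read from the port even if you don't care about the data" workaround could look roughly like the sketch below. DrainDatagrams is a hypothetical helper, it assumes a POSIX-style socket API with MSG_DONTWAIT, it is meant for datagram (UDP) sockets only, and it has not been verified on VxWorks.

```cpp
#include <cerrno>
#include <sys/socket.h>
#include <sys/types.h>

// Read and discard every datagram currently queued on fd without blocking.
// Returns the number of datagrams thrown away. Intended for datagram
// sockets only: on a TCP socket a return of 0 from recv means the peer
// closed the connection, which this loop does not handle.
int DrainDatagrams(int fd) {
    char scratch[2048];  // oversized datagrams are truncated but still consumed
    int discarded = 0;
    for (;;) {
        ssize_t n = recv(fd, scratch, sizeof(scratch), MSG_DONTWAIT);
        if (n >= 0) {
            ++discarded;
            continue;
        }
        if (errno == EINTR) continue;  // interrupted, try again
        break;  // EAGAIN/EWOULDBLOCK: queue is empty; other errors: give up
    }
    return discarded;
}
```

Calling something like this periodically from disabled-mode code (or from a small background thread) keeps the receive queue from filling while nobody is consuming the data.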

Since FIRST robots are generally on a controlled network, I don't think the suggestion is necessary. In reference to Einstein, what took place a few years ago involved data from a coprocessor intentionally writing to a UDP port on the cRIO. The thread responsible for reading from it was sometimes spinning, waiting for a sensor value to stabilize. The unattended UDP port filled the buffer and prevented communication on other ports that would have allowed communication to the cRIO -- including the ability to reboot the cRIO. There is of course no way to know that this was exactly what took place on Einstein on that particular robot. But the code would loop indefinitely with a bad or disconnected sensor. It fit the symptoms, and was determined to be the most likely explanation for what was observed on that particular robot.

To the original topic, the original SD protocol was even more complex and was quite difficult to implement. In fact, I decided not to release the LV implementation because I wasn't comfortable with its reliability. The next year, we removed a number of features, simplifying the implementation, and released all three languages. SD offers an alternative to sockets or TCP/UDP. Teams may choose any of these forms of communication on open ports, and since port 80 is open, they could use other forms such as web services.

The issue that affected the field last year in week one was caused by a flood of tiny single byte TCP packets in the C++ implementation. The short-term solution was to allow the OS to buffer the writes using the Nagle algorithm. I don't know if this is still enabled or if the writes were refactored to transmit larger transaction buffers the way the LV implementation does.
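For reference, the Nagle algorithm is controlled per socket through the TCP_NODELAY option. The sketch below shows the standard BSD-socket calls on a POSIX-style system; SetNagleEnabled and IsNagleEnabled are hypothetical helper names, and this is not taken from the WPILib change itself.

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

// TCP_NODELAY = 1 disables Nagle (every small write goes out immediately).
// TCP_NODELAY = 0 enables Nagle (the OS may coalesce small writes).
bool SetNagleEnabled(int fd, bool enabled) {
    int no_delay = enabled ? 0 : 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
                      &no_delay, sizeof(no_delay)) == 0;
}

// Reports the current setting; returns false if the query itself fails.
bool IsNagleEnabled(int fd) {
    int no_delay = 0;
    socklen_t len = sizeof(no_delay);
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &no_delay, &len) != 0) {
        return false;
    }
    return no_delay == 0;
}
```

Enabling Nagle trades some latency for fewer packets; batching the writes in the application into larger buffers avoids the tiny packets without relying on the OS.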

I was in San Antonio this weekend, and we saw lockup issues with one C++ team making heavy use of SD and a Java DB. The team chose to disable SD usage and their symptoms seem to have disappeared. Plenty of other teams use SD in C++, Java, and LV in various DB combinations. I'm not aware of other lockup reports from San Antonio. This will be investigated further. I'm sure Brad and the WPI folks appreciate the help with the C++ implementation.

Greg McKaskle

Aren Siekmeier 03-03-2014 01:37

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1351240)
Unfortunately, I haven't been able to verify it on a cRio as I don't have access to one at the moment -- if any of you can do this, that would be great.

We've been working through this problem too, discussed in this thread. Here's what we've come up with so far:

Quote:

Originally Posted by compwiztobe (Post 1352419)
We applied this patch, and rebuilt and redeployed our code. It appeared to improve the problem at first in that it gave the same FDIO errors as we saw previously, about as frequently as we had been getting the timeout errors, and we did not appear to be hanging. However (and we don't really understand what happened here), after a very short time, it started timing out immediately when enabled, giving the same error at line 117 of MotorSafetyHelper.cpp. One interesting thing to note is that when disabled all these errors stopped, so it seems to still be making it through the control loop in IterativeRobot.cpp (StartCompetition()). We have most of these error messages logged, so as soon as someone has access to those text files we will post those.

We have ported all of our code to Java and LabView, and neither of these implementations seem to show the same problem, though we hope to run the bot into the ground a little more over the next few days to tease any more errors out of it. Some C++ code that doesn't use Commands or the SmartDashboard/NetworkTables is in the works so testing that may also give some insight. We will probably stick with the Java implementation for the rest of the season.


JamesTerm 03-03-2014 12:13

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1339935)
A bug report has been filed on the WPILib tracker here: http://firstforge.wpi.edu/sf/go/artf1719

Thanks VirtualD for this find... I wish I could have worked with you in beta this season, as I have also made improvements to the Network Tables... I originally wanted to make the win32 port, but then as I got it working I needed the shutdown to work properly and discovered lockups during disconnect and reconnect stresses. So I then focused on one issue... and that was how the parent class deletes the child class... instead of the child class trying to delete itself within its own thread. I also worked with the author to provide a shutdown procedure and ensure there were no memory leaks. We worked together on this over the summer... I've attached https://www.dropbox.com/s/iv3rozae2q...tDashboard.ppt and https://www.dropbox.com/s/9p6jhhnt8y...ard_Client.ppt an object oriented diagram that helps me navigate through the code.

I wish I'd run into Greg in San Antonio (that is where I live btw)... as our robot has also fallen victim to this symptom in 3 matches. One of the FTA guys has captured our log but I haven't heard back from him.

I did not get into the details of the guts of the code, but now I'd like to review your changes, and see what can be done to get some official fix for all teams. Thanks so much again... I can't begin to tell you how frustrating this has been... when the team looks at me and asks why our robot is failing... but hey that's ok... we can work this out... I really needed some good testers to test the changes made over the summer... so I'm hoping to hook up with everyone who's had a hand in the Network Tables code, and try to get this fixed properly and be reliable!

RyanShoff 03-03-2014 13:47

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
I believe we went dead in our first match (and the first match of the regional) due to this bug. The drive team tethered and charged the pneumatics with a dashboard running. I'm not entirely sure if they changed anything on the dashboard. Then they unplugged the tether and placed the robot on the field. The robot never moved in autonomous or teleop until the driver rebooted the cRIO through the driver station. The field people said everything looked ok to them.

After this happened, we added a policy of doing a hard power-off reboot after placing the robot and not turning on the dashboard during competition (we don't need it during a match). The problem never happened again and we didn't change anything else.

We also had two incidents of unintended acceleration before bag and tag, which might be related.

This is just a heads up to other teams. We aren't going to look into it further since we have a working process. Losing that match turned out not to be a big deal.

It also could just have been the robot having first match jitters.

virtuald 03-03-2014 14:40

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by JamesTerm (Post 1352605)
We worked together on this over the summer... I've attached https://www.dropbox.com/s/iv3rozae2q...tDashboard.ppt and https://www.dropbox.com/s/9p6jhhnt8y...ard_Client.ppt an object oriented diagram that helps me navigate through the code.

Wow. That is a mess. Yet more evidence that it needs to be rewritten. :)

JamesTerm 03-03-2014 17:47

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1352741)
Wow. That is a mess. Yet more evidence that it needs to be rewritten. :)

Yes, that would be an ideal long-term goal if anyone is willing to do it... but for now, I am hoping we can find a short-term, quick-turnaround solution that fixes the root cause and get an official fix for all teams. (I'm going to re-read what Greg posted on that, once I start the code review with the patched changes.) I'm keeping an eye on the test results from compwiztobe as well.

I believe one of the issues with this design is how closely it reflects the JAVA language version, where JAVA does not need to manage memory... hence the red arrows in the diagram (something I never used in my designs) show that objects are not being created and destroyed in the same place. This made it somewhat more difficult to track down the memory leaks. I think the redesign should be c++ based design which does abide by c++ conventions and have clean objects that manage memory properly... and then port this to JAVA... going in that direction... JAVA can simply ignore all calls to deletes, or interface them to do nothing... etc. Anyhow I just wanted to point this out for anyone else who is code reviewing.
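As a small illustration of "created and destroyed in the same place", here is a sketch where the parent owns its children and is the only thing that ever deletes them. Server and Connection are hypothetical names (not the NetworkTables classes), and std::unique_ptr assumes a C++11 compiler.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical child object. The counter exists only so the sketch can
// show that every instance is eventually destroyed.
class Connection {
public:
    Connection() { ++live_count; }
    ~Connection() { --live_count; }
    static int live_count;

private:
    Connection(const Connection&);             // not copyable
    Connection& operator=(const Connection&);  // not assignable
};
int Connection::live_count = 0;

// Hypothetical parent. It creates the children and it alone destroys them,
// either in CloseAll() or automatically when the Server goes out of scope.
// A child never deletes itself.
class Server {
public:
    Connection* Accept() {
        connections_.push_back(std::unique_ptr<Connection>(new Connection()));
        return connections_.back().get();  // non-owning pointer for callers
    }
    void CloseAll() { connections_.clear(); }
    std::size_t size() const { return connections_.size(); }

private:
    std::vector<std::unique_ptr<Connection> > connections_;
};
```

With ownership in one place, a leak or double delete has only one possible culprit, which makes this kind of code much easier to review.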

Aren Siekmeier 04-03-2014 00:11

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by JamesTerm (Post 1352838)
I think the redesign should be a C++-based design which does abide by C++ conventions and have clean objects that manage memory properly... and then port this to JAVA... going in that direction... JAVA can simply ignore all calls to deletes, or interface them to do nothing... etc. Anyhow I just wanted to point this out for anyone else who is code reviewing.

This is really the right way to do it since the JVM is often implemented in C++...

JamesTerm 09-03-2014 15:28

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1348325)
PS: In case you're interested, I found another obscure bug in NetworkTables tonight, that causes buffer overflows on my linux box. If you've ever wondered why you see gibberish in Netconsole when a NetworkTables client disconnects, I found out why.

Thanks for posting this... this is a great fix!

Aren Siekmeier 09-03-2014 21:08

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1348325)
PS: In case you're interested, I found another obscure bug in NetworkTables tonight, that causes buffer overflows on my linux box. If you've ever wondered why you see gibberish in Netconsole when a NetworkTables client disconnects, I found out why.

I believe this is the other symptom we saw when we first encountered these problems this year. See the gibberish resulting from two error messages writing to NetConsole simultaneously shown in this post.

Code:

write error: : read error: : S_errno_EPIPE
S_errno_ETIMEDOUT
[[NNTT]]  IIOOEExxcceeppttiioonn  mmeessssaaggee::  CEorurlodr  noont  FwDrIiOt er eaaldl
 [[bNyTt]e s0 xt2o0 ef8dd 1s8t reenatme
r[eNdT ]c o0nxn2e8cet8ido1n8  setnatteer:e dS EcRoVnEnRe_cEtRiRoOnR
state: SERVER_ERROR
[NT] Close: 0x28e8d18
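
If the garbling above does come from two threads printing at the same time, one common way to keep lines intact is to build each line in full and write it out under a lock. A minimal sketch, assuming standard C++11 and a hypothetical `LineLogger` class (this is not how NetConsole or NetworkTables is implemented):

```cpp
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical logger: builds the whole line first, then writes it in one
// call while holding a lock, so lines from different threads cannot be
// interleaved character by character.
class LineLogger {
public:
    explicit LineLogger(std::ostream& out) : out_(out) {}

    void Log(const std::string& tag, const std::string& message) {
        // Format outside the lock; only the final write is serialized.
        std::ostringstream line;
        line << "[" << tag << "] " << message << "\n";
        const std::string text = line.str();

        std::lock_guard<std::mutex> guard(mutex_);
        out_.write(text.data(), static_cast<std::streamsize>(text.size()));
        out_.flush();
    }

private:
    std::ostream& out_;
    std::mutex mutex_;
};
```

Every thread would need to log through the same `LineLogger` instance for this to help; output written directly to the stream elsewhere can still interleave.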


JamesTerm 09-03-2014 21:27

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by compwiztobe (Post 1356095)
I believe this is the other symptom we saw when we first encountered these problems this year. See the gibberish resulting from two error messages writing to NetConsole simultaneously shown in this post.

Yes I was able to reproduce that as well, and I've confirmed the patch work has fixed this issue... today (9 Mar 14) I have updated this https://www.dropbox.com/s/dbpv32qii7...sDustinFix.zip file to include that patch... along with the previous fixes I talked about before... it is doing much better now. There is still one minor issue left... I'm going to see if I can fix it.

JamesTerm 10-03-2014 11:20

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by JamesTerm (Post 1352838)
Yes, that would be an ideal long-term goal if anyone is willing to do it...

I want to expand a bit on this... one thing I have realized later in life is that time gets shorter and it seems like code is never done. There are so many things I'd have liked to have done in code, but can't find the time to do them. So as we sharpen our skills in coding we have less time to do it (especially if you have kids). So with this kind of perspective I've learned that we try to bring the most functionality forward even though it may not be perfect. When I was younger... I'd love to re-invent the wheel as I could learn from it, but then at some point I learned how to read other people's code and figure out what they intended to do. This is perhaps the most important skill for a software engineer as there is so much existing code out there... it is important as this is needed to be able to work with a team of engineers, and deal with any imperfections, or coding style differences... when all is said and done... it will be just a few lines of code here and there... corrected, and this Network Tables will become a reliable solution. Granted, there is some code out there that cannot be salvaged, but this code doesn't fall into that category as I think it was well thought out... in spite of the nit-picky things I mentioned previously. I am a pragmatic person who works with imperfect code all the time... we get it done enough to get the job done... if it works... and works reliably then we win... without re-inventing and losing time and going through another cycle of bugs to fix.

virtuald 10-03-2014 11:34

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by JamesTerm (Post 1356450)
Granted, there is some code out there that cannot be salvaged, but this code doesn't fall into that category as I think it was well thought out... in spite of the nit-picky things I mentioned previously.

Disagree. There are architectural changes that should be made to the code. It's far more complex than what is needed to do the task at hand, and it's very difficult to follow. Because of the complexity, changes that are made are at a higher risk of breaking other things in subtle ways.

Additionally, there is a *lot* of 'bad smelling code' -- stupid things like casting pointers to references and back again (which is related to the author wanting to use 'new' for everything -- which wasn't always necessary), using a union to hold multiple types without any reliable way to determine what is in the union, one-byte reads/writes to sockets … and this is just the beginning.

Because of these (and other things), most of the code should be scrapped. It will be easier to rewrite the code correctly than to try and fix its problems.
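
To illustrate one of the smells listed above: a union with no way to tell what it holds is usually avoided by storing a tag next to it and checking the tag on every access. A minimal sketch, where `EntryValue` and its members are hypothetical names and not the actual NetworkTables types:

```cpp
#include <cassert>

// Hypothetical tagged value: the Kind tag always records which union
// member was last written, and the accessors check it before reading.
class EntryValue {
public:
    enum class Kind { Boolean, Number };

    static EntryValue FromBoolean(bool b) {
        EntryValue v(Kind::Boolean);
        v.data_.b = b;
        return v;
    }

    static EntryValue FromNumber(double d) {
        EntryValue v(Kind::Number);
        v.data_.d = d;
        return v;
    }

    Kind kind() const { return kind_; }

    bool AsBoolean() const {
        assert(kind_ == Kind::Boolean);  // catch mismatched reads early
        return data_.b;
    }

    double AsNumber() const {
        assert(kind_ == Kind::Number);
        return data_.d;
    }

private:
    explicit EntryValue(Kind k) : kind_(k), data_() {}

    union Data {
        bool b;
        double d;
    };

    Kind kind_;
    Data data_;
};
```

Callers switch on `kind()` before reading, so a mismatched access fails loudly in a debug build instead of silently reinterpreting the bytes.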

JamesTerm 10-03-2014 12:06

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1356464)
Because of these (and other things), most of the code should be scrapped. It will be easier to rewrite the code correctly than to try and fix its problems.

The problems should already be fixed... thanks for what you posted on these as well! :) Given this status, whether or not it is easier does not matter to me... what matters is how long will it take. For those who have time to rewrite that is wonderful... but as for me, I have other code to write! ;)

There is just one minor issue left I want to get done hopefully before Dallas, but this issue should be invisible to users even in its current state. All of your other points are valid, but until someone takes the time to do a rewrite... this is all we got, so I'm going to make it work for our team, and anyone else who wants to use it. (Up to this point... it was your goal as well).

Once it is proven to be reliable... any new rewrite will introduce new potential risk, and unfortunately it is difficult to find people who have time to test this properly... otherwise we'd have fixed it before now. There is value in code that has been well-tested in spite of its imperfections... if it is proven reliable... that's really all that any user cares about. Have you seen my changes? The author had the same idea on this fix as well.

virtuald 10-03-2014 12:15

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Sure, it does work, and our team is planning on using it in its current state. However, I don't intend to rewrite it either; I also don't have the time at the moment. ;)

NetworkTables is a really useful idea, and so I think it would be worth it for the maintainer of the code (eg, FIRST/WPI) to make the code better -- there's a lot of ways it could be improved. However, if they're going to make improvements in the future, then my recommendation is to rewrite it -- and create some unit tests/etc. If done properly, it will make it a lot easier to fix any such bugs in the future.

JamesTerm 10-03-2014 12:26

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1356489)
NetworkTables is a really useful idea, and so I think it would be worth it for the maintainer of the code (eg, FIRST/WPI) to make the code better -- there's a lot of ways it could be improved. However, if they're going to make improvements in the future, then my recommendation is to rewrite it -- and create some unit tests/etc. If done properly, it will make it a lot easier to fix any such bugs in the future.

I agree with that... it seems now that all efforts are currently focused on SmartDashboard 2.0. That's a whole other subject I don't want to get into now.

It should be noted that I've been using a win32 port of network tables quite a bit... I've found it even useful in NewTek development to watch variables that dynamically change. That said... it has never once locked up or crashed... so this is partly why I could never find the issue... I can't reproduce it in a win32 environment. I believe the last remaining issue deals with the time between connecting and making the initial write. I'm thinking of putting a sleep in there as well as taking a closer look at ConnectionMonitorThread::run()... I suspect this thread may not be sleeping in some cases... but I could be wrong.

NotInControl 10-03-2014 16:17

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Hello,

I am joining this conversation late. We just competed in our first district event this past weekend and noticed some strange anomalies which may be related to the problem identified here.

We use smartdashboard to send booleans back to a dashboard to let the drivers know certain events or robot states have been reached.

A few times during pit testing we noticed that we could not successfully command the robot. The robot was in a hung state. Comms were up, but the dashboard was frozen, and while a button press did register on the default dash, no response was displayed by the robot. Restarting the robot AND restarting the driverstation seemed to be the only way to get around this and re-establish comms.

I was just writing to make sure I understand the bug because we use Java and it was unclear if this problem was just for C++, on the client side, the server side, or both...

Based on what I read it appears the crux of the problem is that although NT is multithreaded, it holds on to a lock during a write sequence. A write sequence which also blocks and keeps the lock if the write fails. The robot thread uses this same lock to push data, thus causing the hang on the robot. Is this all correct?

My question: is the code that obtains the lock on the robot side in NT or in SmartDash? If you were to call the SmartDashboard putXXX methods in a separate thread, would that not at least prevent the hang?

I have not spent any time looking into the NT/SD implementation yet, but I will.

Again we are using Java on the robot, and the pynetworktables port provided by Dustin on the driverstation.

Regards,
Kevin

JamesTerm 10-03-2014 17:32

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by NotInControl (Post 1356714)
Based on what I read it appears the crux of the problem is that although NT is multithreaded, it holds on to a lock during a write sequence. A write sequence which also blocks and keeps the lock if the write fails. The robot thread uses this same lock to push data, thus causing the hang on the robot. Is this all correct?
Kevin

Yes


Quote:

Originally Posted by NotInControl (Post 1356714)
My question: is the code that obtains the lock on the robot side in NT or in SmartDash? If you were to call the SmartDashboard putXXX methods in a separate thread, would that not at least prevent the hang?

I have not spent any time looking into the NT/SD implementation yet, but I will.

The fix is similar to this idea... because the lock used in the call to Put() spans just the reads and writes of the mapped variable and ID. The other work it does is non-blocked. The same is true for the Get() methods too. SmartDashboard Get and Put calls (at least for C++ code) are just an interface/wrapper to the Network Tables. I have not looked at the Java port, but since this was developed from a Java perspective... I'll bet it is nearly the same as the C++ port.
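
As a rough illustration of that idea (not the actual NetworkTables code), the sketch below keeps the lock only around in-memory map updates and performs the potentially blocking send after the lock has been released. `TableSketch`, `Flush` and the `Sender` callback are hypothetical names, and standard C++11 primitives are assumed rather than the ones used on the cRIO.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>

// Hypothetical table: Put()/Get() only touch memory under a short lock.
class TableSketch {
public:
    // The sender may block for a long time (e.g. a write to a dead peer).
    using Sender = std::function<void(const std::string&, double)>;

    // Called by the robot thread; never performs I/O.
    void Put(const std::string& key, double value) {
        std::lock_guard<std::mutex> guard(mutex_);
        values_[key] = value;
        dirty_[key] = value;
    }

    bool Get(const std::string& key, double& out) const {
        std::lock_guard<std::mutex> guard(mutex_);
        std::map<std::string, double>::const_iterator it = values_.find(key);
        if (it == values_.end()) return false;
        out = it->second;
        return true;
    }

    // Called by a network thread. The lock covers only the swap; the sends
    // run after it is released, so a stalled send cannot block Put()/Get().
    std::size_t Flush(const Sender& send) {
        std::map<std::string, double> pending;
        {
            std::lock_guard<std::mutex> guard(mutex_);
            pending.swap(dirty_);
        }
        for (const auto& entry : pending) {
            send(entry.first, entry.second);
        }
        return pending.size();
    }

private:
    mutable std::mutex mutex_;
    std::map<std::string, double> values_;
    std::map<std::string, double> dirty_;
};
```

With this shape, a send that hangs only stalls the thread calling `Flush()`; the robot thread can keep calling `Put()` and `Get()`.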

virtuald 10-03-2014 21:00

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by NotInControl (Post 1356714)
I was just writing to make sure I understand the bug because we use Java and it was unclear if this problem was just for C++, on the client side, the server side, or both…

The problem will affect both the client and server side.

I believe the problem is most severe on C++, but that similar problems may affect the Java side, as others have reported less severe problems in Java. I've only glanced at the Java version recently, and a cursory look shows the Java code does not hold the lock the same way that the C++ code does, but my expectation is that there is a similar bug somewhere, given that it's written by the same author.

Quote:

Based on what I read it appears the crux of the problem is that although NT is multithreaded, it holds on to a lock during a write sequence. A write sequence which also blocks and keeps the lock if the write fails. The robot thread uses this same lock to push data, thus causing the hang on the robot. Is this all correct?
Yup.

Quote:

My question: is the code that obtains the lock on the robot side in NT or in SmartDash? If you were to call the SmartDashboard putXXX methods in a separate thread, would that not at least prevent the hang?
That should prevent the hang.
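
A minimal sketch of that separate-thread approach, assuming standard C++11 threads (the cRIO toolchain may need its own task API instead). `DashboardWorker` is a hypothetical helper: the robot loop posts small tasks, such as a lambda wrapping a putXXX call, and a worker thread runs them, so a blocked write stalls only the worker.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical helper: runs dashboard updates on a worker thread so the
// robot loop never calls the (possibly blocking) put methods directly.
class DashboardWorker {
public:
    using Task = std::function<void()>;

    explicit DashboardWorker(std::size_t capacity)
        : capacity_(capacity), running_(false) {}

    ~DashboardWorker() { Stop(); }

    // Called from the robot loop. Returns false instead of waiting when the
    // queue is full, e.g. because the worker is stuck in a blocked write.
    bool Post(Task task) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (queue_.size() >= capacity_) return false;
        queue_.push_back(std::move(task));
        return true;
    }

    // Runs at most one queued task; the lock is not held while it runs.
    bool RunOne() {
        Task task;
        {
            std::lock_guard<std::mutex> guard(mutex_);
            if (queue_.empty()) return false;
            task = std::move(queue_.front());
            queue_.pop_front();
        }
        task();
        return true;
    }

    void Start() {
        if (running_.exchange(true)) return;  // already started
        worker_ = std::thread([this] {
            while (running_) {
                if (!RunOne()) {
                    std::this_thread::sleep_for(std::chrono::milliseconds(20));
                }
            }
        });
    }

    // Note: join() still waits for a task that is currently blocked.
    void Stop() {
        running_ = false;
        if (worker_.joinable()) worker_.join();
    }

private:
    const std::size_t capacity_;
    std::atomic<bool> running_;
    std::mutex mutex_;
    std::deque<Task> queue_;
    std::thread worker_;
};
```

The bounded queue is a deliberate choice here: if the worker hangs, the robot loop drops updates rather than growing memory or waiting. This moves the hang off the control loop but does not remove the underlying blocking write.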

virtuald 10-03-2014 21:03

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1356925)
The problem will affect both the client and server side.

It is worth noting that I've never actually tried to get the client to hang -- but I think they use the same code to write to the socket, so I would expect the problem to exist there.

Aren Siekmeier 10-03-2014 23:22

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by NotInControl (Post 1356714)
... we use Java ...

We switched from C++ to Java after seeing our issues, and while it was hard to reproduce with C++, everything we have tried so far with Java has shown no sign of the problem.

NotInControl 11-03-2014 18:12

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by compwiztobe (Post 1357067)
We switched from C++ to Java after seeing our issues, and while it was hard to reproduce with C++, everything we have tried so far with Java has shown no sign of the problem.

That is interesting. I have to admit I have not yet confirmed this to be the cause of the symptoms I saw on our robot in the pits this past weekend.

The robot in question is now bagged, however I will be trying to recreate these problems on our practice bot over the next few days.

The symptoms expressed in this thread were very similar to the symptoms we saw which is why I think this bug may be a suspect.

However, we have always had all of our SmartDashboard calls in a separate thread that gets started on robot init. The reason for this is to reduce the number of writes per second.

The only SmartDashboard call I have that is running in the same thread as the robot thread is our autonomous sendable chooser, which runs in the disabledPeriodic() block.

We are going to do testing with and without this function call to see if we can get the robot to hang again. During our quick diagnostics in the pits, the only way we could re-establish full comms was by restarting the robot, and the driverstation/dashboard. Doing just one or the other was not enough to correct the problem.

I am more concerned with preventing the robot from hanging than having my dashboard work.

We have never seen this problem on the field, as we always have a standard practice to shut the robot off, and exit all dashboard/driverstation windows prior to every match.

JamesTerm 11-03-2014 22:30

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by JamesTerm (Post 1356495)
I believe the last remaining issue deals with the time between connecting and making the initial write. I'm thinking of putting a sleep in there as well as taking a closer look at ConnectionMonitorThread::run()... I suspect this thread may not be sleeping in some cases... but I could be wrong.

No, no, no... I was wrong. I found this bug! grrrrr, ok it's now fixed... same place to get the zip and patch. Here's the full patch including Dustin's fixes. https://www.dropbox.com/s/f4mcx9x0hj...tinPatch.patch

I'll explain... this latest fix is mostly for the client-side code... what was happening is that while the server (robot) had lost its connection, such as when rebooting the cRIO, the client code was still trying to issue reads and throwing exceptions... the fix knows when the connection has been closed and when the reconnect has been issued... so during that time it will stop issuing the reads. On some platforms (e.g. win32) the read would return bogus data, which is another issue, but the most important thing is that it should only be calling the read when it knows it should succeed.

This has been a week of hair pulling for me... but now I think it is good to go. Of course the key to the success of this (like anything else) is a lot of testing. All of the other fixes are just as important as this one... they all are needed to resolve the issue. I'm looking forward to hearing back from anyone who wants to test it before the official release. Thanks.
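
The idea can be sketched roughly as follows. This is an illustration with hypothetical names (`GuardedReader`, `OnConnected`, `OnClosed`), not the patched client code: the read loop tracks the connection state and skips the read entirely between a close and the next reconnect.

```cpp
#include <atomic>
#include <functional>
#include <utility>

enum class LinkState { Disconnected, Connected };

// Hypothetical read loop guard: reads are only issued while the
// connection is known to be up.
class GuardedReader {
public:
    using ReadFn = std::function<bool()>;  // returns false on a failed read

    explicit GuardedReader(ReadFn read)
        : read_(std::move(read)), state_(LinkState::Disconnected), reads_(0) {}

    void OnConnected() { state_ = LinkState::Connected; }
    void OnClosed() { state_ = LinkState::Disconnected; }

    // One iteration of the read loop. While the connection is down it
    // returns immediately instead of reading and failing.
    bool Poll() {
        if (state_.load() != LinkState::Connected) return false;
        ++reads_;
        if (!read_()) {
            OnClosed();  // stop reading until a reconnect is reported
            return false;
        }
        return true;
    }

    int reads() const { return reads_; }

private:
    ReadFn read_;
    std::atomic<LinkState> state_;
    int reads_;  // number of reads actually attempted
};
```

The reconnect logic would call `OnConnected()` once the new connection is established; until then `Poll()` does nothing.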

Now I'm signing off of this task... and going back to other code. :)

JamesTerm 16-03-2014 10:56

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
For anyone who has been following this thread, I just wanted to say that we tested the smart dashboard and network tables (with the James/Dustin patches) at the Dallas regional with no issues. I gotta say I felt a little bit of anxiety the first 3-4 matches, but felt more confident as the days progressed... we left the driver station running full time with the SmartDashboard and Driver Station windows always open, which stress-tests cRIO reconnects on existing connections. We also use GetNumber() for the autonomous ball count. It always maintained the correct ball count throughout the day. I am hoping more teams will use this again once these patches are officially released. I'll post back here when they are.

virtuald 19-03-2014 18:17

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Update: FIRST has released an official stable release that should address the problem. It will not be a required update for teams, but if you use NetworkTables I'd highly recommend it. It can be downloaded here: http://first.wpi.edu/FRC/c/update/Stable/

JamesTerm 19-03-2014 18:41

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Thanks Dustin for the posting... I'll keep an ear out here for any issues that may arise.

Joe Ross 19-03-2014 19:42

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Looking through the source, it looks like artf1712 was also fixed, as well as http://forums.usfirst.org/showthread...ive-data-rates

virtuald 22-03-2014 23:53

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
At the Virginia Regional this weekend, I helped out a team using Java that would inexplicably go to 100% CPU and all control would drop out. While there were some definite problems with their code, it turned out that when they commented out all the SmartDashboard code, the problems stopped happening. Very odd.

kylelanman 23-03-2014 02:36

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
We were aware of this bug and attempted to avoid the situation by always power cycling before placing the robot on the field. I am nearly certain we reproduced this bug on the practice field. We had a faulty ethernet cable. During robot power on the connection was intermittent. The dashboard never came to life. We ended up having to hard power cycle the robot and restart the dashboard. Restarting the dashboard may not have been necessary. But we did them in tandem and the dashboard came back to life. Minutes after this we applied the patch and had no NT problems the rest of the regional.

JamesTerm 23-03-2014 09:40

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1362815)
At the Virginia Regional this weekend, I helped out a team using Java that would inexplicably go to 100% CPU and all control would drop out. While there were some definite problems with their code, it turned out that when they commented out all the SmartDashboard code, the problems stopped happening. Very odd.

Was this team 2481, and was the 3886 patch already applied before these problems occurred?

To kylelanman: What programming language are you using?

virtuald 23-03-2014 10:23

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by JamesTerm (Post 1362903)
Was this team 2481, and was the 3886 patch already applied before these problems occurred?

It was not team 2481, and the team was using Java, so the 3886 patch would not apply.

kylelanman 23-03-2014 23:38

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by JamesTerm (Post 1362903)
To kylelanman: What programming language are you using?

C++

JamesTerm 24-03-2014 17:49

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1362919)
It was not team 2481, and the team was using Java, so the 3886 patch would not apply.

Ah ok... I looked over the JAVA source and it is different in its synchronization locking management, and from what I've heard there have been no known issues with it like there were for C++. I just wanted to make sure that the patch fix does not have any more outstanding issues... and so now I know this patch does not apply to JAVA teams. If this is the only known issue it could be a red herring. I will however keep an ear out for JAVA issues too... in case we need to code review it for next season.

Joe Ross 25-03-2014 19:47

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
The update is now available in the release folder as announced in today's team update. http://first.wpi.edu/FRC/c/update/Re...325rev3887.exe

MamaSpoldi 26-03-2014 14:25

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
At the risk of sounding ignorant in this excellent technical discussion...

Am I correct in thinking that even if we do not explicitly perform any NetworkTables operations that we could be affected by the bug in question and therefore need the update? We use the SmartDashboard only for simple display operations, eg. calling PutXXX to display values on the dashboard screen. It sounds like these operations use the NetworkTables behind the scenes and are therefore subject to this issue. So I wanted to verify if we need to install the update.

FYI, we are using C++ on the robot and the SmartDashboard on the driverstation.

Also, does this update include a change to the .jar file for the dashboard implementation which runs on the driverstation laptop or just a change to the library code built into the application that runs on the cRIO? I ask so that we know if it needs to be installed on the driverstation as well as the programming laptop.

Thanks.

virtuald 26-03-2014 14:43

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by MamaSpoldi (Post 1365158)
At the risk of sounding ignorant in this excellent technical discussion...

Am I correct in thinking that even if we do not explicitly perform any NetworkTables operations that we could be affected by the bug in question and therefore need the update? We use the SmartDashboard only for simple display operations, eg. calling PutXXX to display values on the dashboard screen. It sounds like these operations use the NetworkTables behind the scenes and are therefore subject to this issue. So I wanted to verify if we need to install the update.

FYI, we are using C++ on the robot and the SmartDashboard on the driverstation.

Also, does this update include a change to the .jar file for the dashboard implementation which runs on the driverstation laptop or just a change to the library code built into the application that runs on the cRIO? I ask so that we know if it needs to be installed on the driverstation as well as the programming laptop.

Thanks.

Yes, SmartDashboard operations in C++ use NetworkTables under the covers. You should apply the update.

The driver station SmartDashboard jar is not affected by this update.

Speaking to the update's stability: Our team uses python, which has an earlier version of the update applied. We did not experience a single freeze/lockup event during our regional last weekend.

MamaSpoldi 26-03-2014 15:20

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Dustin - Thanks for that clarification. :)


All times are GMT -5.