Chief Delphi

Chief Delphi (http://www.chiefdelphi.com/forums/index.php)
-   C/C++ (http://www.chiefdelphi.com/forums/forumdisplay.php?f=183)
-   -   Serious bug identified in SmartDashboard/NetworkTables -- robot hangs (http://www.chiefdelphi.com/forums/showthread.php?t=126102)

wireties 25-02-2014 16:35

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by Alan Anderson (Post 1349573)
See http://digital.ni.com/public.nsf/all...2579FE0053DB93 for an explanation of why it is a VxWorks network issue.

Why would a port open for unicast UDP accumulate multicast packets (the most likely source of the unknown inbound data described by NI)? Or are the unexpected packets coming from SD? I wish I had time to look into this - the NI explanation seems shallow. Why is the VxWorks community at large not reporting this anomaly?

And if this is such a huge problem, why does the SD code not implement one of the workarounds? The NI bug report and workaround recommendation is nearly 2 years old.

Hey Mike, what year did Wind River switch stacks?

wireties 25-02-2014 16:42

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1349597)
The subject of this thread, NetworkTables, does not use UDP. It uses TCP. :)

Interesting - this makes the NI explanation less likely since the bug report says UDP multicast traffic is a likely source for the unexpected data. But it runs counter to Mike's (one of the most knowledgeable VxWorks experts on the planet) successful use of a UDP-focused patch to fix the problem.

Agree with the need for a rewrite - that is some nasty looking spaghetti code!

Joe Ross 25-02-2014 17:28

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by wireties (Post 1349606)
Interesting - this makes the NI explanation less likely since the bug report says UDP multicast traffic is a likely source for the unexpected data. But it runs counter to Mike's (one of the most knowledgeable VxWorks experts on the planet) successful use of a UDP-focused patch to fix the problem.

Realize that there are two separate issues being interwoven in this thread. Mike Anderson said they are not using NT and are instead using UDP, which RufflesRidge responded to and warned about the vxworks issue. No one implied that the vxworks issue is related to the NT issue that Dustin is reporting.

byteit101 25-02-2014 17:28

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1348325)
the C++ implementation of NetworkTables is a *truly* horrible and complex piece of code -- it's clear that it was the author's first experience with writing C++ code (and he admitted this himself when I ran into him at WPI Battle Cry last year). The library needs a rewrite.

Indeed, I heartily agree. (I know C++)

[anecdote]
Before last season I was helping the author clean it up and finish it so it could ship for the 2013 season. Towards the end of November I looked at the code and sent him a laundry list of suggestions, and among the responses were "Yes I created this in java. I did it in c++ to mirror the java api" and "Yah I know as I said I wasn't the right person to do this". As it was already the end of November, rewriting it was out of the question, so I attempted to clean it up a bit. I managed to clean up a few things, like removing the custom UTF16 string class. Then I moved over to making SFX, so I never got a chance to finish cleaning it up. I was hoping to with the C++11 project, but SFX took over my time, and there are not many good C++ devs. Sigh...
[/anecdote]

Anyway, I will attempt to move these patches along, though no guarantees.

virtuald 25-02-2014 17:29

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by wireties (Post 1349606)
But it runs counter to Mike's (one of the most knowledgeable VxWorks experts on the planet) successful use of a UDP-focused patch to fix the problem.

I believe Mike was discussing an internal solution that they created/used to communicate instead of using NetworkTables. Their solution apparently used UDP.

wireties 25-02-2014 17:40

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1349628)
I believe Mike was discussing an internal solution that they created/used to communicate instead of using NetworkTables. Their solution apparently used UDP.

My apologies - I missed this association.

But since the NT problem is TCP-related, it is not directly associated with the NI bug report, correct? Am I missing something? Is all this NT trouble simply that the TCP socket buffers are not serviced in a timely manner? This could happen in VxWorks or Linux (next year), correct?

I am tempted to go back a few years and build a custom dashboard in LabView.

taichichuan 25-02-2014 18:06

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Hi Keith et al,

I went back and checked when WRS purchased Interpeak. That was in the 2007 time frame and the Interpeak stack was integrated into VxWorks 6.5 (our FIRST code is based on VxWorks 6.3). So, our code appears to be the BSD stack. Also, digging back into my notes from the MUXLib code, there is an opportunity to exhaust the cluster buffers if the traffic is never read. This would affect any networking code regardless of the protocol being used. However, simply having a reader to gobble up the packets when you're not actually reading/using it would be one of the work-arounds. It certainly works for us.
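To illustrate the "reader that gobbles up the packets" workaround, here is a minimal sketch. It assumes a POSIX-style socket API where `recv` accepts `MSG_DONTWAIT`; whether that flag is available on the cRIO's VxWorks 6.3 stack is not confirmed here, and `DrainDatagrams` is a hypothetical helper name, not part of WPILib or VxWorks.

```cpp
#include <cerrno>
#include <cstddef>
#include <sys/types.h>
#include <sys/socket.h>

// Hypothetical helper: throw away up to `max_datagrams` datagrams that are
// already queued on the UDP socket `fd`, without ever blocking.
// Returns how many datagrams were discarded.
std::size_t DrainDatagrams(int fd, std::size_t max_datagrams) {
    char scratch[2048];  // contents are ignored; we only want to free buffers
    std::size_t dropped = 0;
    while (dropped < max_datagrams) {
        // MSG_DONTWAIT makes this one call non-blocking (assumed available).
        ssize_t n = recv(fd, scratch, sizeof(scratch), MSG_DONTWAIT);
        if (n >= 0) {
            ++dropped;   // one datagram consumed (zero-length is still one)
            continue;
        }
        if (errno == EINTR) {
            continue;    // interrupted by a signal, try again
        }
        break;           // EAGAIN/EWOULDBLOCK: queue is empty; else give up
    }
    return dropped;
}
```

Calling something like this once per control-loop iteration on every open UDP socket the program does not otherwise read from would keep unread traffic from piling up. The cap on `max_datagrams` keeps a flood of packets from stalling the loop.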

Now, on to the NT implementation... I've heard from many sources that the NT code was terribly flawed and needs to be rewritten. This has been confirmed on this thread. There was also a hack that came out last season that marked the underlying sockets as non-blocking to address horrible latency problems with the FMS that particularly affected C/C++-based 'bots. So, I think we can agree that this code needs to be fixed or tossed.

I'm curious as to why sockets, a technology that's been working for over 30 years and is the basis for nearly all networking code around the world, needed the network tables abstraction in the first place. The socket API is relatively trivial in comparison to the NT implementation. Did someone believe that the students couldn't handle network programming, so they needed to hide it for some reason? It sounds like another case of trying to abstract details away by adding more complexity. I've found that the students can be remarkably resourceful when confronted with such problems, especially one that is so easy to solve with a little Google-fu.

Sigh, let's hope that they don't bollux up the Linux implementation as well.

Thanks for all of the input on this one. It certainly helps me decide what to focus the students on for next year's preseason.

virtuald 25-02-2014 18:25

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by taichichuan (Post 1349640)
There was also a hack that came out last season that marked the underlying sockets as non-blocking to address horrible latency problems with the FMS that particularly affected C/C++-based 'bots.

Ha, I hadn't seen that last year. My patch also sets the socket to be non-blocking -- BUT I also fix the underlying lock contention which is causing the performance problem. One should generally never write to the network while holding a lock someone else might want to obtain. :)
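The "never write to the network while holding a lock" rule can be sketched roughly as follows. This is not the actual patch and not the NetworkTables code: `OutgoingQueue` and its methods are hypothetical names, and it uses C++11 `std::mutex` rather than the VxWorks primitives the 2014 WPILib would have used. The idea is simply to hold the lock long enough to take the pending data, then do the slow write with the lock released.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical outgoing message queue. Producers (e.g. robot code calling
// Put) only ever wait for a cheap push or swap, never for network I/O.
class OutgoingQueue {
public:
    void Put(std::string message) {
        std::lock_guard<std::mutex> guard(mutex_);
        pending_.push_back(std::move(message));
    }

    // `write` stands in for a possibly slow socket send.
    // Returns the number of messages handed to `write`.
    std::size_t Flush(const std::function<void(const std::string&)>& write) {
        std::vector<std::string> batch;
        {
            std::lock_guard<std::mutex> guard(mutex_);
            batch.swap(pending_);  // constant-time; lock covers only this
        }
        for (const std::string& message : batch) {
            write(message);        // lock is NOT held during the write
        }
        return batch.size();
    }

private:
    std::mutex mutex_;
    std::vector<std::string> pending_;
};
```

With this shape, a stalled connection delays only the thread calling `Flush`. Threads calling `Put` keep running, and anything they add is picked up by the next `Flush`.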

Quote:

I'm curious as to why sockets, a technology that's been working for over 30 years and is the basis for nearly all networking code around the world, needed the network tables abstraction in the first place. The socket API is relatively trivial in comparison to the NT implementation. Did someone believe that the students couldn't handle network programming, so they needed to hide it for some reason? It sounds like another case of trying to abstract details away by adding more complexity.

I personally really like the idea of transmitting information by key-value pairs. It's a much better abstraction to deal with than having to roll your own byte buffers. I think it's easier to explain to students too.

Additionally, there is value to everyone using the same protocol instead of rolling your own. Being able to use the same tools from multiple languages seems like a huge win. I particularly like the *idea* of SmartDashboard -- I just call PutNumber and the value magically shows up on the remote side. Makes debugging and tuning so much easier!

Joe Ross 25-02-2014 18:28

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by taichichuan (Post 1349640)
There was also a hack that came out last season that marked the underlying sockets as non-blocking to address horrible latency problems with the FMS that particularly affected C/C++-based 'bots. So, I think we can agree that this code needs to be fixed or tossed.

I don't remember that. Are you sure you aren't thinking of the patch that came out and enabled the Nagle algorithm?

taichichuan 25-02-2014 18:46

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Ahh, yes... That was it. Pardon my poor memory. It was the Nagle algorithm issue they addressed last season. As I recall, it added ~100ms of latency to the link. Again, this was related to NT. The thread is here:

http://www.chiefdelphi.com/forums/sh...socket+latency

virtuald 28-02-2014 23:38

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
A mentor on the team I worked with in Massachusetts was at a district competition today, and noted that a number of teams seem to have freezing/stuttering problems on the field. While he didn't verify whether the teams were having other problems (which may very well be the case), it sounds similar to the problems I was experiencing before the patch.

Since FIRST has not released even an optional official fix for this, I've decided to post an unofficial fixed binary to the original bug report (direct download link), for those who may not know how to fix the problem themselves. For anyone looking at the Driver Station versions, it will show up in diagnostics in 'Lib:' as 'C++ 2014 NT Fix'.

Unfortunately, I haven't been able to verify it on a cRio as I don't have access to one at the moment -- if any of you can do this, that would be great. However, the original WPILib binary is 13.0 MB, and this one is 13.1 MB, so I expect that it should work, as it worked for RobotPy's version of WPILib when we used it.

Greg McKaskle 02-03-2014 13:28

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
I didn't write the article on ni.com, but perhaps I can help decode it.

My Take-away: If you open a TCP or UDP port on vxworks and someone writes to it, your program should read from it or it will eventually fill the communications buffer and interfere with communication even on other ports and protocols. This is not true on other OSes LV runs on and not true of other VxWorks versions, but is true of the current version of VxWorks that NI supports on the cRIO.

It sounds like the original bug report may have involved an unexpected arrival of datagrams. The author opened the port to use for outgoing datagrams. The author did not expect it to receive datagrams, never read from the port, and discovered that this would eventually lead to the symptoms listed. The suggested workaround in the article is to read from the port or close and reopen to flush unexpected datagrams even on ports you assume to be write-only.

Since FIRST robots are generally on a controlled network, I don't think the suggestion is necessary. In reference to Einstein, what took place a few years ago involved data from a coprocessor intentionally writing to a UDP port on the cRIO. The thread responsible for reading from it was sometimes spinning, waiting for a sensor value to stabilize. The unattended UDP port filled the buffer and prevented communication on other ports that would have allowed communication to the cRIO -- including the ability to reboot the cRIO. There is of course no way to know that this was exactly what took place on Einstein on that particular robot. But the code would loop indefinitely with a bad or disconnected sensor. It fit the symptoms, and was determined to be the most likely explanation for what was observed on that particular robot.
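The failure mode described above (a loop that waits forever for a sensor to settle while a socket goes unread) can be avoided by bounding the wait and servicing I/O inside it. The sketch below is purely illustrative and is not the code from the robot in question: `WaitForStable`, `read_sensor`, and `service_io` are hypothetical names, and the polling budget is counted in iterations to keep the example simple.

```cpp
#include <cmath>
#include <functional>

// Hypothetical helper: poll `read_sensor` until two consecutive readings
// differ by at most `tolerance`, giving up after `max_polls` attempts.
// `service_io` is called on every poll so that pending network traffic
// (for example an incoming UDP socket) keeps being read while we wait.
// Returns true if the reading settled, false if the budget ran out.
bool WaitForStable(const std::function<double()>& read_sensor,
                   const std::function<void()>& service_io,
                   double tolerance,
                   int max_polls) {
    double previous = read_sensor();
    for (int i = 0; i < max_polls; ++i) {
        service_io();  // never let the wait starve the network
        double current = read_sensor();
        // A NaN reading (e.g. from a bad sensor) fails this comparison,
        // so it can never be mistaken for a stable value.
        if (std::fabs(current - previous) <= tolerance) {
            return true;
        }
        previous = current;
    }
    return false;  // caller decides how to proceed without the sensor
}
```

A caller that gets `false` back can fall back to a default or skip the sensor-dependent step, rather than hanging with the port unattended.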

To the original topic, the original SD protocol was even more complex and was quite difficult to implement. In fact, I decided not to release the LV implementation because I wasn't comfortable with its reliability. The next year, we removed a number of features, simplifying the implementation, and released all three languages. SD offers an alternative to sockets or TCP/UDP. Teams may choose any of these forms of communication on open ports, and since port 80 is open, they could use other forms such as web services.

The issue that affected the field last year in week one was caused by a flood of tiny single byte TCP packets in the C++ implementation. The short-term solution was to allow the OS to buffer the writes using the Nagle algorithm. I don't know if this is still enabled or if the writes were refactored to transmit larger transaction buffers the way the LV implementation does.
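For readers unfamiliar with the two options mentioned here: the Nagle algorithm is the TCP stack's own coalescing of small writes (it is switched off with the `TCP_NODELAY` socket option), while the refactoring alternative is to assemble a whole transaction in memory and hand it to the socket in one call. A minimal sketch of the second approach follows. `CoalescingWriter` is a hypothetical class, not the WPILib or LabVIEW implementation, and the `Sink` callback stands in for a real `send` on a TCP socket.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical writer that accumulates small writes in memory and passes
// them to the underlying sink as one buffer when Flush() is called.
class CoalescingWriter {
public:
    // Stand-in for a socket send: receives a pointer and a byte count.
    using Sink = std::function<void(const unsigned char*, std::size_t)>;

    explicit CoalescingWriter(Sink sink) : sink_(std::move(sink)) {}

    void WriteByte(unsigned char b) { buffer_.push_back(b); }

    void Write(const unsigned char* data, std::size_t len) {
        buffer_.insert(buffer_.end(), data, data + len);
    }

    // Hand everything buffered so far to the sink in a single call.
    void Flush() {
        if (buffer_.empty()) {
            return;  // nothing to send, so no empty write is issued
        }
        sink_(buffer_.data(), buffer_.size());
        buffer_.clear();
    }

private:
    Sink sink_;
    std::vector<unsigned char> buffer_;
};
```

Calling `Flush()` once per complete message means each message reaches the socket in a single write, without depending on the stack's Nagle timing to merge single-byte writes.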

I was in San Antonio this weekend, and we saw lockup issues with one C++ team making heavy use of SD and a Java DB. The team chose to disable SD usage and their symptoms seem to have disappeared. Plenty of other teams use SD in C++, Java, and LV in various DB combinations. I'm not aware of other lockup reports from San Antonio. This will be investigated further. I'm sure Brad and the WPI folks appreciate the help with the C++ implementation.

Greg McKaskle

Aren Siekmeier 03-03-2014 01:37

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1351240)
Unfortunately, I haven't been able to verify it on a cRio as I don't have access to one at the moment -- if any of you can do this, that would be great.

We've been working through this problem too, discussed in this thread. Here's what we've come up with so far:

Quote:

Originally Posted by compwiztobe (Post 1352419)
We applied this patch, and rebuilt and redeployed our code. It appeared to improve the problem at first in that it gave the same FDIO errors as we saw previously, about as frequently as we had been getting the timeout errors, and we did not appear to be hanging. However (and we don't really understand what happened here), after a very short time, it started timing out immediately when enabled, giving the same error at line 117 of MotorSafetyHelper.cpp. One interesting thing to note is that when disabled all these errors stopped, so it seems to still be making it through the control loop in IterativeRobot.cpp (StartCompetition()). We have most of these error messages logged, so as soon as someone has access to those text files we will post those.

We have ported all of our code to Java and LabView, and neither of these implementations seems to show the same problem, though we hope to run the bot into the ground a little more over the next few days to tease any more errors out of it. Some C++ code that doesn't use Commands or the SmartDashboard/NetworkTables is in the works, so testing that may also give some insight. We will probably stick with the Java implementation for the rest of the season.


JamesTerm 03-03-2014 12:13

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1339935)
A bug report has been filed on the WPILib tracker here: http://firstforge.wpi.edu/sf/go/artf1719

Thanks VirtualD for this find... I wish I could have worked with you in beta this season, as I have also made improvements to the Network Tables... I originally wanted to make the win32 port, but then as I got it working I needed the shutdown to work properly and discovered lockups during disconnect and reconnect stresses. So I then focused on one issue... and that was how the parent class deletes the child class... instead of the child class trying to delete itself within its own thread. I also worked with the author to provide a shutdown procedure and ensure there were no memory leaks. We worked together on this over the summer... I've attached https://www.dropbox.com/s/iv3rozae2q...tDashboard.ppt and https://www.dropbox.com/s/9p6jhhnt8y...ard_Client.ppt, object-oriented diagrams that help me navigate through the code.

I wish I had run into Greg in San Antonio (that is where I live btw)... as our robot has also fallen victim to this symptom in 3 matches. One of the FTA guys has captured our log but I haven't heard back from him.

I did not get into the guts of the code, but now I'd like to review your changes and see what can be done to get an official fix for all teams. Thanks so much again... I can't begin to tell you how frustrating this has been... when the team looks at me and asks why our robot is failing... but hey, that's ok... we can work this out... I really needed some good testers to test the changes made over the summer... so I'm hoping to hook up with everyone who's had a hand in the Network Tables code, and try to get this fixed properly and made reliable!

RyanShoff 03-03-2014 13:47

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
I believe we went dead in our first match (and the first match of the regional) due to this bug. The drive team tethered and charged the pneumatics with a dashboard running. I'm not entirely sure if they changed anything on the dashboard. Then they unplugged the tether and placed the robot on the field. The robot never moved in autonomous or teleop until the driver rebooted the cRIO through the driver station. The field people said everything looked ok to them.

After this happened, we added a policy of doing a hard power-off reboot after placing the robot and not turning on the dashboard during competition (we don't need it during a match). The problem never happened again and we didn't change anything else.

We also had two incidents of unintended acceleration before bag and tag, which might be related.

This is just a heads up to other teams. We aren't going to look into it further since we have a working process. Losing that match turned out not to be a big deal.

It also could just have been the robot having first match jitters.


All times are GMT -5.