Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
Quote:
And if this is such a huge problem, why does the SD code not implement one of the workarounds? The NI bug report and workaround recommendation are nearly two years old. Hey Mike, what year did Wind River switch stacks? |
Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
Quote:
Agree with the need for a rewrite - that is some nasty looking spaghetti code! |
Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
Quote:
[anecdote] Before last season I was helping the author clean it up and finish it so it could ship for the 2013 season. Towards the end of November I looked at the code and sent him a laundry list of suggestions, and among the responses were "Yes I created this in java. I did it in c++ to mirror the java api" and "Yah I know as I said I wasn't the right person to do this". As it was already the end of November, rewriting it was out of the question, so I attempted to clean it up a bit. I managed to clean a few things up, like removing the custom UTF16 string class, among other things. Then I moved over to making SFX, so I never got a chance to clean it up. I was hoping to with the C++11 project, but SFX took over my time, and there are not many good C++ devs. Sigh... [/anecdote] Anyway, I will attempt to move these patches along, though no guarantees. |
Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
Quote:
But the fact that the NT problem is TCP-related is not directly associated with the NI bug report, correct? Am I missing something? Is all this NT trouble simply that the TCP socket buffers are not serviced in a timely manner? This could happen in VxWorks or Linux (next year), correct? I am tempted to go back a few years and build a custom dashboard in LabVIEW. |
Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
Hi Keith et al,
I went back and checked when WRS purchased Interpeak. That was in the 2007 time frame, and the Interpeak stack was integrated into VxWorks 6.5 (our FIRST code is based on VxWorks 6.3). So, our code appears to be the BSD stack. Also, digging back into my notes from the MUXLib code, there is an opportunity to exhaust the cluster buffers if the traffic is never read. This would affect any networking code regardless of the protocol being used. However, simply having a reader to gobble up the packets, even when you're not actually using the data, would be one of the workarounds. It certainly works for us.
Now, on to the NT implementation... I've heard from many sources that the NT code was terribly flawed and needs to be rewritten. This has been confirmed on this thread. There was also a hack that came out last season that marked the underlying sockets as non-blocking to address horrible latency problems with the FMS that particularly affected C/C++-based 'bots. So, I think we can agree that this code needs to be fixed or tossed.
I'm curious as to why sockets, a technology that's been working for over 30 years and is the basis for nearly all networking code around the world, needed the network tables abstraction in the first place. The socket API is relatively trivial in comparison to the NT implementation. Did someone believe that the students couldn't handle network programming, so they needed to hide it for some reason? It sounds like another case of trying to abstract details away by adding more complexity. I've found that the students can be remarkably resourceful when confronted with such problems, especially one that is so easy to solve with a little Google-fu. Sigh, let's hope that they don't bollux up the Linux implementation as well.
Thanks for all of the input on this one. It certainly helps me decide what to focus the students on for next year's preseason. |
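For anyone wanting to try the "reader to gobble up the packets" workaround described above, here is a minimal sketch in Python. It is not WPILib or NetworkTables code, and the `SocketDrainer` class and its `discarded` counter are hypothetical names made up for this illustration. It assumes a UDP socket that the program only intends to write to, and it simply reads and throws away whatever arrives so the receive buffers never fill up.

```python
import select
import socket
import threading


class SocketDrainer:
    """Hypothetical helper: reads and discards whatever arrives on a socket."""

    def __init__(self, sock, poll_interval=0.1):
        self.sock = sock
        self.poll_interval = poll_interval  # seconds between stop-flag checks
        self.discarded = 0                  # datagrams read and thrown away
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        # Stop the thread before closing the socket it is watching.
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            # Wait until data is pending or the poll interval expires.
            readable, _, _ = select.select([self.sock], [], [], self.poll_interval)
            if readable:
                self.sock.recv(65535)  # contents are intentionally ignored
                self.discarded += 1
```

Typical use would be to create the socket, call `start()` once, keep sending on the socket from the main code as usual, and call `stop()` before closing it. Whether a separate thread or a periodic non-blocking read fits better depends on how the robot code is structured.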
Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
Quote:
Quote:
Additionally, there is value to everyone using the same protocol instead of rolling your own. Being able to use the same tools from multiple languages seems like a huge win. I particularly like the *idea* of SmartDashboard -- I just call PutNumber and the value magically shows up on the remote side. Makes debugging and tuning so much easier! |
Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
Ahh, yes... That was it. Pardon my poor memory. It was the Nagle algorithm issue they addressed last season. As I recall, it added ~100ms of latency to the link. Again, this was related to NT. The thread is here:
http://www.chiefdelphi.com/forums/sh...socket+latency |
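For readers unfamiliar with the Nagle algorithm: it makes the TCP stack hold back small writes so they can be coalesced into fewer packets, which trades latency for fewer tiny packets on the wire. On most socket APIs it is controlled per socket with the `TCP_NODELAY` option. The sketch below shows how that option is toggled and queried in Python. The helper names `set_nagle` and `nagle_enabled` are made up for this illustration, and nothing here says which setting the NT code actually uses.

```python
import socket


def set_nagle(sock, enabled):
    """Turn Nagle's algorithm on or off for a TCP socket.

    TCP_NODELAY = 1 disables Nagle (small writes go out immediately);
    TCP_NODELAY = 0 leaves Nagle on (small writes may be coalesced).
    """
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0 if enabled else 1)


def nagle_enabled(sock):
    """Return True if Nagle's algorithm is currently active on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) == 0
```

Calling `set_nagle(sock, False)` on a control link favors low latency, while leaving Nagle on favors fewer packets when the sender issues many tiny writes.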
Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
A mentor on the team I worked with in Massachusetts was at a district competition today, and noted that a number of teams seem to have freezing/stuttering problems on the field. While he didn't verify whether the teams were having other problems (which may very well be the case), it sounds similar to the problems I was experiencing before the patch.
Since FIRST has not released even an optional official fix for this, I've decided to post an unofficial fixed binary to the original bug report (direct download link), for those who may not know how to fix the problem themselves. For anyone looking at the Driver Station versions, it will show up in diagnostics in 'Lib:' as 'C++ 2014 NT Fix'. Unfortunately, I haven't been able to verify it on a cRIO as I don't have access to one at the moment -- if any of you can do this, that would be great. However, the original WPILib binary is 13.0 MB, and this one is 13.1 MB, so I expect that it should work, as it worked for RobotPy's version of WPILib when we used it. |
Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
I didn't write the article on ni.com, but perhaps I can help decode it.
My take-away: If you open a TCP or UDP port on VxWorks and someone writes to it, your program should read from it, or it will eventually fill the communications buffer and interfere with communication even on other ports and protocols. This is not true on other OSes LV runs on and not true of other VxWorks versions, but is true of the current version of VxWorks that NI supports on the cRIO.
It sounds like the original bug report may have involved an unexpected arrival of datagrams. The author opened the port to use for outgoing datagrams. The author did not expect it to receive datagrams, never read from the port, and discovered that this would eventually lead to the symptoms listed. The suggested workaround in the article is to read from the port, or close and reopen it, to flush unexpected datagrams even on ports you assume to be write-only. Since FIRST robots are generally on a controlled network, I don't think the suggestion is necessary.
In reference to Einstein, what took place a few years ago involved data from a coprocessor intentionally writing to a UDP port on the cRIO. The thread responsible for reading from it was sometimes spinning, waiting for a sensor value to stabilize. The unattended UDP port filled the buffer and prevented communication on other ports that would have allowed communication to the cRIO -- including the ability to reboot the cRIO. There is of course no way to know that this was exactly what took place on Einstein on that particular robot. But the code would loop indefinitely with a bad or disconnected sensor. It fit the symptoms, and was determined to be the most likely explanation for what was observed on that particular robot.
To the original topic, the original SD protocol was even more complex and was quite difficult to implement. In fact, I decided not to release the LV implementation because I wasn't comfortable with its reliability. The next year, we removed a number of features, simplifying the implementation, and released all three languages. SD offers an alternative to sockets or TCP/UDP. Teams may choose any of these forms of communication on open ports, and since port 80 is open, they could use other forms such as web services.
The issue that affected the field last year in week one was caused by a flood of tiny single-byte TCP packets in the C++ implementation. The short-term solution was to allow the OS to buffer the writes using the Nagle algorithm. I don't know if this is still enabled or if the writes were refactored to transmit larger transaction buffers the way the LV implementation does.
I was in San Antonio this weekend, and we saw lockup issues with one C++ team making heavy use of SD and a Java DB. The team chose to disable SD usage and their symptoms seem to have disappeared. Plenty of other teams use SD in C++, Java, and LV in various DB combinations. I'm not aware of other lockup reports from San Antonio. This will be investigated further. I'm sure Brad and the WPI folks appreciate the help with the C++ implementation. Greg McKaskle |
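To illustrate the difference between a flood of single-byte writes and one larger transaction buffer, here is a rough Python sketch. It is not taken from any of the SD/NT implementations, and `BatchedWriter` and its `send_calls` counter are hypothetical names used only for this example. The idea is to accumulate the small pieces of a message in memory and hand them to the socket in a single call.

```python
import socket


class BatchedWriter:
    """Hypothetical helper: collects small writes and sends them in one call."""

    def __init__(self, sock):
        self.sock = sock
        self._pending = bytearray()  # bytes queued since the last flush
        self.send_calls = 0          # how many times the socket was written

    def write(self, data):
        # Queue the bytes in memory; nothing touches the network yet.
        self._pending += data

    def flush(self):
        # Hand everything queued so far to the socket in one sendall() call.
        if self._pending:
            self.sock.sendall(bytes(self._pending))
            self.send_calls += 1
            self._pending.clear()
```

With this shape, a message built from a hundred one-byte fields results in one socket write at `flush()` time instead of a hundred, regardless of whether Nagle is enabled on the connection.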
Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
Quote:
I wish I had run into Greg in San Antonio (that is where I live, btw)... as our robot has also fallen victim to this symptom in 3 matches. One of the FTA guys has captured our log, but I haven't heard back from him. I did not get into the guts of the code, but now I'd like to review your changes and see what can be done to get some official fix for all teams. Thanks so much again... I can't begin to tell you how frustrating this has been... when the team looks at me and asks why our robot is failing... but hey, that's ok... we can work this out... I really needed some good testers to test the changes made over the summer... so I'm hoping to hook up with everyone who's had a hand in the Network Tables code, and try to get this fixed properly and made reliable! |
Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
I believe we went dead in our first match (and the first match of the regional) due to this bug. The drive team tethered and charged the pneumatics with a dashboard running. I'm not entirely sure if they changed anything on the dashboard. Then they unplugged the tether and placed the robot on the field. The robot never moved in autonomous or teleop until the driver rebooted the cRIO through the Driver Station. The field people said everything looked ok to them.
After this happened, we added a policy of doing a hard power-off reboot after placing the robot and not turning on the dashboard during competition (we don't need it during a match). The problem never happened again and we didn't change anything else. We also had two incidents of unintended acceleration before bag and tag, which might be related. This is just a heads-up to other teams. We aren't going to look into it further since we have a working process. Losing that match turned out not to be a big deal. It also could just have been the robot having first-match jitters.
Copyright © Chief Delphi