Serious bug identified in SmartDashboard/NetworkTables -- robot hangs

My apologies - I missed this association.

But the fact that the NT problem is TCP-related is not directly associated with the NI bug report, correct? Am I missing something? Is all this NT trouble simply that the TCP socket buffers are not serviced in a timely manner? This could happen in VxWorks or Linux (next year), correct?

I am tempted to go back a few years and build a custom dashboard in LabView.

Hi Keith et al,

I went back and checked when WRS purchased Interpeak. That was in the 2007 time frame and the Interpeak stack was integrated into VxWorks 6.5 (our FIRST code is based on VxWorks 6.3). So, our code appears to be the BSD stack. Also, digging back into my notes from the MUXLib code, there is an opportunity to exhaust the cluster buffers if the traffic is never read. This would affect any networking code regardless of the protocol being used. However, simply having a reader to gobble up the packets when you’re not actually reading/using it would be one of the work-arounds. It certainly works for us.

Now, on to the NT implementation… I’ve heard from many sources that the NT code was terribly flawed and needs to be rewritten. This has been confirmed on this thread. There was also a hack that came out last season that marked the underlying scockets as non-blocking to address horrible latency problems with the FMS that particularly affected C/C+±based 'bots. So, I think we can agree that this code needs to be fixed or tossed.

I’m curious as to why sockets, a technology that’s been working for over 30 years and is the basis for nearly all networking code around the world, needed the network tables abstraction in the first place. The socket API is relatively trivial in comparison to the NT implementation. Did someone believe that the students couldn’t handle network programming, so they needed to hide it for some reason? It sounds like another case of trying to abstract details away by adding more complexity. I’ve found that the students can be remarkably resourceful when confronted with such problems. Especially one that is so easy to solve with a little Google Foo.

Sigh, let’s hope that they don’t bollux up the Linux implementation as well.

Thanks for all of the input on this one. It certainly helps me decide what to focus the students on for next year’s preseason.

Ha, I hadn’t seen that last year. My patch also sets the socket to be non-blocking – BUT I also fix the underlying lock contention which is causing the performance problem. One should generally never write to the network while holding a lock someone else might want to obtain. :slight_smile:

I’m curious as to why sockets, a technology that’s been working for over 30 years and is the basis for nearly all networking code around the world, needed the network tables abstraction in the first place. The socket API is relatively trivial in comparison to the NT implementation. Did someone believe that the students couldn’t handle network programming, so they needed to hide it for some reason? It sounds like another case of trying to abstract details away by adding more complexity.

I personally really like the idea of transmitting information by key-value pairs, it’s a much better abstraction to deal with than having to roll your own byte buffers. I think it’s easier to explain to students too.

Additionally, there is value to everyone using the same protocol instead of rolling your own. Being able to use the same tools from multiple languages seems like a huge win. I particularly like the idea of SmartDashboard – I just call PutNumber and the value magically shows up on the remote side. Makes debugging and tuning so much easier!

I don’t remember that. Are you sure you aren’t thinking of the patch that came out and enabled the Nagle algorithm?

Ahh, yes… That was it. Pardon my poor memory. It was the Nagle Algorithm issue they address last season. As I recall, it added ~100ms of latency to the link. Again this was related to NT. The thread is here:

http://www.chiefdelphi.com/forums/showthread.php?t=114665&highlight=socket+latency

A mentor on the team I worked with in Massachusetts was at a district competition today, and noted that a number of teams seem to have freezing/stuttering problems on the field. While he didn’t verify that the teams were having other problems (which very may well could be the case), it sounds similar to the problems I was experiencing before the patch.

Due to FIRST not releasing at least an optional official fix for this, I’ve decided to post an unofficial fixed binary to the original bug report (direct download link), for those who may not know how to fix the problem themselves. For anyone looking at the Driver Station versions, it will show up in diagnostics in ‘Lib:’ as ‘C++ 2014 NT Fix’

Unfortunately, I haven’t been able to verify it on a cRio as I don’t have access to one at the moment – if any of you can do this, that would be great. However, the original WPILib binary is 13.0 MB, and this one is 13.1 MB, so I expect that it should work, as it worked for RobotPy’s version of WPILib when we used it.

I didn’t write the article on ni.com, but perhaps I can help decode it.

My Take-away: If you open a TCP or UDP port on vxworks and someone writes to it, your program should read from it or it will eventually fill the communications buffer and interfere with communication even on other ports and protocols. This is not true on other OSes LV runs on and not true of other VxWorks versions, but is true of the current version of VxWorks that NI supports on the cRIO.

It sounds like the original bug report may have involved an unexpected arrival of datagrams. The author opened the port to use for outgoing datagrams. The author did not expect it to receive datagrams, never read from the port, and discovered that this would eventually lead to the symptoms listed. The suggested workaround in the article is to read from the port or close and reopen to flush unexpected datagrams even on ports you assume to be write-only.

Since FIRST robots are generally on a controlled network, I don’t think the suggestion is necessary. In reference to Einstein, what took place a few years ago involved data from a coprocessor intentionally writing to a UDP port on the cRIO. The thread responsible for reading from it was sometimes spinning, waiting for a sensor value to stabilize. The unattended UDP port filled the buffer and prevented communication on other ports that would have allowed communication to the cRIO – including the ability to reboot the cRIO. There is of course no way to know that this was exactly what took place on Einstein on that particular robot. But the code would loop indefinitely with a bad or disconnected sensor. It fit the symptoms, and was determined to be the most likely explanation for what was observed on that particular robot.

To the original topic, the original SD protocol was even more complex and was quite difficult to implement. in fact, I decided not to release the LV implementation because I wasn’t comfortable with its reliability. The next year, we removed a number of features, simplifying the implementation, and released all three languages. SD offers an alternative to sockets or TCP/UDP. Teams may choose any of these forms of communication on open ports, and since port 80 is open, they could use other forms such as web services.

The issue that affected the field last year in week one was caused by a flood of tiny single byte TCP packets in the C++ implementation. The short-term solution was to allow the OS to buffer the writes using the Nagle algorithm. I don’t know if this is still enabled or if the writes were refactored to transmit larger transaction buffers the way the LV implementation does.

I was in San Antonio this weekend, and we saw lockup issues with one C++ team making heavy use of SD and a Java DB. The team chose to disable SD usage and their symptoms seem to have disappeared. Plenty of other teams use SD in C++, Java, and LV in various DB combinations. I’m not aware of other lockup reports from San Antonio. This will be investigated further. I’m sure Brad and the WPI folks appreciate the help with the C++ implementation.

Greg McKaskle

We’ve been working through this problem too, discussed in this thread. Here’s what we’ve come up with so far:

Thanks VirtualD for this find… I wish I could have worked with you in beta this season, as I have as well made improvements to the Network Tables… I originally wanted to make the win32 port, but then as I got it working I needed the shutdown to work properly and discovered lockups during disconnect and reconnect stresses. So I then focused on one issue… and that was how the parent classes deletes the child class… instead of the child class trying to delete itself within its own thread. I also worked with the author to provide a shutdown procedure and ensure there were no memory leaks. We worked together on this over the summer… I’ve attached https://www.dropbox.com/s/iv3rozae2qhynj6/SmartDashboard.ppt and https://www.dropbox.com/s/9p6jhhnt8ya6vup/SmartDashboard_Client.ppt an object oriented diagram that helps me navigate through the code.

I wish I’d have ran into Greg in San Antonio (that is where I live btw)… as our Robot has also fallen victim to this symptom in 3 matches. One of the FTA guys has captured our log but I haven’t heard back from him.

I did not get into the details of guts of the code, but now I’d like to review your changes, and see what can be done to get some official fix for all teams. Thanks so much again… I can’t begin to tell you how frustrating this has been… when the team looks at me and ask why our robot is failing… but hey that’s ok… we can work this out… I really needed some good testers to test the changes made over the summer… so I’m hoping to hook up with everyone who’s had a hand in the Network Tables code, and try to get this fixed properly and be reliable!

I believe we went dead in our first match (and the first match of the regional) due to this bug. The drive team tethered and charged the pneumatics with a dashboard running. I not entirely sure if they changed anything on the dashboard. Then they unplugged the tether and placed the robot on the field. The robot never moved in autonomous or teleop until the driver rebooted the crio through the driverstation. The field people said everything looked ok to them.

After this happened, we added a policy of a doing a hard poweroff reboot after placing the robot and not turning on the dashboard during competition. (we don’t need it in during a match). The problem never happened again and we didn’t change anything else.

We also had two incidents of unintended acceleration before bag and tag, which might be related.

This is a just a heads up to other teams. We aren’t going to look into it further since we have a working process. Losing that match turned out not to be a big deal.

It also could just have been the robot having first match jitters.

Wow. That is a mess. Yet more evidence that it needs to be rewritten. :slight_smile:

Yes, that would be an ideal long-term goal if anyone is willing to do it… but for now. I am hoping we can find a short-term quick turn around solution that fixes the root cause and get an official fix for all teams. (I’m going to re-read what Greg posted on that, once I start the code review with the patched changes). I’m keeping an eye on the test results from compwiztobe as well.

I believe one of the issues with this design is how closely it reflects the JAVA language version, where JAVA does not need to manage memory… hence the red arrows in the diagram (something I never used in my designs) show that objects are not being created and destroyed in the same place. This made it somewhat more difficult to track down the memory leaks. I think the redesign should be c++ based design which does abide by c++ conventions and have clean objects that manage memory properly… and then port this to JAVA… going in that direction… JAVA can simply ignore all calls to deletes, or interface them to do nothing… etc. Anyhow I just wanted to point this out for anyone else who is code reviewing.

This is really the right way to do it since the JVM is often implemented in C++…

Thanks for posting this… this is a great fix!

I believe this is the other symptom we saw when we first encountered these problems this year. See the gibberish resulting from two error messages writing to NetConsole simultaneously shown in this post.


write error: : read error: : S_errno_EPIPE
S_errno_ETIMEDOUT
[NNTT]]  IIOOEExxcceeppttiioonn  mmeessssaaggee::  CEorurlodr  noont  FwDrIiOt er eaaldl
 [bNyTt]e s0 xt2o0 ef8dd 1s8t reenatme
r[eNdT ]c o0nxn2e8cet8ido1n8  setnatteer:e dS EcRoVnEnRe_cEtRiRoOnR
state: SERVER_ERROR
[NT] Close: 0x28e8d18

Yes I was able to reproduce that as well, and I’ve confirmed the patch work has fixed this issue… today (9 Mar 14) I have updated this https://www.dropbox.com/s/dbpv32qii7i7jet/JamesDustinFix.zip file to include that patch… along with the previous fixes I talked about before… it is doing much better now. There is still one minor issue left… I’m going to see if I can fix it.

I want to expand a bit on this… one thing I have realized later in life is that time gets shorter and it seems like code is never done. There are so many things I’d have liked to have done in code, but can’t find the time to do them. So as we sharpen our skills in coding we have less time to do it (especially if you have kids). So with this kind of perspective I’ve learned that we try to bring the most functionality forward even though it may not be perfect. When I was younger… I’d love to re-invent the wheel as I could learn from it, but then at some point I learned how to read other people’s code and figure out what they intended to do. This is perhaps the most important skill for a software engineer as there is so much existing code out there… it is important as this is needed to be able to work with a team of engineers, and deal with any imperfections, or coding style differences… when all is said and done… it will be just a few lines of code here and there… corrected, and this Network Tables will become a reliable solution. Grant it, there is some code out there that cannot be salvaged, but this code doesn’t fall into that category as I think it was well thought out… in spite of the nit-picky things I mentioned previously. I am a pragmatic person who works with imperfect code all the time… we get it done enough to get the job done… if it works… and works reliably then we win… without re-inventing and losing time and going through another cycle of bugs to fix.

Disagree. There are architectural changes that should be made to the code. It’s far more complex than what is needed to do the task at hand, and it’s very difficult to follow. Because of the complexity, changes that are made are at a higher risk of breaking other things in subtle ways.

Additionally, there is a lot a ‘bad smelling code’ – stupid things like casting pointers to references and back again (which is related to the author wanting to use ‘new’ for everything – which wasn’t always necessary), using a union to hold multiple types without any reliable way to determine what is in the union, one byte read/writes to sockets … and this is just the beginning.

Because of these (and other things), most of the code should be scrapped. It will be easier to rewrite the code correctly than to try and fix its problems.

The problems should already be fixed… thanks for what you posted on these as well! :slight_smile: Given this status, whether or not it is easier does not matter to me… what matters is how long will it take. For those who have time to rewrite that is wonderful… but as for me, I have other code to write! :wink:

There is just one minor issue left I want to get done hopefully before Dallas, but this issue should be invisible to users even in its current state. All of your other points are valid, but until someone takes the time to do a rewrite… this is all we got, so I’m going to make it work for our team, and anyone else who wants to use it. (Up to this point… it was your goal as well).

Once it is proven to be reliable… any new rewrite will introduce new potential risk, and unfortunately it is difficult to find people who have time to test this properly… otherwise we’d have fixed it before now. There is value in code that has been well-tested in spite of its imperfections… if it is proven reliable… that’s really all that any user cares about. Have you seen my changes? The author had the same idea on this fix as well.

Sure, it does work, and our team is planning on using it in its current state. However, I don’t intend to rewrite it either, I also don’t have the time at the moment. :wink:

NetworkTables is a really useful idea, and so I think it would be worth it for the maintainer of the code (eg, FIRST/WPI) to make the code better – there’s a lot of ways it could be improved. However, if they’re going to make improvements in the future, then my recommendation is to rewrite it – and create some unit tests/etc. If done properly, it will make it a lot easier to fix any such bugs in the future.