Chief Delphi

Chief Delphi (http://www.chiefdelphi.com/forums/index.php)
-   C/C++ (http://www.chiefdelphi.com/forums/forumdisplay.php?f=183)
-   -   Serious bug identified in SmartDashboard/NetworkTables -- robot hangs (http://www.chiefdelphi.com/forums/showthread.php?t=126102)

virtuald 03-03-2014 14:40

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by JamesTerm (Post 1352605)
We worked together on this over the summer... I've attached https://www.dropbox.com/s/iv3rozae2q...tDashboard.ppt and https://www.dropbox.com/s/9p6jhhnt8y...ard_Client.ppt an object oriented diagram that helps me navigate through the code.

Wow. That is a mess. Yet more evidence that it needs to be rewritten. :)

JamesTerm 03-03-2014 17:47

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1352741)
Wow. That is a mess. Yet more evidence that it needs to be rewritten. :)

Yes, that would be an ideal long-term goal if anyone is willing to do it... but for now. I am hoping we can find a short-term quick turn around solution that fixes the root cause and get an official fix for all teams. (I'm going to re-read what Greg posted on that, once I start the code review with the patched changes). I'm keeping an eye on the test results from compwiztobe as well.

I believe one of the issues with this design is how closely it reflects the JAVA language version, where JAVA does not need to manage memory... hence the red arrows in the diagram (something I never used in my designs) show that objects are not being created and destroyed in the same place. This made it somewhat more difficult to track down the memory leaks. I think the redesign should be c++ based design which does abide by c++ conventions and have clean objects that manage memory properly... and then port this to JAVA... going in that direction... JAVA can simply ignore all calls to deletes, or interface them to do nothing... etc. Anyhow I just wanted to point this out for anyone else who is code reviewing.

Aren Siekmeier 04-03-2014 00:11

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by JamesTerm (Post 1352838)
I think the redesign should be c++ based design which does abide by c++ conventions and have clean objects that manage memory properly... and then port this to JAVA... going in that direction... JAVA can simply ignore all calls to deletes, or interface them to do nothing... etc. Anyhow I just wanted to point this out for anyone else who is code reviewing.

This is really the right way to do it since the JVM is often implemented in C++...

JamesTerm 09-03-2014 15:28

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1348325)
PS: In case you're interested, I found another obscure bug in NetworkTables tonight, that causes buffer overflows on my linux box. If you've ever wondered why you see gibberish in Netconsole when a NetworkTables client disconnects, I found out why.

Thanks for posting this... this is a great fix!

Aren Siekmeier 09-03-2014 21:08

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1348325)
PS: In case you're interested, I found another obscure bug in NetworkTables tonight, that causes buffer overflows on my linux box. If you've ever wondered why you see gibberish in Netconsole when a NetworkTables client disconnects, I found out why.

I believe this is the other symptom we saw when we first encountered these problems this year. See the gibberish resulting from two error messages writing to NetConsole simultaneously shown in this post.

Code:

write error: : read error: : S_errno_EPIPE
S_errno_ETIMEDOUT
[[NNTT]]  IIOOEExxcceeppttiioonn  mmeessssaaggee::  CEorurlodr  noont  FwDrIiOt er eaaldl
 [[bNyTt]e s0 xt2o0 ef8dd 1s8t reenatme
r[eNdT ]c o0nxn2e8cet8ido1n8  setnatteer:e dS EcRoVnEnRe_cEtRiRoOnR
state: SERVER_ERROR
[NT] Close: 0x28e8d18


JamesTerm 09-03-2014 21:27

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by compwiztobe (Post 1356095)
I believe this is the other symptom we saw when we first encountered these problems this year. See the gibberish resulting from two error messages writing to NetConsole simultaneously shown in this post.

Yes I was able to reproduce that as well, and I've confirmed the patch work has fixed this issue... today (9 Mar 14) I have updated this https://www.dropbox.com/s/dbpv32qii7...sDustinFix.zip file to include that patch... along with the previous fixes I talked about before... it is doing much better now. There is still one minor issue left... I'm going to see if I can fix it.

JamesTerm 10-03-2014 11:20

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by JamesTerm (Post 1352838)
Yes, that would be an ideal long-term goal if anyone is willing to do it...

I want to expand a bit on this... one thing I have realized later in life is that time gets shorter and it seems like code is never done. There are so many things I'd have liked to have done in code, but can't find the time to do them. So as we sharpen our skills in coding we have less time to do it (especially if you have kids). So with this kind of perspective I've learned that we try to bring the most functionality forward even though it may not be perfect. When I was younger... I'd love to re-invent the wheel as I could learn from it, but then at some point I learned how to read other people's code and figure out what they intended to do. This is perhaps the most important skill for a software engineer as there is so much existing code out there... it is important as this is needed to be able to work with a team of engineers, and deal with any imperfections, or coding style differences... when all is said and done... it will be just a few lines of code here and there... corrected, and this Network Tables will become a reliable solution. Grant it, there is some code out there that cannot be salvaged, but this code doesn't fall into that category as I think it was well thought out... in spite of the nit-picky things I mentioned previously. I am a pragmatic person who works with imperfect code all the time... we get it done enough to get the job done... if it works... and works reliably then we win... without re-inventing and losing time and going through another cycle of bugs to fix.

virtuald 10-03-2014 11:34

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by JamesTerm (Post 1356450)
Grant it, there is some code out there that cannot be salvaged, but this code doesn't fall into that category as I think is was well thought out... in spite of the nit-picky things I mentioned previously.

Disagree. There are architectural changes that should be made to the code. It's far more complex than what is needed to do the task at hand, and it's very difficult to follow. Because of the complexity, changes that are made are at a higher risk of breaking other things in subtle ways.

Additionally, there is a *lot* a 'bad smelling code' -- stupid things like casting pointers to references and back again (which is related to the author wanting to use 'new' for everything -- which wasn't always necessary), using a union to hold multiple types without any reliable way to determine what is in the union, one byte read/writes to sockets … and this is just the beginning.

Because of these (and other things), most of the code should be scrapped. It will be easier to rewrite the code correctly than to try and fix its problems.

JamesTerm 10-03-2014 12:06

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1356464)
Because of these (and other things), most of the code should be scrapped. It will be easier to rewrite the code correctly than to try and fix its problems.

The problems should already be fixed... thanks for what you posted on these as well! :) Given this status, whether or not it is easier does not matter to me... what matters is how long will it take. For those who have time to rewrite that is wonderful... but as for me, I have other code to write! ;)

There is just one minor issue left I want to get done hopefully before Dallas, but this issue should be invisible to users even in its current state. All of your other points are valid, but until someone takes the time to do a rewrite... this is all we got, so I'm going to make it work for our team, and anyone else who wants to use it. (Up to this point... it was your goal as well).

Once it is proven to be reliable... any new rewrite will introduce new potential risk, and unfortunately it is difficult to find people who have time to test this properly... otherwise we'd have fixed it before now. There is value in code that has been well-tested in spite of its imperfections... if it is proven reliable... that's really all that any user cares about. Have you seen my changes? The author had the same idea on this fix as well.

virtuald 10-03-2014 12:15

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Sure, it does work, and our team is planning on using it in its current state. However, I don't intend to rewrite it either, I also don't have the time at the moment. ;)

NetworkTables is a really useful idea, and so I think it would be worth it for the maintainer of the code (eg, FIRST/WPI) to make the code better -- there's a lot of ways it could be improved. However, if they're going to make improvements in the future, then my recommendation is to rewrite it -- and create some unit tests/etc. If done properly, it will make it a lot easier to fix any such bugs in the future.

JamesTerm 10-03-2014 12:26

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1356489)
NetworkTables is a really useful idea, and so I think it would be worth it for the maintainer of the code (eg, FIRST/WPI) to make the code better -- there's a lot of ways it could be improved. However, if they're going to make improvements in the future, then my recommendation is to rewrite it -- and create some unit tests/etc. If done properly, it will make it a lot easier to fix any such bugs in the future.

I agree with that... it seems now that all efforts are currently focused on SmartDashboard 2.0. That's a whole another subject I don't want to get into now.

It should be noted that I've been using a win32 version of network tables port quite a bit... I've found it even useful in NewTek development to watch variables that dynamically change. That said... it has never once locked up or crashed... so this is partly why I could never find the issue... I can't reproduce it in win32 environment. I believe the last issue remaining deals with the time it takes to connect to the time it takes to make the initial first write. I'm thinking of putting a sleep in there as well as taking a closer look at ConnectionMonitorThread::run()... I suspect this thread may not be sleeping in some cases... but I could be wrong.

NotInControl 10-03-2014 16:17

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Hello,

I am joing this converstation late. We just competed in our first district event this past weekend and notices some strange anomolies which maybe contributed to the problem identified here.

We use smartdashboard to send booleans back to a dashboard to let the drivers know certain events or robot states have been reached.

A few times during pit testing we noticed that we could not successfully command the robot. The robot was in a hung state. Comms were up, but the dashboard was frozen, and while a button press did register on the default dash, no response was displayed by the robot. Restarting the Robot, AND restarting the driverstation seemed to be the only way to get arround this and re-establish comms.

I was just writing to make sure I understand the bug because we use Java and it was unclear if this problem was just for C++, on the client side/ ther server side, both...

Based on what I read it appears the crux of the problem is that although NT is multithreaded, it holds on to a lock during a write sequence. A write sequence which also blocks and keeps the lock if the write fails. The robot thread uses this same lock to push data, thus causing the hang on the robot. Is this all correct?

My question is the code to obtain the lock on the robotside in NT or in SmartDash? If you were to call the smartDashboard putXXX methods in a separate thread, would that not at least prevent the hang?

I have not spent any time looking into the NT/SD implemenation, yet, but I will.

Again we are using Java on the robot, and the pyNetorkTables port provided by Dustin on the driverstation.

Regards,
Kevin

JamesTerm 10-03-2014 17:32

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by NotInControl (Post 1356714)
Based on what I read it appears the crux of the problem is that although NT is multithreaded, it holds on to a lock during a write sequence. A write sequence which also blocks and keeps the lock if the write fails. The robot thread uses this same lock to push data, thus causing the hang on the robot. Is this all correct?
Kevin

Yes


Quote:

Originally Posted by NotInControl (Post 1356714)
My question is the code to obtain the lock on the robotside in NT or in SmartDash? If you were to call the smartDashboard putXXX methods in a separate thread, would that not at least prevent the hang?

I have not spent any time looking into the NT/SD implemenation, yet, but I will.

The fix is similar to this idea... because the lock used in the call to Put() spans just the read and writes of the mapped variable and ID. The other work it does is non-blocked. The same is true for using the Get() methods too. SmartDashboard Get and Put calls (at least for c++ code) are just an interface/wrapper to the Network Tables. I have not looked at the Java port, but since this was developed from a Java perspective... I'll bet it is nearly the same as the c++ port.

virtuald 10-03-2014 21:00

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by NotInControl (Post 1356714)
I was just writing to make sure I understand the bug because we use Java and it was unclear if this problem was just for C++, on the client side/ ther server side, both…

The problem will affect both the client and server side.

I believe the problem is most severe on C++, but that similar problems may affect the java side, as others have reported less severe problems in Java. I've only glanced at the java version recently, and a cursory look shows the java code does not hold the lock the same way that the C++ code, but my expectation is that there is a similar bug somewhere, given that it's written by the same author.

Quote:

Based on what I read it appears the crux of the problem is that although NT is multithreaded, it holds on to a lock during a write sequence. A write sequence which also blocks and keeps the lock if the write fails. The robot thread uses this same lock to push data, thus causing the hang on the robot. Is this all correct?
Yup.

Quote:

My question is the code to obtain the lock on the robotside in NT or in SmartDash? If you were to call the smartDashboard putXXX methods in a separate thread, would that not at least prevent the hang?
That should prevent the hang.

virtuald 10-03-2014 21:03

Re: Serious bug identified in SmartDashboard/NetworkTables -- robot hangs
 
Quote:

Originally Posted by virtuald (Post 1356925)
The problem will affect both the client and server side.

It is worth noting that I've never actually tried to get the client to hang -- but I think they use the same code to write to the socket, so I would expect the problem to exist there.


All times are GMT -5. The time now is 03:50.

Powered by vBulletin® Version 3.6.4
Copyright ©2000 - 2017, Jelsoft Enterprises Ltd.
Copyright © Chief Delphi