Shuffleboard reliability issues

Hey all,

Has anyone been having issues with Shuffleboard lately? For seemingly no reason, some of the values we try to put on it just won’t appear, while others will infinitely place themselves in the same square and cause performance issues. The strange thing about this is that the bugs only occur sometimes, and it’s never with the same entries.

We’ve gone over the code multiple times and are confident that it’s not a software bug on our end, and we still haven’t been able to intentionally recreate the bug. It just seems to happen at random. Probably the weirdest thing of all is that none of these issues happen when we simulate the robot. Every simulation has worked perfectly fine thus far, which leads me to suspect that there’s something wrong with NetworkTables or some other connectivity-related problem.

This is definitely not a “you” problem. This is definitely due to the fact that shuffleboard is not as well maintained as the other things in wpilib.

The usual best practice (which didn’t always work for us) is to open a fresh instance of Shuffleboard after connecting to the robot.

This practice at least helps us boot up as expected 80% of the time.

Are you seeing this with 2023.4.1? (Both robot side and Shuffleboard side)

No, we haven’t moved to 4.1 release yet. Probably tonight or tomorrow for the students.

Has happened since the beginning of the season, all the way through 2023.3.2.

Robot side I have not seen it crash on memory since I updated to 2023.4.1, but Shuffleboard is still doing some strange things.

The values on the widgets on the shuffleboard will hang, but if you expand the tab to see the raw network table values they are still changing. Closing Shuffleboard and reopening seems to make the widget values work again.

Ok. Sounds like there’s still an issue (likely on the Shuffleboard side) that we need to track down then. How consistently does it occur? It could be worth running Shuffleboard from the console (java -jar shuffleboard.jar) to see if there are any exceptions being reported when this occurs.

It seemed to be pretty consistent if we had Shuffleboard open for a while. We were working on tuning our subsystems, so we had the same Shuffleboard session open and were redeploying code for a good amount of time until the widgets froze. This evening I will try to run it through the console to see if we see anything.

All the values I print out freeze to 0 roughly every half dozen times I deploy code. It is seemingly random and I have to restart Shuffleboard to get it to work.

I think this is expected, but it also uses about 60% of my CPU when a graph is made.

This is with 2023.4.1 on both robot and Shuffleboard? I want to make sure we are isolating to issues that are still present in the latest version.

Yes, the high CPU usage with plots is a known problem that was also present last year.

Apologies, this was on 2023.1.1. I will update today and see if I get the same issues.

We have been seeing a problem with Shuffleboard that is still happening on 2023.4.2 (robot and SB). What we see is that after a period of time of successful deploys and test runs, you will approach the DS to run a test and SB widgets will be blank, though they had data moments before. Normally, the widgets always display the last data sent. We have to restart SB to get our widgets to display data. This happens one or two times a day. Haven’t observed the data disappear, it’s just gone the next time you look at it. Have seen this after DS PC has been closed overnight, but not always. Also observe this after a deploy, but no problem on most deploys. When the widgets are blank, SB indicates Network tables are connected. Been trying to define this issue further without success.

This is a scary proposition for competition.

We have also experienced this, and the only solution found so far is to restart the laptop whenever all values get set to zero at random and cannot be updated. It often seems to occur when we add new values to SmartDashboard (though we can’t be certain: this is not reproducible and could be bias in when we check the layout). The blank version of the bug has occurred less often than the zeroing, but still at random intervals. An overnight sleep seems all but guaranteed to cause this, and if time allows we will test this deliberately and update with the results. To speculate, this could be some kind of disconnect between the SmartDashboard and DS, or a failure of the communication component on either. We have not observed it fail live.

When we have this issue, we rarely have to restart the laptop, but it does require that we kill Shuffleboard AND restart the robot code. Once the robot code has run, opening a new Shuffleboard is back to normal.

I am not sure if it is a related issue, but we have been having a similar problem where Glass or SmartDashboard will randomly refuse to update values. For example, when tuning PID loops over Glass or SmartDashboard, changing a value randomly fails to take effect: the original value is displayed even after trying to change it. After repeating the same thing 3-7 times, the value will eventually be set. No idea what is causing the problem.

That definitely shouldn’t be happening. A potential theory is that timestamps are somehow getting desynchronized in such a way that the current value on the server is seen as newer, but it would have to be significantly offset. Are you setting the same entry from multiple dashboards?
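
To illustrate the timestamp theory, here is a minimal sketch of a "newest timestamp wins" rule. It is a toy model, not the actual NetworkTables code: the `Entry` class, `try_set` function, and `clock_offset_us` parameter are all hypothetical names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Toy stand-in for a server-side entry (hypothetical, not an NT class)."""
    value: float
    timestamp_us: int  # time the current value was written, in microseconds


def try_set(entry: Entry, new_value: float, client_time_us: int,
            clock_offset_us: int = 0) -> bool:
    """Apply a client write only if its offset-adjusted timestamp is newer."""
    adjusted = client_time_us + clock_offset_us
    if adjusted <= entry.timestamp_us:
        return False  # treated as stale: the value already on the server wins
    entry.value = new_value
    entry.timestamp_us = adjusted
    return True


# Healthy case: the client's write is stamped later than the server's value.
kp = Entry(value=0.10, timestamp_us=5_000_000)
accepted = try_set(kp, 0.25, client_time_us=6_000_000)

# Desynchronized case: the client's clock is assumed to be 2 s behind,
# so the same write looks older than the current value and is ignored.
kp_desync = Entry(value=0.10, timestamp_us=5_000_000)
rejected = try_set(kp_desync, 0.25, client_time_us=6_000_000,
                   clock_offset_us=-2_000_000)
```

In this model the write only fails when the offset is larger than the gap between writes, which matches the point that the desync would have to be significant.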

Note in Glass you have to hit enter for a value to be set, clicking away will result in the old value reappearing. You probably know this, but it can be easy to miss when setting multiple things.

It sounds like there’s some Shuffleboard oddity that needs further investigation too.

So normally we only have Glass and the Driver Station running. We have tried with both SmartDashboard and Glass running and with just Glass running. Same issue, but it is hard to test because it appears to happen randomly.

And yes, we changed the enter key on Glass via the drop-down to prevent the DS from disabling, and that is how we have been setting values via Glass.

The first thing that came to mind as a possible issue is that I do not know what version of the game tools they are running on the PC, whereas I am sure that the most up-to-date RIO image and WPILib are running. So I’m not sure if that could cause a potential issue. (It’s hard to keep track of all the students’ different game tools installs.)

Is there any diagnostic data that would be helpful to you that I could send over the next time I see the issue?

I have not had to reboot PC or robot, just restart Shuffleboard. Note we are using SB in NT3 support mode (if that is correct terminology). No time to look at converting to NT4 and until this issue appeared, staying on NT3 seemed to work fine. Not sure when this started, but definitely on 2023 WPILib.

Note that when this happens, data is still being updated in NT (seen with OutlineViewer), just not being recognized by SB. When this happens, all of the SB widgets are blank, so it’s not just a matter of not updating new data: the widgets are cleared.

Widgets in shuffleboard will be disabled if they think the data source behind them is inactive (i.e. the NT key they’re bound to isn’t being sent any more, either because NetworkTables is disconnected or the entry is just outright missing). That’s what this bit of code is handling.

The sources drawer on the left-hand side is not backed by those data sources. Basically, it looks at the NT changes in realtime to get a set of all the unique tables and entries and displays them in a nested tree format. If this is updating, then it’s unlikely to be NetworkTables doing funky things, but the data source thinking that the backing NetworkTable entry has gone away. This could be caused by:

  1. The entry is unpublished (i.e. deleted by the publisher [robot, coprocessor]). This is intended behavior, since we don’t want to imply you can control a value that’s been explicitly removed by the owner - either by it being turned off or deleting the entry outright.
  2. The value in the entry is no longer the same type as what’s currently held by the data source. This is also intended behavior, since there’s no guarantee that clients connected to those data sources would support the new data type. There’s an argument to be made that it should be up to the widget - not the data source - to disable itself if the backing data changes to an incompatible form.

What might be happening is that shuffleboard sees a bunch of “unpublish” events, marks the data sources as inactive, then doesn’t re-initialize the state of those sources when values start coming back, keeping them stuck in that inactive state. I don’t think power cycling the robot would fix it; shuffleboard would clear out its cached data sources, but doesn’t disconnect the widgets (however, creating a new widget from an entry in the sources tree should work fine after the robot starts back up).

One sticking point on this theory is that I’d expect this behavior on every code deploy or power cycle, which clearly isn’t the case. There might be a race condition somewhere that’s muddying things, or maybe there’s a scenario I’m not seeing in which certain NT events are missed and don’t bring the sources back online.

I have done zero testing for any of this; it’s all based on what I remember from when I was designing the widget/source model back in 2017 (wow! it’s been a while) and from looking at the source code on github. I could be wrong on the specific cause, but the files I linked are very likely to be where the bug would live if it’s on the shuffleboard side.
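
The "stuck inactive" theory above can be sketched as a toy model. This is not Shuffleboard's actual code: the `DataSource` class, its `reactivate_on_value` flag, and the event names are all hypothetical, chosen only to show how a source that is never re-initialized would leave a widget blank even though values keep arriving.

```python
class DataSource:
    """Toy stand-in for a dashboard data source bound to one NT key."""

    def __init__(self, key, reactivate_on_value):
        self.key = key
        self.active = True
        self.value = None
        # Hypothetical switch: True models the intended behavior,
        # False models the suspected bug.
        self.reactivate_on_value = reactivate_on_value

    def on_unpublish(self):
        # The publisher removed the entry, so the source goes inactive.
        self.active = False

    def on_value(self, value):
        if not self.active:
            if not self.reactivate_on_value:
                return  # suspected bug: update dropped, source stays inactive
            self.active = True
        self.value = value

    def display(self):
        # A disabled widget shows nothing, like the blank widgets reported.
        return self.value if self.active else None


def replay(source, events):
    """Feed a list of (kind, payload) events into a source and return it."""
    for kind, payload in events:
        if kind == "unpublish":
            source.on_unpublish()
        else:
            source.on_value(payload)
    return source


# A redeploy, roughly: a value, an unpublish, then values coming back.
events = [("value", 1.0), ("unpublish", None), ("value", 2.0)]
stuck = replay(DataSource("/SmartDashboard/kP", reactivate_on_value=False), events)
healthy = replay(DataSource("/SmartDashboard/kP", reactivate_on_value=True), events)
```

With the same event sequence, `stuck.display()` stays blank while `healthy.display()` shows the new value. As noted above, a model this simple would fail on every deploy, so the real cause (if it is this) presumably involves a race or a missed event.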

We’ve been struggling with this too. During our evening sessions in the shop we’ll need to restart Shuffleboard maybe 5-10 times. Some widgets will keep updating, but others will show stale data, or just 0. If we choose New from the File menu, only some of the widgets repopulate. We’ve been upgrading with each WPILib update and we’re currently running 2023.4.2.

Occasionally, Shuffleboard will disconnect so we’ll restart it. After restarting it, it’ll appear to be working, but a flood of messages will print out in riolog. The only way we’ve found to stop this is to restart/redeploy the robot code.

NT: DISCONNECTED NT4 client 'shuffleboard@4' (from 10.70.28.185:58794): remote end closed connection
NT: NT4 socket error: operation canceled
NT: Got a NT4 connection from 10.70.28.185 port 58797
NT: CONNECTED NT4 client 'shuffleboard@4' (from 10.70.28.185:58797)
NT: DISCONNECTED NT4 client 'shuffleboard@4' (from 10.70.28.185:58797): remote end closed connection
NT: NT4 socket error: operation canceled
NT: Got a NT4 connection from 10.70.28.185 port 58798
NT: CONNECTED NT4 client 'shuffleboard@4' (from 10.70.28.185:58798)
NT: DISCONNECTED NT4 client 'shuffleboard@4' (from 10.70.28.185:58798): remote end closed connection
NT: NT4 socket error: operation canceled
NT: Got a NT4 connection from 10.70.28.185 port 58800
NT: CONNECTED NT4 client 'shuffleboard@4' (from 10.70.28.185:58800)
NT: DISCONNECTED NT4 client 'shuffleboard@4' (from 10.70.28.185:58800): remote end closed connection
NT: NT4 socket error: operation canceled
NT: Got a NT4 connection from 10.70.28.185 port 58801
NT: CONNECTED NT4 client 'shuffleboard@4' (from 10.70.28.185:58801)
NT: DISCONNECTED NT4 client 'shuffleboard@4' (from 10.70.28.185:58801): remote end closed connection
NT: NT4 socket error: operation canceled

How many topics are you publishing and with what data in them? And just to reconfirm you’re running 2023.4.2 on both client and server? The cause of the reconnect loop is that there’s a 1 second timeout on the server side when it tries to send data over the network and can’t (because the transmit buffer is full). There’s an initial burst of traffic when the dashboard connects as it subscribes to all topics and gets the initial set of values. For this to take longer than 1 second, either the network has to be really bad or you have a tremendous amount of data to send.

Actually… this is the client side initiating the disconnect. Hmm. Typically that is only due to the client trying multiple server addresses and disconnecting from the ones it doesn’t need after it successfully connects to one. If you change the NT server address setting in Shuffleboard to the robot IP (rather than a team number) does it fix the reconnect loop?
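
For anyone wanting to quantify a reconnect loop like the one in the riolog output above, here is a rough sketch that counts connects and disconnects per client. The regular expressions are written against the log lines quoted in this thread only, so treat the exact message format as an assumption; `summarize` is a hypothetical helper, not part of any WPILib tool.

```python
import re

# Patterns based on the riolog lines quoted in this thread (format assumed).
CONNECTED = re.compile(
    r"NT: CONNECTED NT4 client '([^']+)' \(from ([\d.]+):(\d+)\)")
DISCONNECTED = re.compile(
    r"NT: DISCONNECTED NT4 client '([^']+)' \(from ([\d.]+):(\d+)\): (.*)")


def summarize(lines):
    """Return per-client connect/disconnect counts and disconnect reasons."""
    stats = {}
    for raw in lines:
        line = raw.strip()
        m = DISCONNECTED.match(line)
        if m:
            entry = stats.setdefault(
                m.group(1), {"connects": 0, "disconnects": 0, "reasons": set()})
            entry["disconnects"] += 1
            entry["reasons"].add(m.group(4))
            continue
        m = CONNECTED.match(line)
        if m:
            entry = stats.setdefault(
                m.group(1), {"connects": 0, "disconnects": 0, "reasons": set()})
            entry["connects"] += 1
    return stats


# A few lines copied from the log posted above.
sample = [
    "NT: DISCONNECTED NT4 client 'shuffleboard@4' (from 10.70.28.185:58794): remote end closed connection",
    "NT: NT4 socket error: operation canceled",
    "NT: Got a NT4 connection from 10.70.28.185 port 58797",
    "NT: CONNECTED NT4 client 'shuffleboard@4' (from 10.70.28.185:58797)",
    "NT: DISCONNECTED NT4 client 'shuffleboard@4' (from 10.70.28.185:58797): remote end closed connection",
]
stats = summarize(sample)
```

Many disconnects for one client name with the reason "remote end closed connection" would point at the client side closing the connection, which is the case described above.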