2023 NT4 Dashboard/Telemetry Update Feedback Thread

With the scheduled introduction of NT4 (the next version of NetworkTables, the FRC telemetry bus) in 2023, I’m trying to coordinate a matching rework of the robot-side telemetry binding code (Sendable) and a generalization of the Shuffleboard metadata format for consumption by other dashboards (eventually, we would like to move towards adopting FRC Web Components as the official dashboard). Additionally, there is some desire to support Oblog-esque annotations for binding telemetry data.

I’ve written a design document that outlines the current state of WPILib telemetry, describes the desired future directions, and describes loose specs for the features mentioned above. It’s at a point now where I feel comfortable presenting it to the community for feedback, as I did a few years ago with the command-based rewrite.

Feel free to comment here with your thoughts, criticisms, or suggestions. Even better - submit a comment on the document pull request. Remember, WPILib is a community project - the more interaction we get in the design process, the better we can address the actual needs of the community. The command-based rewrite went through substantial changes between the publishing of the design doc and the final implementation. If you have a concern, don’t hold it back!

P.S.: If any experienced students or mentors would like to volunteer to help implement any of this, let me know.

P.P.S.: Almost all vendor products implement Sendable, so deprecating it in favor of the TelemetryNode class described in the document is an (eventual) burden on vendor development. I don’t think the interfaces are different enough to make this port particularly painful (especially given how innately-painful Sendable already is), but feedback from people who actually maintain vendor code would be helpful!

7 Likes

This looks awesome. I love the transparent collaborative approach you’re taking with these huge efforts.

I’d love to contribute, though I’m not sure how much time I’ll have to give or where I can best assist. My professional background is strongly on the Java web app side of things, but I’m fairly well-versed in WPILib and am also a CSA (though my C++ is sorely lacking).

1 Like

Curious about the NT4 protocol, why a new protocol vs using existing protocols like MQTT or Sparkplug?

2 Likes

@Peter_Johnson is most-qualified to answer this, but off the top of my head: it has to interop with our on-RIO logging implementation, which is already shaped around NT, and all the previous tooling has been written around NT so preserving the existing idioms in at least some capacity is good for continuity.

WPILib’s in a kind of unique spot where our solutions have to be performant enough to work over fairly tight bandwidth limits and on a fairly idiosyncratic control system, but also approachable enough that students don’t find the entire thing confusing and useless.

2 Likes

As a self-proclaimed connoisseur of web-based dashboards, I have a plethora of thoughts and ideas to contribute. I’m afraid I have little to no availability for the actual software dev, but admittedly that’s probably a good thing, since I write web software like it’s 1999.

Will comment on GitHub with thoughts. Also you know where to find me.

3 Likes

The bandwidth limit only exists from the driver station to the robot, right? We have vision coprocessors, so why not a telemetry coprocessor that can host a simple TSDB + web dashboard?

The limiter is on the radio so anything on that network is being limited including the coprocessors.

The bandwidth limit is actually over the WiFi. So wired devices on your robot can all talk to each other without limits, but any of those devices talking over WiFi to the DS, or the DS talking to anything on the robot, will be limited.

1 Like

Unless this was provided in the KOP, I don’t like the idea of now requiring extra hardware to use a feature that previously worked fine without it. Especially now when acquiring said extra hardware requires 24/7 attention directed at a stock checking website or shelling out a bunch at a third party seller.

1 Like

Just like a vision coprocessor isn’t required to do vision, this wouldn’t be required to get telemetry.

It’s only a potential optimization for teams that wish to do it this way.

There are a lot of raspberry pi alternatives to be explored.

I’ve found a Wyse/Dell N3040 for $35. Photonvision is a little slow on it.

I am checking other alternatives. Plenty of mini PCs on Amazon that could probably run a bunch of things in one box.

1 Like

This is a step in the right direction, and I really like the metadata that provides information about what the data is. It seems like the robot side should describe what the data is, but be careful not to try too hard to describe how the data should be viewed. Dashboards may come and go, and some teams have custom dashboards based on reading data from NetworkTables. The way data is viewed is really a function of both the characteristics of the data (as supplied by the robot in the metadata) and the capabilities of the consumer of the data, which vary. The choice to display a floating-point value as a number, a bar, or a graph of the data over time should be a dashboard-level choice. However, this layout information is very often specific to the given robot. This includes both how data is viewed and how the data is organized (into tabs, for instance).

It feels like a good answer would be to have the dashboard generate view and layout information (JSON) that gets copied into the robot code and deployed to the robot, and then gets pushed from the robot to the dashboard during robot initialization. It still requires the team to get the dashboard the way they want it, and then save the layout information into the deploy directory of the robot. The robot code could then load the layout information and push it to the dashboard during robot initialization. Just an idea to keep the description of the data separate from the description of how to view the data.
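A minimal sketch of that flow, assuming a layout file named layout.json in the deploy directory and a `publish(topic, payload)` callback standing in for whatever the telemetry API ends up providing. The topic name, file name, and JSON shape are all hypothetical and not taken from the design document:

```python
import json
import tempfile
from pathlib import Path

LAYOUT_TOPIC = "/dashboard/layout"  # hypothetical topic name


def load_layout(path):
    """Read the dashboard-generated layout JSON saved into the deploy directory."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def push_layout(layout, publish):
    """Send the layout to the dashboard as one serialized string.

    `publish` is a stand-in for the real telemetry call (an assumption).
    """
    publish(LAYOUT_TOPIC, json.dumps(layout, sort_keys=True))


def demo():
    """Simulate robot init: write a layout file, load it, and 'publish' it."""
    with tempfile.TemporaryDirectory() as deploy_dir:
        layout_file = Path(deploy_dir) / "layout.json"
        layout_file.write_text(
            json.dumps({"tabs": [{"name": "Drive", "widgets": []}]}),
            encoding="utf-8",
        )
        sent = []
        push_layout(load_layout(layout_file),
                    lambda topic, payload: sent.append((topic, payload)))
        return sent
```

The point of the split is that the robot code only forwards the file; it never has to understand what is inside it.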

One thing that is not clear from the design document is the performance targets for the telemetry data. Is this design more about the framework for managing the data, and you will get what you get from NetworkTables? I have never really tried to characterize NetworkTables, but if you are running a robot with a 20 ms loop time, how many data elements can you publish per robot loop before you start losing data? Just curious if there are targets for performance (e.g. we can log 2048 unique double values per 20 ms robot loop without any degradation of the driver station performance or without losing data).

One additional capability in our homegrown solution is the ability to publish specific data only when the robot is disabled. This is useful in the pits for system checks between matches. It prevents the data from being pushed to NetworkTables during a match and consuming bandwidth, but allows this data to be seen in the pits. As a slight variation on this theme, we have some detailed data that we never send when we are connected to the FMS, as it is intended to be used only at our home field while debugging specific problems. Having an FTA as a lead mentor, we hear all of the horror stories of bandwidth problems on the field. To this end, we are probably more paranoid about this than we need to be.
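A rough sketch of that kind of gating, for illustration only. The `Audience` enum and `should_publish` function are hypothetical names, not part of WPILib or the design document, and the two flags are assumed to come from the driver station state:

```python
from enum import Enum


class Audience(Enum):
    ALWAYS = "always"                # published in every robot state
    DISABLED_ONLY = "disabled_only"  # pit system checks between matches
    NO_FMS = "no_fms"                # home-field debugging data


def should_publish(audience, robot_enabled, fms_attached):
    """Decide whether a value tagged with `audience` goes out this loop."""
    if audience is Audience.DISABLED_ONLY:
        return not robot_enabled
    if audience is Audience.NO_FMS:
        return not fms_attached
    return True
```

During a match (enabled, FMS attached) only `ALWAYS` values would be sent; in the pits with the robot disabled, the `DISABLED_ONLY` values show up as well.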

Finally, have you thought about metadata to describe windows of time? There are times on the robot where specific events can be detected, and you want to look at a plot from 2 seconds before the event until 2 seconds after the event. Or there are specific actions with a begin and end time, and you want to look at the data in this window. It would be nice to have a clean way to log named time values into the metadata (either points in time or ranges of time) that could (optionally) be used by a dashboard to quickly look at data at specific times.
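To make the idea concrete, here is one possible shape for such a marker and the query a dashboard might run against it. `TimeMarker` and `window` are hypothetical names, and samples are assumed to be `(timestamp_seconds, value)` pairs:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeMarker:
    name: str
    start: float  # seconds
    end: float    # equal to start for a single point in time


def window(samples, marker, pad=2.0):
    """Return the samples from `pad` seconds before the marker to `pad` after."""
    lo, hi = marker.start - pad, marker.end + pad
    return [(t, v) for t, v in samples if lo <= t <= hi]
```

For a point event at t = 5.0 s with the default padding, `window` keeps every sample stamped between 3.0 s and 7.0 s inclusive.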

Again, this is great work and please take my comments in the spirit of wanting to junk our home grown solution and use this superior solution as soon as it is available.

Thanks

1 Like

This is actually exactly what is described in the “dashboard metadata” section - a dashboard-neutral JSON that can be interpreted/consumed by whichever dashboard implementation teams decide to use. This is not necessarily tied to the NT data tree - I agree completely about separating view preferences from inherent metadata about the data.

1 Like

FRC has somewhat unique requirements that COTS pub/sub frameworks don’t support in a good way. Conversely, a lot of COTS pub/sub frameworks have high complexity to support features that aren’t useful in FRC. The core design criteria we’re looking for in a NT upgrade are:

  • Protect the field wireless network via rate limiting / update combining (avoid large quantities of small packets) (existing NT)
  • Robustness through Rio reboots / redeploy of user code (existing NT)
  • Robustness through client disconnects / reconnects, dynamic network configs/clients (existing NT)
  • Ease of use for inexperienced programmers (existing NT)
  • Persistent (saved across server reboots) and retained (saved across client disconnects) values (existing NT)
  • Ports configurable so it works with radio firewall (current NT)
  • Timestamped data changes and time synchronization
  • Support sending all data value changes (current NT only sends the latest change)
  • Pub/sub operation (only send value updates to subscribers to that value)
  • Compatible with web technologies (can be used directly from JavaScript in a web browser)
  • Metadata support for additional value/display information (existing NT only has a few flag bits)
  • Easy to compile/use on coprocessors and desktop sim environments (e.g. can run in-process, without separate services that need to be stopped/started)
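As an illustration of the first bullet (rate limiting / update combining), the sketch below coalesces changes per topic and releases at most one batch per period, so many small writes become one larger send. This is a simplified model with a hypothetical `UpdateCombiner` class, not the NT implementation, and it keeps only the latest value per topic:

```python
class UpdateCombiner:
    """Coalesce value changes and release them at most once per period."""

    def __init__(self, period):
        self.period = period     # minimum seconds between batches
        self._pending = {}       # topic -> latest value since the last flush
        self._last_flush = None  # time of the last released batch

    def set(self, topic, value):
        # A later write to the same topic overwrites the earlier one.
        self._pending[topic] = value

    def poll(self, now):
        """Return one combined batch if the period has elapsed, else None."""
        if not self._pending:
            return None
        if self._last_flush is not None and now - self._last_flush < self.period:
            return None
        batch, self._pending = self._pending, {}
        self._last_flush = now
        return batch
```

Supporting "send all value changes" would mean queuing every change per topic instead of overwriting, while still releasing the queue in one batch.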

MQTT is probably the closest to what we’re looking for, but it has a 137-page specification, supports 3 levels of QoS, has 15 message types, 27 standard properties, has full-custom wire encoding (it can be wrapped in WebSockets, but that’s just serving as a transport layer for the raw encoding), and requires a separate broker process, but does not have built-in support for timestamping or time synchronization, persistent values, rate limiting/update combining, etc. For comparison, NT4’s spec is <25 pages, uses JSON, MessagePack, and WebSockets for wire encoding, and has only 8 message types.

NT4’s wire layer is fairly efficient: a timestamped double is usually 17 bytes (8 byte value, 4 byte timestamp, plus topic ID and encoding overhead). Given that, 2048 doubles at a 20 ms rate would consume 13.9 Mb/s, significantly more than the bandwidth allocated by the radio. That’s the reason DataLog exists for on-robot telemetry logging. However, the only reason you actually need data at that high a rate is for plotting or offline analysis, which is why NT4 (unlike NT3) supports dashboards subscribing to individual values at a higher rate, rather than just having a global update rate for all values.
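The 13.9 Mb/s figure can be reproduced with a few lines of arithmetic. The 17-byte size is the "usual" figure quoted above; the function name is just for illustration:

```python
BYTES_PER_TIMESTAMPED_DOUBLE = 17  # usual size quoted in the post


def bandwidth_mbps(values_per_loop, loop_period_s,
                   bytes_per_value=BYTES_PER_TIMESTAMPED_DOUBLE):
    """Bandwidth in megabits per second for sending every value every loop."""
    loops_per_second = 1.0 / loop_period_s
    bits_per_loop = values_per_loop * bytes_per_value * 8
    return bits_per_loop * loops_per_second / 1e6


# 2048 values * 17 bytes * 8 bits * 50 loops/s = 13,926,400 bits/s
print(round(bandwidth_mbps(2048, 0.020), 1))  # 13.9
```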

3 Likes

Is there a design document (or something talking about the implementation) for NT4? I would love to read more and understand what is coming. I apologize if I missed it in previous threads. I tried looking around and did not find it.

Thanks

Still making some tweaks, but here’s the draft NT4 spec (note: this link will die after it’s merged to the main repo as part of allwpilib#3217, which is the implementation PR).

2 Likes

I read through this and have a couple of questions or clarifications. If a client publishes a topic, it provides a publish ID (pubuid), which is used to unpublish. It states that this ID is also used for MessagePack messages. When the server processes this publish method, it will send an announce message (even to the original publisher) which contains a topicID (id). For the client connection that originally published the topic, is this topicID the same as the pubUID? For other clients, it seems like the flow is that they see the topic in an announce message and stash away the mapping between topic name and topicID. They can then subscribe to an array of specific topic prefixes, which will cover zero or more topics. This then causes the MessagePack data to start flowing for those covered topics, with the topicID seen in the announce message indicating which topic the data belongs to.

So, what I am looking to clarify is that since the clients are independent and pubIDs can be duplicated across clients, does the server keep track of the connection and the pubID to send MessagePack data? And does the announce message for that client have the topicID set to the pubID? Or is the pubuid used when the client sends new data to the server for the topic, but if the server or another client updates the topic, the received data comes from the topicID in the announce message?

Second, when you describe the data structures that make up the message, you describe them as an array of maps. I think of a map in terms of a std::map, mapping keys to data objects. I think this really just feels like an array of data structures. For instance, $clients is just an array of pairs, with each pair having an “id” and a “conn” field? Am I missing something here? Just to be sure I completely understand.

This looks great and thanks for the efforts you put in on this. Hope my questions are not too much bother.

No. The pubUID is used by the client when sending MessagePack messages to the server (for topics it has published). The topicID is used by the server when sending MessagePack messages to the client. They are not the same (except by happenstance).

Yes, exactly.
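A simplified model of the bookkeeping being confirmed here: the server keys incoming data by (connection, pubuid) and uses its own topic ID for outgoing data. `TopicRouter` is a hypothetical name, the ID allocation scheme is invented for the example, and unpublish is not modeled:

```python
class TopicRouter:
    """Toy model of server-side ID bookkeeping; not the NT4 implementation."""

    def __init__(self):
        self._topic_ids = {}     # topic name -> server-assigned topic ID
        self._pub_to_topic = {}  # (connection, pubuid) -> topic ID

    def publish(self, conn, pubuid, name):
        """Handle a publish request; return the topic ID to announce."""
        # Allocate a new topic ID only the first time a name is seen.
        topic_id = self._topic_ids.setdefault(name, len(self._topic_ids))
        self._pub_to_topic[(conn, pubuid)] = topic_id
        return topic_id

    def resolve_incoming(self, conn, pubuid):
        """Map a pubuid on an incoming data message back to its topic ID."""
        return self._pub_to_topic[(conn, pubuid)]
```

Two clients can both pick pubuid 1 without conflict, because the lookup key includes the connection.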

Map in this document is a map of strings to values, aka a JSON object. Originally I used the name “object” for this in the spec (which is what JSON calls it), but others found this terminology confusing. $clients is an array of maps, and each map in the array has “id” and “conn” keys associated to the corresponding values for that information. This is pretty typical for JSON arrays of objects.
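For readers more used to std::map, this is what "an array of maps" looks like once decoded. It is shown as JSON for readability, and the "id" and "conn" values below are made up for the example:

```python
import json

# Hypothetical $clients payload: a JSON array whose elements are JSON objects.
clients_json = """
[
  {"id": "dashboard", "conn": "10.0.0.5:51234"},
  {"id": "coprocessor", "conn": "10.0.0.11:40022"}
]
"""

clients = json.loads(clients_json)

# Each element decodes to a dict (string keys -> values), i.e. a "map".
for client in clients:
    print(client["id"], client["conn"])
```

So each element behaves like a small record with named fields, which is why it reads as "an array of data structures".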

1 Like

Regardless of the higher-level protocol, shouldn’t Nagle handle this in the network stack?

No. We turn off Nagle (TCP_NODELAY) because we care about latency of updates, particularly from coprocessors. UDP introduces a whole slew of different challenges. Maybe in the future we could consider QUIC, but it’s not well supported in browsers yet (webtransport). We also looked at WebRTC to support peer-to-peer, but it doesn’t have good port control, so would be a pain for users.
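For reference, disabling Nagle is a single socket option. The sketch below uses Python's standard socket module rather than the NT code itself; the helper name is made up:

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY sends small writes immediately instead of waiting to
    # coalesce them, trading packet count for latency.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

With Nagle off, any batching of small updates has to happen in the protocol layer, which is where the update combining mentioned earlier comes in.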

1 Like

QUIC is superseded by HTTP/3 and is supported by the majority of browsers.

Have you looked into COAP?

https://coap.technology/spec.html

Or ZeroMQ?

Is browser support a P0 or P1?