Introducing: Logging with AdvantageKit

One of 6328’s key projects this offseason was to develop our own Java framework for data logging in robot code. We created AdvantageKit, a library that we plan to expand with a variety of useful tools for building robot code.

Lots of logging frameworks already exist in FRC, from built-in tools like the Driver Station and Network Tables to custom solutions for logging more detailed information. However, most systems exist around the same fundamental idea; record a limited set of values explicitly provided by the code. That could include sensor data, PID error, odometry pose, internal state, etc. While this has enormous value, it doesn’t solve the eternal sentiment when something goes wrong: “If only we were logging one extra field!”

Logging in AdvantageKit is built around a different principle, often used in industry. Instead of logging specific output values from the user code, it records all of the data flowing into the robot code. Every sensor value, joystick press, Network Tables update, and much more is logged every loop cycle. After a match, these values can be replayed to the robot code in a simulator. Since every input command is the same, all of the internal logic of the code is replayed exactly. This allows us to log extra fields after the fact, or add breakpoints and inspect the code line by line. This technique means that logging is more than just a tool for checking on specific issues; it’s also a safety net that can be used to verify how any part of the code functions.

This certainly isn’t a new idea in FRC, but one of our key goals was to find ways that we could support this high level of logging in a way that might be accessible to more teams. That means reducing the changes required to user code as much as possible, and building on top of existing frameworks (like command-based WPILib). Data is recorded and replayed in the background wherever possible. For inputs not built in to WPILib, hardware interaction in each subsystem is separated into an “IO layer” with an interface and one or more implementations. This structure allows the logging framework to inject input data in the simulator while using a default implementation that doesn’t require real hardware. As a bonus, it also makes it simple to replace the “real” hardware implementation with a physics simulator.

We created a custom viewer application called Advantage Scope, which reads data saved by the robot code in our custom binary format. The app includes visualization tools like line graphs and field maps, plus a few extra tricks; it can download log files from a RIO over SFTP, view data live over the network, or convert the data to a CSV for other tools.

As part of 6328’s efforts with the #openalliance, all of our code is open source and available on GitHub. Most of the critical components already include documentation, and we’re working on filling in the gaps. Make sure to also keep an eye out for 6328’s 2022 build thread, where we’ll be sure to give updates on our progress with AdvantageKit (and lots of other software topics). We hope this work proves helpful (or at least interesting) to other teams thinking about options for data logging in robot code.

Useful Links

35 Likes

This looks awesome, great work!

I’m curious if you have any benchmarks showing impacts to the processing load. I presume this logs on the Rio directly, so there’s guaranteed to be some amount of impact, especially since file I/O isn’t exactly known for being speedy.

1 Like

It’s tough to get an exact estimate, as the performance overhead is largely dependent on the amount of data being logged. Based on our testing with a “reasonable” project, logging adds about 2-3ms per loop cycle. As much work as possible is performed in threads (including reading DS data, encoding data to the binary format, and file IO), which are timed to usually run while the main thread is asleep between cycles. The biggest performance hit is currently reading data from Network Tables (which is often ~1ms per cycle on its own). We’re looking at potential ways to cut down that time, but an easy workaround is to log only essential subtables/fields. That has the potential to bring the overhead closer to 1-2ms.

Of course, the RIO 2 can also provide a significant benefit here; we’ve seen a ~20-30% decrease in all of these times based on our testing. However, we consider that improvement more of a “bonus”, and plan to fully support logging on the RIO 1.

3 Likes

Nice work! FYI, I’ve been working recently on adding similar functionality to WPILib (the logging side of it, not the replay back into robot code).

10 Likes

This has some great stuff in it. We have been fiddling around with data-centric frameworks for a couple of years now. Robot telemetry analysis enables a whole mess of non-robot data-centric mentoring and exploration for robotics teams with many programmers (and graphics design students who also like data). So it’s something that can have a long-lasting impact using real data tools like Alteryx, Data Aiku, Tableau (if all you care about is visualization), SAP, etc.

One way to significantly reduce overhead is to embed a data layer directly into the robot’s logic*, and then save that data layer every time. I want to say that our average 2020 CPU usage was 60% during a live match. It takes a bit of explaining to get students on board with it, but it’s very worth it. We dump this to CSV and to NetworkTables. To simplify this, we use a Java enum-backed framework pretty extensively.

Overall, the biggest issue we’ve had with the “log everything, every time” concept is CAN bandwidth. We would ideally capture voltages, currents, and thermals for every BLDC on every cycle, but we’d max out the CAN usage pretty quickly by doing that. So we still have to choose.

Your UI looks pretty amazing. We will probably figure out how to hook what we do into that UI, so stay tuned. WPILib Glass has been great so far, but some of the additional graph layering could be pretty helpful.

IMO, one of the ‘next’ frontiers in a logging UI is to take a timestamped ‘message’ that was put on a data stream like System.out, and create a cursor across all graphs when the message is hovered over. This would help visually associate any edge-case error messages with the raw data coming over at that time.

Cheers!

* Side Note

This data framework came from some literature I read about the Spirit & Opportunity rovers’ architectures. It allows NASA engineers to simulate everything about the rovers before updating their code. We starting fiddling with unit tests pretty extensively at the start of 2020, and will improve upon them in 2022.

4 Likes

It’s fascinating to see all the various approaches to this similar set of problems. Fundamentally, the concept of separating hardware interaction into a separate layer is very similar to the structure we’re using. I’m intrigued by the idea of using enums to define all of the interaction with that layer (including output commands) - it’s a very clean way to structure a system like this.

Our approach more closely matches the way fields are defined in NT, where everything relies on string keys and slashes to denote subtables. There’s certainly some performance hit with a system like that, but overall we felt that the easy of use (particularly to newer students) was worth the tradeoff. The same thing applies to our use of the command-based framework; we felt that building logging as an extension of existing structures would make it easier to get students up to speed.

As for visualization, Advantage Scope is currently built around our binary format for logs but could probably be adapted to read a CSV if necessary (or another tool could convert between formats). Keep us updated, and we’re happy to help where we can.

Better management of “messages” like stdout is an area we might explore in the future; right now, our plan is to use the built in DS logs along with replay where necessary. Advantage Scope already supports sharing the selected and hovered timestamps between visualizations (like showing the odometry position while hovering over a line graph), so with the right data available your suggestion is probably pretty feasible. I’m definitely a fan of that idea.

Great work!

6 Likes

This rules. The replay stuff is super neat. Thanks for this!

3 Likes

Important PSA for AdvantageKit users!

We’ve identified a critical issue in AdvantageKit v1.2.0 and earlier which sometimes causes the main robot program to hang. This has been fixed in the latest version. Please update as soon as possible; there are no breaking changes in the latest update as of this writing.

The easiest way to update is to delete the AdvantageKit.json file from the vendordeps folder, then reinstall the library with the URL below:

https://github.com/Mechanical-Advantage/AdvantageKit/releases/latest/download/AdvantageKit.json
4 Likes

Another important PSA for AdvantageKit users!

Downloading AdvantageKit from GitHub requires an access token per this issue. The installation instructions included a token for this purpose. Unfortunately, GitHub’s bots will revoke the token if it’s found in any commit (this is why the installation instructions included it as two concatenated strings). The current token was committed as plain text and has been revoked, meaning you will no longer be able to build a project using AdvantageKit once Gradle’s local cache expires.

The fix is simple; we’ve generated a new token, so just follow the instructions here to update build.gradle with the new value:

repositories {
    maven {
        url = uri("https://maven.pkg.github.com/Mechanical-Advantage/AdvantageKit")
        credentials {
            username = "Mechanical-Advantage-Bot"
            password = "\u0067\u0068\u0070\u005f\u006e\u0056\u0051\u006a\u0055\u004f\u004c\u0061\u0079\u0066\u006e\u0078\u006e\u0037\u0051\u0049\u0054\u0042\u0032\u004c\u004a\u006d\u0055\u0070\u0073\u0031\u006d\u0037\u004c\u005a\u0030\u0076\u0062\u0070\u0063\u0051"
        }
    }
}

This time, we’ve gone the extra step of obfuscating the string itself rather than just splitting it in two. To be clear, DO NOT INCLUDE THE PLAIN TEXT TOKEN IN ANY GITHUB REPOSITORY.

Another reminder — please see my previous post about a critical issue in AdvantageKit v1.2.0 and earlier which can cause the main robot program to hang. Please update ASAP if you have not done so already.

Thank you to everyone who has been using AdvantageKit. As always, we’re happy to answer any questions.

3 Likes

Hey Jonah,

We just implemented AdvantageKit in our codebase this week, and we have found the logging capabilities very useful for debugging issues. Unfortunately, today, we found that our NavX returns no data when our code is built on top of AdvantageKit.

Initially we thought this was a hardware issue, but when we run our old (pre-AdvantageKit) code the gyro returns data normally.

We also moved our gyro code to Robot.kt to make sure nothing weird was happening with the IO classes. This still caused the gyro to return no data and return isConnected as false.

Also, everything else in our IO/IOReal classes works perfectly fine (shooter spins up, intake works, driving works except for field oriented stuff because the gyro doesn’t work).

We noticed that 6238 uses the NavX so I’m not really sure what could be going on, but the same gyro code that worked pre-AdvantageKit is not working for us anymore so :man_shrugging: We are trying to compare your code to ours but so far we have not found any differences in how we are implementing the gyro.

Do you have any further insights?

Our code is here: https://github.com/team4099/RapidReact-2022/tree/newauto
A log file for this behavior is here: https://drive.google.com/file/d/1415iSx5aKMKKqbIq2oyoH-2bwi4Af7xo/view?usp=sharing

I was able to recreate the issue by running your code on our robot (which has a navX connected in the same way). It seems like this is a Kotlin vs. Java issue, which is why we didn’t catch it during our testing. I’m not an expert on Kotlin, but it appears that RobotContainer and all of the subsystems (including the navX) are being instantiated before robotInit is run. In our code, RobotContainer is instantiated after AdvantageKit is set up. Once the AHRS object is created, the navX thread makes repeated calls to retrieve the FPGA timestamp. Ultimately, this goes through our WPILib shims and the timestamp is provided by AdvantageKit. While it properly handles those method calls before and after running logger.start(), there’s an edge case during initialization that throws a NullPointException.

I’ve released AdvantageKit v1.7.2, which should fix the issue. Let me know if that works for you.

3 Likes

Thank you so much! We will try the new version of AdvantageKit tomorrow and get back to you, though I imagine if you tested our code on your robot and fixed the issue that way there’s no reason it wouldn’t work for us.

For reference, your understanding of how Kotlin handled the creation of our RobotContainer matches mine: Object expressions and declarations | Kotlin

Essentially Kotlin’s equivalent to Java’s singleton classes is object declarations, and that is how we defined our RobotContainer. These declared objects are instatiated when first referenced – which for us was in the constructor for Robot. There’s no reason for this to happen in the constructor, so we have also moved that instantiation to after the logger starts in robotInit.

Thanks again for your help and going above and beyond in debugging the issue, this will surely save us a ton of time.

Hey Jonah,

Team 2239 implemented AdvantageKit in our code earlier this year, and we’ve been loving the functionality it has given us so far. Through our first regional a month ago the framework has been working great for us. At our regional last week, however, our robot displayed the following issue in some of our qualification matches:

  • Autonomous would run exactly as it should
  • For the entirety of Teleop, our robot would not function at all (it was still enabled)

We attempted to replicate the issue in the pit many times by running the robot in practice mode, but our teleop code always worked fine in the pit and has never been an issue outside of this competition. Because the issue only occurred in real matches, we were changing a few things at a time to try as hard as we could to fix the issue, and it is hard to know exactly which change made everything work on the field. We are fairly sure, however, that adding our default drive command as an explicit function call in teleopPeriodic() did the trick (the only other changes, such as using a new Driver Station laptop, were not in the codebase). Somehow, this seemed to allow commands for all subsystems to run, not just the drivetrain.

Other complicating factors include that we were having significant issues with loop overrun and high CPU usage on the RoboRIO. We think that this issue is unlikely to have caused the problem, however, because these did not noticeably change between auto and teleop according to DriverStation logs and we don’t think these would cause a failure at such a specific point in the match (the autonomous-teleop transition).

Also, we are currently still using AdvantageKit v1.4.0. I now regret not updating this before our regional, but it seemed like a risky move when we had to spend a real qualification match testing every change we made.

Because we want to be sure that we have fixed the issue before our next competition, we are looking into ways to trick the robot code into thinking the FMS is connected in an attempt to replicate the issue off the field for easier debugging. The easiest way I can think of doing this is forking AdvantageKit and changing the DriverStation class to always report FMS connectivity and a match number. It seems, however, that the conduit component of AdvantageKit interacts with this data on a lower level before feeding it to the DriverStation class, so I’m curious whether we would need to change that.

Here is our robot code: Team 2239 2022 Robot Code

The code that we used successfully in the later qualification matches and in playoffs is in the branch testadvantagekit. The code that resulted in the issue can be achieved by commenting out this line from teleopPeriodic() on that same branch.

In the end, we are happy that we chanced on this fix and we were lucky to make it to playoffs and perform very well. We are interested in your help, however, in determining any way that we may be incorrectly using AdvantageKit that could cause this issue and helping to devise a method to replicate this issue off-field by faking an FMS connection. Thank you for putting so much work into this library because it has been a big help for us this season and has sped up development of many critical features.

1 Like

@jonahb55 Henry doesn’t know how to use Chief Delphi, so he forgot to @ you. Sorry!

Hi,
Looking through your code, I don’t see any immediate issues. I would definitely recommend updating AdvantageKit (the latest version is currently v1.7.3), since there are no breaking changes. I think it’s unlikely that AdvantageKit is the root cause of these issues, but I’ll address your concerns as best I can:

Does your robot use a CTRE PDP or a REV PDH? By default, AdvantageKit looks for the CTRE PDP on the default CAN ID, and it seems like it can cause loop overruns if it doesn’t connect properly. On the latest version, you can add a call like this to set the power distribution type and CAN ID.

You’re correct that AdvantageKit uses conduit to pull the DS data from a lower level (directly from the HAL). I think the easiest way to trick it into thinking you’re connected to an FMS is adding a call like this at the top of robotPeriodic:

LoggedDriverStation.getInstance().getDSData().fmsAttached = true;

This means that subsequent calls to DriverStation.isFMSAttached() will return true, though the data won’t be written to the log file. Feel free to give this a try, but it also doesn’t look like you’re calling this method anywhere in your code. The only other difference that I imagine the FMS could effect is if the timing of enabling and disabling is subtly different that the Driver Station’s practice mode.

Do you have any log files from the times this issue has occurred during matches? They might be useful to see if anything else odd was happening.

4 Likes