During (seemingly) random qualification matches (e.g. Qualification Match 72), our robot would lose connection to our driver station, and whatever action it was performing at the moment of the disconnect/crash would keep running. You can see this near the end of the match in Q76 2023 Pacific Northwest FIRST District Championship WIDE 50 - YouTube. According to our DS logs and our drive team, the robot disconnected and kept doing whatever movement it had been commanded at the moment it crashed or disconnected.
Just as a note: on our robot we log with the DS and with a tool called AdvantageKit, and we use AdvantageScope to view the AdvantageKit logs. Whenever the robot seemingly crashes, the AdvantageKit log stops at the crash while the DS log continues, which is normal and expected.
We looked at our driver station logs after the match, and after the other matches where we had issues, and the one thing that was consistent in each log was that right around where we lose connection and can no longer control the robot, our TeleOp mode stops while another mode (whose name I've forgotten; I'll reply with an update tomorrow once I get to the pits) keeps going. So we still maintain a connection to something even though we've lost control of the robot.
We're currently working with the Control System Advisors to troubleshoot the robot, and our main suspect is our code. Either something is outright crashing the robot when we do a particular action, or (I don't know how to explain it precisely) it's like a computer crash where the screen freezes but you still hear audio, and the only fix is restarting the computer. Not exactly the best analogy, but hopefully you get the idea.
We looked at how long each of our subsystems takes per loop, and one that takes a little too long is our Arm subsystem, at around 15-30 ms, so the robot throws a Loop Overrun error at least every couple of seconds. On that note: whenever the robot disconnects from our DS and we can no longer control it, the loop overrun errors stop appearing in the logs, which is probably self-explanatory since either the code has stopped or we can no longer receive errors.
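For what it's worth, here's a minimal sketch of how a subsystem's periodic time could be measured and published so it shows up alongside the loop overrun messages (the Arm class body and dashboard key here are just placeholders, not our actual code):

```java
import edu.wpi.first.wpilibj.Timer;
import edu.wpi.first.wpilibj.smartdashboard.SmartDashboard;
import edu.wpi.first.wpilibj2.command.SubsystemBase;

// Placeholder subsystem used only to illustrate timing the periodic body.
public class Arm extends SubsystemBase {
  @Override
  public void periodic() {
    double start = Timer.getFPGATimestamp();

    // ... the existing arm control / logging work would go here ...

    // Publish how long this subsystem spent in periodic(), in milliseconds.
    SmartDashboard.putNumber("Arm/PeriodicMs", (Timer.getFPGATimestamp() - start) * 1000.0);
  }
}
```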
As of right now I don't have access to our driver station logs since the laptop is in the pits, but as soon as I can get to it I'll reply to this thread with the log files from the matches where we had issues, plus a sample log from a good match.
And here's our (somewhat) updated robot code on GitHub: GitHub - 2BDetermined-7034/2023-Charged-Up at Final_Robot_Optim
Did this happen in Q72 (as you stated), or Q76 (as the linked match/video showed)? Assuming Q76 (which your team is in), was it the pirouette and overextension of the arm with ~15-20s left?
I’d definitely be interested in seeing the logs. Can you upload the AdvantageKit .wpilog file as well as the DS logs (both .dslog and .dsevents files)?
If the CSA needs further assistance, they can ask on slack. Just have them mention this thread so we don’t duplicate efforts any more than necessary.
Rio1s are consistently OOMing this year and causing crashes. Multiple teams here in NE have had to buy a new rio2 because they can’t guarantee a rio1 will survive a whole match. Multiple teams (including my own!) have OOMed a rio1 mid match.
I would expect an OOM situation to crash with a std::bad_alloc or OutOfMemoryError, not "freeze" and keep running the way OP described (though I can't say exactly what happened without seeing logs).
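If OP wants to rule memory out, a minimal sketch of watching heap usage from robotPeriodic() looks something like this (the dashboard keys are just placeholders):

```java
import edu.wpi.first.wpilibj.smartdashboard.SmartDashboard;

public class MemoryLogger {
  // Call from robotPeriodic() (or a slower Notifier) to watch for heap creep
  // that would eventually end in an OutOfMemoryError on a roboRIO 1.
  public static void logMemory() {
    Runtime rt = Runtime.getRuntime();
    long usedBytes = rt.totalMemory() - rt.freeMemory();
    SmartDashboard.putNumber("JVM/UsedMB", usedBytes / (1024.0 * 1024.0));
    SmartDashboard.putNumber("JVM/MaxMB", rt.maxMemory() / (1024.0 * 1024.0));
  }
}
```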
Are you guys using multithreading? We had this exact issue multiple times last year (see DCMP q56 at around 1:40: Q58 2022 Pacific Northwest FIRST District Championship WIDE - YouTube) and didn't get picked as a result. The robot just kept executing the latest command and drove into the wall while still connected to FMS, and we were only able to disable it through the e-stop.
We figured out later in the offseason that it was due to thread deadlocking in our intake subsystem (it would happen when we spammed the intake button too hard). I don’t really know what we did to fix it, but I could get my prog lead to describe it to me if that sounds like the problem you guys are having rn.
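For reference, the classic shape of that kind of deadlock looks something like the sketch below. This is illustrative only, not our actual intake code: two threads take the same two locks in opposite order, so if the button gets spammed fast enough both block forever while the RIO keeps running and FMS still sees a connected robot.

```java
public class DeadlockSketch {
  private final Object sensorLock = new Object();
  private final Object motorLock = new Object();

  // Runs on a background intake thread.
  public void intakeLoop() {
    synchronized (sensorLock) {
      synchronized (motorLock) {
        // read beam break, set roller output ...
      }
    }
  }

  // Runs on the main robot thread when the intake button is pressed.
  public void onIntakeButton() {
    synchronized (motorLock) {
      synchronized (sensorLock) {
        // stop rollers, clear state ...
      }
    }
  }
}
```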
It's currently a Rio 1; I'm told it runs lower-end hardware, so processing takes a bit longer, and the CAN network was newer technology at the time, so it's finicky.
We tried spamming our arm buttons to see if we could recreate the issue, but with the time left at the end of the competition day we were told to leave the pits. We may be able to try spamming the buttons again to see if that does anything.
The CSA also told us we were using way too much bandwidth on day 3, so we turned off our camera and Limelight streams, but that didn't appear to be the issue. Might be good to check how much you guys are sending over FMS though.
And something I forgot to add: we have around 16 devices on one CAN network (4 CTRE CANcoders, 8 Spark Maxes for our swerve, two Sparks for our indexer, two more for our intake, and I think around two more for our arm joints), and during matches, and even while idle, we're averaging around 60-90% CAN utilization.
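If it helps, here's a minimal sketch of logging the CAN bus stats WPILib exposes each loop, so any spikes line up with match events in the log (the dashboard keys are just placeholders):

```java
import edu.wpi.first.hal.can.CANStatus;
import edu.wpi.first.wpilibj.RobotController;
import edu.wpi.first.wpilibj.smartdashboard.SmartDashboard;

public class CanMonitor {
  // Call from robotPeriodic() to publish CAN bus health.
  public static void logCanStatus() {
    CANStatus status = RobotController.getCANStatus();
    SmartDashboard.putNumber("CAN/Utilization", status.percentBusUtilization);
    SmartDashboard.putNumber("CAN/BusOffCount", status.busOffCount);
    SmartDashboard.putNumber("CAN/TxFullCount", status.txFullCount);
    SmartDashboard.putNumber("CAN/ReceiveErrors", status.receiveErrorCount);
  }
}
```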
They didn't mention anything about our bandwidth. We're running our camera at minimal settings (10 fps at some resolution around 200 by 300), taking about 0.4 Mbps over FMS.
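A rough sketch of that kind of minimal camera setup (not our exact code; the resolution and FPS values here are just illustrative, assuming a USB camera through CameraServer):

```java
import edu.wpi.first.cameraserver.CameraServer;
import edu.wpi.first.cscore.UsbCamera;

public class CameraConfig {
  // Call once from robotInit().
  public static void startDriverCamera() {
    UsbCamera camera = CameraServer.startAutomaticCapture();
    camera.setResolution(320, 240); // low resolution to keep bandwidth down
    camera.setFPS(10);              // low frame rate to keep bandwidth down
  }
}
```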
And something else to mention: we disabled literally everything in our RobotContainer, and the camera alone took up 10% of our RIO's CPU, which is somewhat normal.
I don't know if CAN utilization is the problem, since we were maxing it out before optimizing with 12 Falcons + 2 Talons + 4 CANcoders, so you should be fine on that front.
Are you guys seeing any spikes in CPU usage? We had high CPU utilization while idling (~60%) and spiked to 90-something randomly. It's probably not the problem either, but it's something to check.
If I remember correctly, we hover around 70% during matches and sometimes drop lower (I don't remember where exactly), but we pretty much pegged the CPU at 100% when we crashed. Don't quote me on that, though; I'll post our logs when I get the chance.
Man, that sounds tough and we really feel for y'all ): What we did before day 3 that seemed to fix the issue was: 1. reimage the roboRIO; 2. replace the Ethernet cable; 3. replace the radio; 4. drive really conservatively. Those precautions may not have done anything, but I'll put it out there as a last-ditch attempt if you guys can't figure anything else out.
If you guys need anything else, please come by our pit tomorrow at any time. Besides that, there isn't much else I can add, but I really hope you get it fixed and wish y'all the best of luck tomorrow at quals/playoffs!
This was almost certainly at the behest of the FTA on the field. The radios will limit you (ish) to the 4 Mbps the rules allow, so there's no way to use "way too much." That said, multiple teams using 3-4 Mbps in the same match, especially in a moderately congested Wi-Fi environment (venue Wi-Fi, team/spectator hotspots, etc.), can affect the field and other robots.
So just in case the reasoning wasn’t fully explained to y’all, that’s likely the core problem there.
I definitely see some things that can be optimized as far as memory allocations go. For example, in the Arm subsystem's updateLogging method, I see 5 separate calls to getCurrentState(), each of which creates a temporary ArmState object with 6 doubles (and also does some calculations). There are 2 more in the periodic method. For those, you can easily call getCurrentState() once in periodic, save the result, and pass it to updateLogging to reuse. I'd recommend going through your entire codebase, looking at each place you call new, and seeing how you can reduce the number of places and times new objects are allocated.
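A sketch of what that refactor could look like inside the Arm subsystem, using the method names above (the ArmState accessors and log keys here are assumptions, not your actual code):

```java
import org.littletonrobotics.junction.Logger;

// Inside the existing Arm subsystem.

@Override
public void periodic() {
  // One allocation per loop instead of seven scattered getCurrentState() calls.
  ArmState state = getCurrentState();

  updateLogging(state);
  // ... the rest of periodic() reuses the same `state` object ...
}

private void updateLogging(ArmState state) {
  // Log from the cached state rather than calling getCurrentState() again.
  // Accessor names and keys below are hypothetical examples.
  Logger.getInstance().recordOutput("Arm/ShoulderAngleRad", state.getShoulderAngle());
  Logger.getInstance().recordOutput("Arm/ElbowAngleRad", state.getElbowAngle());
  // ... etc.
}
```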
You're using WPILib 2023.4.3 but AdvantageKit 2.2.1, which was based on WPILib 2023.4.2. I know that AdvantageKit changes WPILib internals, so I think the versions need to match. You should update to AdvantageKit 2.2.4.
I'd also check that none of the CAN calls are blocking. Looking at the Spark Max documentation for periodic status frames, you can see what data is reported periodically. Some of the data you're requesting isn't in there. Since the implementation details are hidden, I can't tell whether that other data is calculated on the roboRIO without blocking, or whether it's requested from the Spark Max and may block until it's returned.
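As a sketch of what adjusting those status frames can look like with REVLib (the device ID and the periods below are made up, and the indexer is just an example of a controller whose sensor data isn't read every loop):

```java
import com.revrobotics.CANSparkMax;
import com.revrobotics.CANSparkMaxLowLevel.MotorType;
import com.revrobotics.CANSparkMaxLowLevel.PeriodicFrame;

public class IndexerConfig {
  // Slowing the status frames reduces CAN traffic; data read in periodic()
  // then comes from these cached frames rather than a fresh request.
  public static CANSparkMax configureIndexer() {
    CANSparkMax indexer = new CANSparkMax(30, MotorType.kBrushless); // CAN ID 30 is made up
    indexer.setPeriodicFramePeriod(PeriodicFrame.kStatus1, 100); // velocity/current/temp every 100 ms
    indexer.setPeriodicFramePeriod(PeriodicFrame.kStatus2, 500); // position every 500 ms
    return indexer;
  }
}
```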