Today I’ve been trying to debug a very weird error. Sometimes when I deploy or restart robot code, nothing in the CTRE library works and I get: ERROR -200 CTR: No new response to update signal ....
Today I tried to reproduce the problem and was able to do it easily. I found that it works on every second restart (it works on the restart after the one where it stopped working). I’m still trying to figure out why my robot code isn’t shutting down cleanly (I can’t reproduce this on an empty project), and I will post back here once I figure out why my robot code has to be forcibly shut down.
Here’s an example of what the console shows:
* Program running perfectly*
Killing previously running FRC program...
FRC pid XXXX did not die within 500ms. Force killing with kill -9
********* Robot program starting **********
*CTR errors*
then when we restart:
*CTR errors from previous program instance*
[phoenix] Library shutdown cleanly
[phoenix-diagnostics] Server shutdown cleanly. (dur:110|0)
********* Robot program starting **********
*No errors*
Is there a way to work around my robot code not shutting down within 500ms? When the code is running, it works perfectly, but when we need to restart it, we have to restart it twice for the CTRE library to work.
Once I have more information about why the robot code isn’t shutting down normally, I’ll post back here. I’m currently using a different project setup than most teams, and I want to be sure the code not shutting down isn’t my fault. It’s possible that the Phoenix library itself isn’t shutting down within 500ms, but I still have some debugging to do before I can confirm that.
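To see what a process is still doing when it won’t exit, one option is to dump every live thread with its state and top stack frames. ThreadDump below is a hypothetical helper, not part of WPILib or Phoenix, and the shutdown-hook usage assumes the first termination attempt lets JVM shutdown hooks run, which may not be true if the process is stuck in native code.

```java
import java.util.Map;

/** Hypothetical debugging helper: dumps every live thread with its state and top stack frames. */
public final class ThreadDump {
    private ThreadDump() {}

    /** Builds a report of all live threads; maxFrames limits how much of each stack is shown. */
    public static String report(int maxFrames) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append(t.getName())
              .append(t.isDaemon() ? " (daemon) " : " ")
              .append(t.getState())
              .append('\n');
            StackTraceElement[] stack = e.getValue();
            for (int i = 0; i < Math.min(maxFrames, stack.length); i++) {
                sb.append("    at ").append(stack[i]).append('\n');
            }
        }
        return sb.toString();
    }

    /** Registers a JVM shutdown hook that prints the report when the JVM begins shutting down. */
    public static void installShutdownHook(int maxFrames) {
        Runtime.getRuntime().addShutdownHook(new Thread(
            () -> System.out.println(report(maxFrames)), "thread-dump-hook"));
    }
}
```

Calling ThreadDump.installShutdownHook(8) early in main, before RobotBase.startRobot(), would print the dump to the console when shutdown begins. A thread shown as BLOCKED, or sitting in a native call or a sensor read, is a reasonable place to start looking.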
@ozrien I know you said to email support, and I might do that later today if I don’t have luck here.
Sometimes after deploying the robot code, it can go through a few "Restart Robot Code" cycles without erroring out, but once it does error out, the pattern starts: the instance that runs without errors is always killed with -9, and the next instance doesn’t work because Phoenix was never deinitialized.
I think my code might be doing something that keeps it from ending correctly (which is why it has to be killed with -9), but when CTRE doesn’t deinitialize correctly, isn’t that a bug you could fix?
I really appreciate you guys looking into this. I agree that our project is unique, but I thought I was doing everything on the main thread. I’ll check later to make sure it actually is; although our project is unique, I try to keep it similar to a standard project so issues like this don’t happen.
Omar suspected threading based on the use of Runnable in several places, though I’m not sure you’re actually starting those or just manually calling run() - we didn’t dig that deep into the project.
In any case, the issue was still reproducible after we removed Phoenix from the project completely. Every time your project is deployed, the running instance fails to end and gradleRIO has to do a kill -9 to end the process.
Very repeatably, if the deployed project is fully functioning, the next deploy leaves CAN non-functional. Even a generic CAN send fails in this case (with or without Phoenix in the project).
After digging deeper, it looks like in these cases something is preventing the roboRIO’s CAN interface from being loaded properly by WPILib. It’s likely related to your project needing to be killed each time, but I can’t say for certain - that’s a question for the NI/WPILib folks.
Though this isn’t specifically a Phoenix issue, we’ve discussed some ideas on how to detect this state and potentially work around it. I can’t say if/when/how this might happen, however, so the root issue still needs to be addressed.
I advise working on getting your project into a state where it doesn’t need to be forcibly killed for a deploy. The layers of abstraction you have for simulation seem somewhat non-standard, so I might start there. Make a blank robot project using the standard WPILib template and start bringing in sections of your code one at a time (without using separate libraries like your current setup). At some point the problem will reproduce (which should identify what in your code is causing the issue) or it won’t reproduce at all which will point to something in your build/setup that’s causing the issue.
Thank you for all the testing you guys did. I didn’t realize this was affecting the entire CAN bus; the only motor controllers we use are CTRE. I’ll keep debugging the problem with the steps you suggested, and hopefully I’ll find the source.
One of the big things we’ve seen cause this is catching and ignoring thread interrupted exceptions. If you do that, your old program won’t exit, and the entire HAL will not initialize correctly, which causes CAN initialization to fail as well.
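The pattern described above usually looks like an empty catch (InterruptedException e) {} inside a loop. A minimal sketch of the common alternative, assuming a periodic worker built on Thread.sleep (InterruptAwareLoop is a hypothetical name), is to restore the interrupt flag so the loop can see it and exit.

```java
/** Sketch of a polling loop that exits when interrupted instead of swallowing the interrupt. */
public final class InterruptAwareLoop implements Runnable {
    private final long periodMs;
    private volatile int iterations;  // written only by the loop thread

    public InterruptAwareLoop(long periodMs) {
        this.periodMs = periodMs;
    }

    public int iterations() {
        return iterations;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            iterations++;  // placeholder for the periodic work
            try {
                Thread.sleep(periodMs);
            } catch (InterruptedException e) {
                // Anti-pattern: an empty catch here leaves the interrupt flag cleared
                // (sleep clears it when it throws), so the loop would keep running forever.
                // Restoring the flag lets the while condition see it and end the loop.
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

The same idea applies to any blocking call that throws InterruptedException: either restore the flag or let the exception propagate, but don’t discard it.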
This probably isn’t the best comparison, but I’ve tried many times to reproduce this in simulateJava with no luck, so it’s not as if my program fails to end in general. One thing that could have an effect is that when the CTRE devs and I tested the code, we didn’t have anything plugged into the I2C port. I assume the I2C code has some sort of timeout if it can’t get any messages through. I doubt it blocks for over 500ms, but I will try disabling it tonight when I have a roboRIO to test with.
While we do plan to use one additional thread, it will be a daemon thread, and I had it disabled while testing. So I don’t think threading had anything to do with it, since our project still uses the default WPILib initialization (RobotBase.startRobot()).
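For reference, a daemon thread in Java is just a thread marked with setDaemon(true) before it starts. DaemonWorker below is a hypothetical wrapper showing that step.

```java
/** Sketch: start a background worker as a daemon thread so it cannot keep the JVM alive on exit. */
public final class DaemonWorker {
    private DaemonWorker() {}

    public static Thread start(Runnable task, String name) {
        Thread t = new Thread(task, name);
        t.setDaemon(true);  // must be set before start(); afterwards it throws IllegalThreadStateException
        t.start();
        return t;
    }
}
```

The JVM does not wait for daemon threads when it exits, but it also does not interrupt them or run their finally blocks, so any cleanup that must happen (closing a device, for example) should not live only in a daemon thread.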
If I can’t figure this out, the next thing I’m gonna try is purposefully crashing the robot code if the CAN doesn’t initialize correctly. I just have to figure out the best way to detect the CAN not initializing correctly.
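One way to structure the "crash on purpose" idea is a fail-fast check at startup that polls some health condition and throws if it never passes. This is a sketch under assumptions: StartupCheck is hypothetical, and what the condition should actually check (for example, whether a CTRE call returns an error like the -200 above) is left to the robot code, since this thread doesn’t establish a reliable signal for "CAN didn’t initialize".

```java
import java.util.function.BooleanSupplier;

/** Sketch of a fail-fast startup check; the health condition itself is a hypothetical placeholder. */
public final class StartupCheck {
    private StartupCheck() {}

    /**
     * Polls a health check until it passes or the timeout expires.
     * Throws IllegalStateException on timeout so that, if uncaught, the program stops
     * instead of running with a dead bus.
     */
    public static void requireWithin(String what, BooleanSupplier healthy, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!healthy.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {  // overflow-safe nanoTime comparison
                throw new IllegalStateException(what + " not ready after " + timeoutMs + " ms");
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // preserve the interrupt, then give up
                throw new IllegalStateException("interrupted while waiting for " + what, e);
            }
        }
    }
}
```

A call such as StartupCheck.requireWithin("CAN bus", () -> canLooksHealthy(), 2000, 50) in robotInit, where canLooksHealthy is a hypothetical method you would write, would end the program if the check never passes, assuming nothing above it catches the exception.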
Hopefully the problem is reproduced when I use the default project structure later tonight.
After removing calls to an unconnected I2C device (our gyro), the program hasn’t had to be killed forcefully. I haven’t done any further debugging, so I don’t really have a guess as to why this is happening.
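If reads from the unconnected gyro are what stall the program (so far the only evidence is that removing them helped), one hypothetical mitigation is to bound how long the caller waits for a read. BoundedRead below is an illustrative sketch, not a WPILib API, and it assumes the sensor read is safe to call from a separate thread, which you would need to confirm for your device.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch: run a possibly blocking sensor read with an upper bound on how long the caller waits. */
public final class BoundedRead<T> implements AutoCloseable {
    private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "bounded-read");
        t.setDaemon(true);  // a stuck read must not keep the JVM alive
        return t;
    });

    /** Returns the value, or empty if the read failed or did not finish within timeoutMs. */
    public Optional<T> read(Callable<T> blockingRead, long timeoutMs) {
        Future<T> future = executor.submit(blockingRead);
        try {
            return Optional.ofNullable(future.get(timeoutMs, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true);  // interrupts the worker; a blocked native call may ignore this
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty();  // the read itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }

    @Override
    public void close() {
        executor.shutdownNow();
    }
}
```

Because there is a single worker thread, a read that never returns leaves later reads queued behind it, so they will all time out; that is arguably the right behavior for a missing sensor, but it does mean the caller keeps running while the gyro reports nothing.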
I’ll post more updates if this problem arises again when we have our gyro connected.
Update: A few days ago, when we were getting ready for competition (RIP competitions right now), this problem came up a few times after deploying robot code, so it is not solved. I’m really not sure why our robot program is taking so long to end, especially because our program is very responsive. I hope this gets fixed next year. I’m also surprised no other team has had this problem; if it’s just us, maybe it isn’t something that needs to be fixed.