100% CPU usage and double timeout bug
Hello Everyone!
I'm the programming lead for the TechnoKats, and we've been having some interesting issues that we'd like to share, to see whether anyone else has run into something similar.
Today, after our practice match, we took to the practice field, where we had large delays and weird results from our PID loops. This included our arm PID oscillating as if we had set the PID wrong. The first thing we did when we got back to the pit was tether the robot and try to replicate the problem. Not only could we not replicate it, but we could not find any issues with communications or dropped packets in our Driver Station logs.
So we went into debug mode. Where is this mysterious bug, and what could be causing it? We then noticed we had been pushing 100% CPU on the cRIO without even running teleop. (This is odd because all of our vision processing is done off-board, on the driver station.) We then checked all of our code and loops for delays. After not finding any, we asked NI for help. They continued the search, and we ran into dead end after dead end.
We started disabling code to find the erroneous code, and we had started to narrow it down when we ran into the double timeout bug. The double timeout bug (as it has been explained to me) is where the cRIO thinks it's connected to something but isn't. That causes it to do weird things, like not letting you deploy code. The only way to fix it (that I've found) is to reboot the robot. If it continues, turn on the "No App" switch on the cRIO and deploy code again, then restart the cRIO with the No App switch off. This should solve the problem, at least for a while.
While we were testing for the code that was flooring the CPU, we decided just to upload the original code and run the profiler, and that's where the weirdest thing happened: the cRIO stopped being floored. It cut the usage in half without a single change.
If anyone else has had a similar issue, please let us know so we can fix this properly. Thanks!
Re: 100% CPU usage and double timeout bug
I'm not sure about the double timeout as a cause, but I've seen the effect you've described this year. After some optimization, our code uses about 50% of available CPU, but we still need to use the No-App switch to deploy new code, or we time out during deploy:
1) Flip switch to No-App
2) Software reboot robot
3) Deploy built LabVIEW project
4) Flip switch away from No-App
5) Software reboot robot
I've also seen intermittent instances where for some runs the machine will be pegged to 100% CPU, but the issue will not repro once we reboot the machine.
Re: 100% CPU usage and double timeout bug
We had a similar issue, which turned out to be caused by a loop freewheeling. Because it was our main drive loop, we didn't see any difference in behavior, but we couldn't deploy while it was running...
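For readers who don't work in LabVIEW, the "freewheeling" loop mentioned above can be sketched in Python. This is only an illustration under assumptions: `run_loop` and its parameters are hypothetical names, and the 20 ms period is an arbitrary example value, not anything taken from the cRIO framework. A loop with no wait runs as fast as the CPU allows and starves everything else (including a deploy); a paced loop sleeps out the remainder of each period.

```python
import time


def run_loop(iterations, period_s=0.0, work=lambda: None):
    """Run `work` a fixed number of times and return the total wall time.

    period_s == 0 models a freewheeling loop (no wait at all).
    period_s > 0 models a paced loop that sleeps until the next period.
    """
    start = time.monotonic()
    next_deadline = start
    for _ in range(iterations):
        work()
        if period_s > 0:
            next_deadline += period_s
            remaining = next_deadline - time.monotonic()
            if remaining > 0:
                # Give the CPU back to other tasks until the next period.
                time.sleep(remaining)
    return time.monotonic() - start
```

Calling `run_loop(50)` returns almost immediately while burning a full core for as long as it runs, whereas `run_loop(50, period_s=0.02)` takes about a second and is idle most of that time. In LabVIEW terms, the paced version corresponds to putting a wait inside the loop.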
Re: 100% CPU usage and double timeout bug
I am not at all a LabVIEW programmer, but we had a similar issue of the CPU being pegged and weird things happening. It turned out that certain "tasks" (events? what do you call them?) were being executed in pseudo-random order in the loop because they were not wired together, as I suppose they needed to be, though it was nowhere documented that they had to be. I believe part of the issue was the way we had implemented the compressor code or library.
Sorry for the vague description. That's the best I have from my student. He knows what he's doing but can't communicate it at a level I can understand and re-communicate. This is why I'm the design and mechanical guy. :/
Re: 100% CPU usage and double timeout bug
I saw the issue with deploying with a few teams this year. I'll probably be disabling the safety config in the disabled code next year. We are also asking the RT team to make the deploy more aggressive. Currently, it is too easy for a busy cRIO to take a long time to do the deploy.
It isn't clear why this is called the double timeout, or how the deploy is related to the excessive CPU usage. Were there diagnostic errors due to timing or other issues? That seems like a possible reason for the CPU usage to be high.
Greg McKaskle
Re: 100% CPU usage and double timeout bug
Quote:
Our robot returned from St. Louis less one of our 10 CAN Jaguars, because our bridge-tipper was removed for shipment after CMP. As we don't have a bridge in our lab, we felt it was OK to leave the manipulator and its controlling Jaguar off.
We fired it up several times this summer and kept getting the watchdog error, accompanied by shuddering, where all motors would momentarily switch off and then right back on. We immediately thought that the code was waiting for a reply from the non-existent Jaguar, so we drew disable blocks over the related code in begin.vi and our timed_task.vi. The behavior did not change.
Returning to the above quote, our understanding was that drawing the disable blocks removed the underlying code from the compilation step. Is there still code (like the safety system) that gets compiled in, even if it is "disabled"?
Not sure if it makes a difference, but we are using the original 8-port cRIO, and we occasionally find it temperamental when deploying code or re-imaging.
-- Len
Re: 100% CPU usage and double timeout bug
The disable structure does what you think. It disables the code as if it were deleted. The wires leaving the structure have the default value or whatever you wire up to the Enabled frame.
The comment about safety being disabled was referring to a simple modification to the default Disabled.vi to ensure that it doesn't cause safety errors.
As for the potential cause of your error: does the CAN topology make sense with that one disabled? Also make sure that the CAN connections, cables, and terminator are good. Check to see if there are errors in the Diagnostics Message box, and potentially add the Elapsed Times VI to the loop that you think may be running too slowly to update the RobotDrive often enough.
Greg McKaskle
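The loop-timing check suggested above can be sketched in Python for anyone who wants to see the idea outside LabVIEW. This is not the Elapsed Times VI itself, just a minimal stand-in: `LoopTimer`, `mark`, and the 0.1 s budget are hypothetical names and values chosen for illustration. It records the time between successive loop iterations and counts how many exceeded the budget.

```python
import time


class LoopTimer:
    """Track the time between successive calls to mark()."""

    def __init__(self, budget_s):
        self.budget_s = budget_s  # longest acceptable gap between iterations
        self.last = None          # timestamp of the previous mark()
        self.worst = 0.0          # longest gap seen so far
        self.overruns = 0         # number of gaps longer than budget_s

    def mark(self, now=None):
        """Call once per loop iteration; returns the gap since the last call."""
        now = time.monotonic() if now is None else now
        elapsed = None
        if self.last is not None:
            elapsed = now - self.last
            self.worst = max(self.worst, elapsed)
            if elapsed > self.budget_s:
                self.overruns += 1
        self.last = now
        return elapsed
```

Calling `timer.mark()` at the top of the suspect loop and inspecting `timer.worst` and `timer.overruns` afterwards shows whether the loop ever ran slower than the chosen budget. The optional `now` argument exists only so the class can be exercised with fixed timestamps.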
Re: 100% CPU usage and double timeout bug
CAN is working great. We are using a star topology, and termination is working. 2CAN reports no errors. We'll be looking at inserting elapsed times next week when we get back into the lab, after school starts.
The weird thing is that everything was "working great" after CMP for a few local demos. At least that is what our students maintain. This summer, there were no on-board code changes. We did change our custom dashboard to disable target seeking for our turret. Now it looks like our robot has epileptic fits. The packet structure out of the dashboard is the same, and we can switch over to the prior seeking code or our original calibration mode. Everything works, just with the watchdog error.
This led us to chase possible intermittent connection causes. We swapped out Ethernet cables and checked every cable connection we could think of. We had some prior problems with cable retention on one port of the D-Link switch, but this was not the problem. Not finding problems with any cables, we're back to searching for potential software issues.
I didn't even think to check for code in the disable.vi. I know that we reference and close that missing Jaguar in the disable.vi. Do you think that could be triggering the problem?
-- Len
Re: 100% CPU usage and double timeout bug
I reread the post with less bleary eyes and noticed that you said you received a watchdog error. The disable topic was about Safety Config, so I had them confused.
Watchdog is not on by default, and Safety is enabled only for the RobotDrive. Please determine which is on, try turning it off and verify that the symptoms change. If the jerking is being caused by WD or Safety, then it means you are missing deadlines. It may also mean you have a workaround.
Assuming you are missing deadlines, I'd verify you have no errors in the Diagnostics panel, as the current mechanism for catching errors and shuttling them to the window is quite heavy and can cause you to miss deadlines. If a missing jag is still being referenced, or a disable structure causes a wire to be bad and causes errors, ...
The disable issue is problematic to the original thread because most robots are in disable when they are being reimaged or reprogrammed. If disabled robots are throwing errors they take longer to respond and sometimes require a NoApp switch or similar. Making the disable code less CPU intensive due to errors seems like it will resolve many of these issues. I don't think disable has any impact in your robot's twitch.
Greg McKaskle
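To show why a missed deadline looks like motors cutting out and coming straight back, here is a small Python sketch of a safety timeout. It is a simplified model under assumptions, not the actual Safety Config or Watchdog implementation: `SafeMotor`, its methods, and the 0.1 s timeout are hypothetical and chosen only for illustration.

```python
class SafeMotor:
    """Motor output that drops to zero when it is not refreshed in time."""

    def __init__(self, timeout_s=0.1):
        self.timeout_s = timeout_s  # assumed value, for illustration only
        self._output = 0.0
        self._last_update = None

    def set(self, value, now):
        """Called by the control loop; each call refreshes the deadline."""
        self._output = value
        self._last_update = now

    def output(self, now):
        """Value actually applied to the motor at time `now`."""
        stale = (
            self._last_update is None
            or now - self._last_update > self.timeout_s
        )
        return 0.0 if stale else self._output
```

If the control loop calls `set` every 20 ms, `output` always returns the commanded value. If one iteration is delayed past the timeout (for example by heavy error reporting), `output` returns 0.0 until the next `set`, which would show up as a brief twitch.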
Re: 100% CPU usage and double timeout bug
The closest thing I've seen to what you describe is when we deploy code from a given computer while tethered and then, while the code is running, disconnect the tether. If you reconnect the tether cable, you will be unable to redeploy code until you reboot the cRIO or power your robot down then up.
I cannot say that I've ever seen a situation exactly like what you describe. Once the cRIO has timed out, it won't do anything for us until we reboot it.
Regarding disabled mode and not deploying, we ran into the same problems with problematic deploys ALL the time last year (2011). What we found was that because we had every sensor, data accumulator, etc. enabled in disabled mode (including vision) so that we could debug, it pushed the CPU too high to do a successful deploy. We now encase all of our disabled code in a single If/Then case connected to a button on the front panel. The default position of that button is OFF, so when we deploy to the robot permanently, none of the extraneous sensor stuff runs in disabled mode. When we temporarily deploy from the programming computer, we turn the button on when we need the data to tune things, then turn it back off to deploy so there is no code running in disabled mode.
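The button-gated disabled code described above might look roughly like this in a text language. This is a sketch of the idea only: `DisabledMode`, `debug_enabled`, and `read_sensors` are hypothetical names, and the real version is a LabVIEW case structure wired to a front-panel button.

```python
class DisabledMode:
    """Disabled-mode handler whose diagnostics are off unless requested."""

    def __init__(self, debug_enabled=False):
        # Defaults to OFF so a permanent deploy runs nothing extra.
        self.debug_enabled = debug_enabled
        self.samples = []

    def periodic(self, read_sensors):
        """Run once per disabled-mode iteration.

        `read_sensors` is any callable returning the current readings;
        it is only invoked while debugging is switched on.
        """
        if not self.debug_enabled:
            return False
        self.samples.append(read_sensors())
        return True
```

With the flag left at its default, `periodic` returns immediately and the sensor code never runs, which keeps the CPU free for a deploy. Setting `debug_enabled = True` while tuning turns the data collection back on.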
Re: 100% CPU usage and double timeout bug
I'm a little late in posting about this, but here's a follow-up on our issue.
Our main programmer deleted all of the code related to the no-longer-present CAN Jaguar. I verified that this code had been covered by Disable blocks but, in some cases, still had wires coming or going. All of the problems and all of the watchdog errors we were having went away!
Now our diagnostic log on the DS is completely empty, except for a new message that references the watchdog. Again, we are not in any way calling any watchdog functions in our team code. The new solitary error code is as follows:
Watchdog Expiration: System 1, User 0
Because the robot is now working great, I'm of the opinion that this error doesn't matter. At the same time, I am a little concerned, because we shouldn't have any error for a function we aren't calling.
-- Len