100% CPU usage and double timeout bug

Hello Everyone!
I’m the programming lead for the TechnoKats, and we’ve run into some interesting issues that we’d like to share, in case anyone else has been seeing something similar.

Today, after our practice match, we took to the practice field, where we had large delays and weird results from our PID loops, including our arm PID oscillating as if we had set the gains wrong. The first thing we did when we got back to the pit was tether the robot and try to replicate the problem. Here, we found that not only could we not replicate the problem, but we could not find any issues with communications or dropped packets in our driver station logs.

Now we go into debug mode: where is this mysterious bug and what could be causing it? We noticed we had been pushing 100% CPU on the cRIO without even running teleop. (This is odd because all of our vision processing is done off-board, on the driver station.) We then checked all of our code and loops for missing delays. After not finding any, we asked NI for help. They continued the search, and we ran into dead end after dead end. We started disabling code to find the erroneous code, and we had started to narrow it down when we ran into the double timeout bug.

The double timeout bug (as it has been explained to me) is where the cRIO thinks it’s connected to something but isn’t. That causes it to do weird things like not letting you deploy code. The only way to fix it that I’ve found is to reboot the robot. If it continues, turn on the “No App” switch on the cRIO and deploy code again, then restart the cRIO without the No App switch. This should solve the problem, at least for a while.

So as we were testing for the code that was flooring the CPU, we decided just to upload the original code and run the profiler, and that’s where the weirdest thing happened. The cRIO stopped being floored. The CPU usage was cut in half without a single change. If anyone else has had a similar issue, please let us know so we can fix this properly.
Thanks!

I’m not sure about the double timeout as a cause, but I’ve seen the effect you’ve described this year. After some optimization, our code uses about 50% of available CPU, but we still need to use the No-App switch to deploy new code, or we time out during deploy:

  1. Flip switch to No-App
  2. Software reboot robot
  3. Deploy built LabVIEW project
  4. Flip switch away from No-App
  5. Software reboot robot

I’ve also seen intermittent instances where for some runs the machine will be pegged at 100% CPU, but the issue will not reproduce once we reboot the machine.

We had a similar issue, which turned out to be caused by a loop freewheeling. Because it was our main drive loop, we didn’t see any difference in behavior, but we couldn’t deploy while it was running…
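For anyone more comfortable reading text code, here’s a rough sketch of what a freewheeling loop looks like, written in cRIO-era WPILib Java with made-up names; the LabVIEW equivalent is a While Loop with no Wait (ms) inside it:

```java
import edu.wpi.first.wpilibj.Timer;

// Hypothetical background drive task. Without the Timer.delay() call the
// while loop spins as fast as the processor allows, pegging the cRIO at
// 100% CPU even though the robot appears to drive normally.
public class DriveLoop implements Runnable {
    public void run() {
        while (true) {
            updateDrive();       // hypothetical: read joysticks, command motors

            // This one line is the difference between a few percent and 100% CPU:
            Timer.delay(0.02);   // sleep ~20 ms so other tasks (and deploys) can run
        }
    }

    private void updateDrive() {
        // drive code would go here
    }
}
```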

I am not at all a LabVIEW programmer, but we had a similar issue of the CPU being pegged and weird things happening. It turned out to be certain “tasks” (events? what do you call them?) being executed in pseudo-random order in the loop because they were not wired together the way they apparently needed to be, though nowhere was it documented that they had to be. I believe part of the issue was the way we had implemented the compressor code or library.

Sorry for the vague description. That’s the best I have from my student. He knows what he’s doing but can’t communicate it to a level I can understand and re-communicate. This is why I’m the design and mechanical guy. :confused:

I saw the deploy issue with a few teams this year. I’ll probably be disabling the safety config in the disabled code next year. We are also asking the RT team to make the deploy more aggressive. Currently, it is too easy for a busy cRIO to take a long time to do the deploy.

It isn’t clear why this is called the double timeout, or how the deploy is related to the excessive CPU usage.

Were there diagnostic errors due to timing or other issues? That seems like a possible reason for the CPU usage to be high.

Greg McKaskle

I agree; that’s why I came here to share my lack of knowledge, so hopefully we can avoid the bug in the future. It was termed the double timeout bug by one of the NI guys, Kevin something. He said he had seen it before and hated it.

It kept giving us errors as if Begin was finishing after Periodic Tasks and such had already opened: refnum errors and the like that didn’t actually cause anything to break once the robot started, but they did put out LOTS of errors. I can post the code later today when I go to the shop if you’d like.

I was going to post on the “Watchdog Not Fed!!!” thread, but the quoted statement in this thread caught my eye.

Our robot returned from St. Louis, less one of our 10 CAN Jaguars, due to our bridge-tipper being removed for shipment after CMP. As we don’t have a bridge in our lab, we felt it was OK to leave the manipulator and its controlling Jaguar off.

We fired it up several times this summer and kept getting the watchdog error, accompanied by shuddering, where all of the motors would momentarily switch off and then right back on.

We immediately thought that the code was waiting for a reply from the non-existent jaguar, so we drew disable blocks over the related code in begin.vi and our timed_task.vi. The behavior did not change.

Returning to the above quote, our understanding was that drawing the disable blocks removed the underlying code from the compilation step. Is there still code (like the safety system) that gets compiled in, even if it is “disabled”?

Not sure if it makes a difference, but we are using the original 8-port cRIO, and occasionally find it temperamental to deploy code, or re-image.

– Len

The disable structure does what you think. It disables the code as if it were deleted. The wires leaving the structure have the default value or whatever you wire up to the Enabled frame.

The comment about safety being disabled was referring to a simple modification to the default Disabled.vi to ensure that it doesn’t cause safety errors.

As for the potential cause of your error: does the CAN topology make sense with that one Jaguar disabled? Also make sure that the CAN connections, cables, and terminator are good. Check to see if there are errors in the Diagnostics message box, and potentially add the Elapsed Times VI to the loop that you think may be running too slowly to update the RobotDrive often enough.
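In a text language the equivalent check is just timestamping each pass of the loop. Here is a rough WPILib Java sketch of the same idea; the class name, helper, and 0.1 s threshold are placeholders (0.1 s is roughly the default motor safety expiration):

```java
import edu.wpi.first.wpilibj.Timer;

public class LoopTimer {
    // Rough equivalent of dropping an Elapsed Times VI into the loop:
    // timestamp each pass and report anything slow enough to starve the
    // RobotDrive motor safety timer.
    private double lastTime = Timer.getFPGATimestamp();

    public void checkPeriod() {
        double now = Timer.getFPGATimestamp();
        double period = now - lastTime;
        lastTime = now;
        if (period > 0.1) {   // roughly the default motor safety expiration
            System.out.println("Slow loop iteration: " + period + " s");
        }
    }
}
```

You would call checkPeriod() once per pass of the loop that updates the RobotDrive and watch the console for slow iterations.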

Greg McKaskle

CAN is working great. We are using a star topology, and termination is working. 2CAN reports no errors. We’ll be looking at inserting elapsed times next week when we get back into the lab, after school starts.

The weird thing is that everything was “working great” after CMP for a few local demos. At least that is what our students maintain. This summer, there were no on-board code changes. We did change our custom dashboard to disable target seeking for our turret. Now it looks like our robot has epileptic fits. The packet structure out of the dashboard is the same, and we can switch over to the prior seeking code or our original calibration mode. Everything works, just with the watchdog error. This led us to chase possible intermittent connection causes.

We swapped out Ethernet cables and checked every cable connection we could think of. We had some prior problems with cable retention on one port of the D-Link switch, but this was not the problem. Not finding problems with any cables, we’re back to searching for potential software issues.

I didn’t even think to check for code in the disable.vi. I know that we reference and close that missing jaguar in the disable.vi. Do you think that could be triggering the problem?

– Len

I reread the post with less bleary eyes and noticed that you said you received a watchdog error. The disable topic was about Safety Config, so I had them confused.

Watchdog is not on by default, and Safety is enabled only for the RobotDrive. Please determine which is on, try turning it off and verify that the symptoms change. If the jerking is being caused by WD or Safety, then it means you are missing deadlines. It may also mean you have a workaround.
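If you happen to be working in one of the text languages, the experiment looks roughly like this (channel numbers and the expiration value are placeholders; the LabVIEW Safety Config VI controls the same setting):

```java
import edu.wpi.first.wpilibj.RobotDrive;

public class SafetyTest {
    // Rough sketch of the "turn it off and see if the symptoms change" test.
    // RobotDrive has motor safety enabled by default; if disabling it makes
    // the twitching stop, the loop feeding the drive is missing deadlines.
    private final RobotDrive drive = new RobotDrive(1, 2);  // hypothetical PWM channels

    public void configureForTest() {
        drive.setSafetyEnabled(false);   // testing only -- re-enable afterwards
        // ...or leave safety on and give the loop more headroom instead:
        // drive.setExpiration(0.25);    // seconds allowed between drive updates
    }
}
```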

Assuming you are missing deadlines, I’d verify you have no errors in the Diagnostics panel, as the current mechanism for catching errors and shuttling them to the window is quite heavy and can cause you to miss deadlines. If a missing jag is still being referenced, or a disable structure causes a wire to be bad and causes errors, …

The disable issue is relevant to the original thread because most robots are in disabled mode when they are being reimaged or reprogrammed. If disabled robots are throwing errors, they take longer to respond and sometimes require the No App switch or similar. Making the disabled code less CPU intensive when errors occur seems like it will resolve many of these issues. I don’t think the disabled code has any impact on your robot’s twitch.

Greg McKaskle

The closest thing I’ve seen to what you describe is when we deploy code from a given computer while tethered, then disconnect the tether while the code is running. If you reconnect your tether cable, you will be unable to redeploy code until you reboot the cRIO or power-cycle the robot.

I cannot say that I’ve ever seen a situation exactly like what you describe. Once the cRIO times out, it won’t do anything for us until we reboot it.

Regarding disabled mode and not deploying, we ran into the same problems with problematic deploys ALL the time last year (2011). What we found was that because we had every sensor, data accumulator, etc. enabled in disabled mode (including vision) so that we could debug, it pushed the CPU too high to do a successful deploy.

We now encase all of our disabled code in a single If/Then case connected to a button on the front panel. The default position of that button is OFF, so when we deploy to the robot permanently, none of the extraneous sensor stuff runs in disabled mode.

When we temporarily deploy from the programming computer, we turn the button on when we need the data to tune things, then turn it back off to deploy so there is no code running in disabled mode.
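For what it’s worth, the same trick in rough cRIO-era WPILib Java looks like this (the flag name and the telemetry helper are made up):

```java
import edu.wpi.first.wpilibj.IterativeRobot;

public class Robot extends IterativeRobot {
    // Equivalent of the front-panel button: defaults to off, so a normal
    // deploy runs almost nothing while disabled and the CPU stays free.
    // Flip it on only when tuning from the programming laptop.
    private static final boolean DEBUG_IN_DISABLED = false;  // hypothetical flag

    public void disabledPeriodic() {
        if (DEBUG_IN_DISABLED) {
            publishDebugData();   // hypothetical: encoders, gyro, vision stats, etc.
        }
        // otherwise do as little as possible while disabled
    }

    private void publishDebugData() {
        // debug-only telemetry would go here
    }
}
```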

That’s what has been puzzling us. We don’t have anything referring to the watchdog and certainly didn’t add any watchdog VIs, and we have safety config disabled on our drive. We didn’t add safety config to any of our other motor controls. Our cRIOs are old, dating back to 2008. Even though we re-image them several times a year, is there a possibility that old watchdog code is still lingering in a zombie state?

We’ll have to look closer this week. I don’t remember what else was coming up as errors. The watchdog one stands out, as we didn’t think we had any code that used that deprecated system.

– Len

We saw that as well. We pushed our vision processing onto our driver station to keep CPU utilization down. That’s an interesting solution for debugging and a quiet mode; we’ll probably implement something similar this year.

– Len

I’m a little late in posting about this, but here’s a follow-up on our issue.

Our main programmer deleted all of the code related to the no-longer-present CAN Jaguar. I verified that this code was covered by disable blocks, but in some cases it still had wires coming or going.

All of the problems and all of the watchdog errors we were having went away!

Now, our diagnostic log on the DS is completely empty, except for a new message that references the watchdog. Again, we are not in any way calling any watchdog functions in our team code.

The new solitary error code is as follows:

Watchdog Expiration: System 1, User 0

Because the robot is now working great, I’m of the opinion that this error doesn’t matter. At the same time, I am a little concerned because we shouldn’t have any error for a function we aren’t calling.

– Len

The system watchdog is based on communications and can’t be disabled. The user watchdog is what you can control programmatically. I believe it’s normal (or at least not uncommon) to see a single System watchdog expiration when starting the program.
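If you ever do want to use the user watchdog deliberately, the cRIO-era Java API looked roughly like this; this is a hedged sketch from memory, so check the WPILib version you have imaged, and the expiration value is just an example:

```java
import edu.wpi.first.wpilibj.IterativeRobot;

public class WatchdogExample extends IterativeRobot {
    // Hedged sketch of the cRIO-era *user* watchdog (verify against your
    // WPILib version). The *system* watchdog that produced the "System 1"
    // count is tied to Driver Station communication and is not under
    // program control.
    public void robotInit() {
        getWatchdog().setEnabled(true);    // user watchdog is off unless you turn it on
        getWatchdog().setExpiration(0.5);  // example: seconds allowed between feeds
    }

    public void teleopPeriodic() {
        getWatchdog().feed();              // must be called from every active loop
        // ... normal teleop code ...
    }
}
```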

That single system watchdog error is expected. It is an artifact of the timing when the robot transitions from disabled to enabled. Don’t worry about it.