Robot/Dashboard communication and the FMS

Team 4146 has been fighting a problem that is very difficult to debug. On rare occasions, the robot starts up in a strange state: some systems respond to controls and some don't. We have a climber that extends in and out automatically. When this elusive problem occurs, the arm just oscillates back and forth forever until power is cut. Upon rebooting the cRIO (without changing anything else), the robot behaves normally.

Naturally, staying true to Murphy’s Law, the problem happened at a very high frequency when we were connected to the FMS during our competition this past weekend.

Right now, my biggest suspect is the robot-dashboard communication. We program in Labview, so we use the WPI library of “SD” VIs to transfer data between the cRIO and the Dashboard. Our final code had about 20 or so double precision values being sent back and forth.
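
For readers on the text-based side of WPILib, here is a rough sketch of the equivalent exchange in WPILib Java. This is not our actual code (ours is Labview), and the key names and helper methods are made up for illustration; only `SmartDashboard.putNumber` and `SmartDashboard.getNumber` are real WPILib calls. The SD VIs do the same job of publishing and reading named values between the robot and the Dashboard.

```java
import edu.wpi.first.wpilibj.smartdashboard.SmartDashboard;

public class DashboardComms {
    // Hypothetical keys standing in for the ~20 doubles we send back and forth.
    private static final String SHOOTER_ANGLE_KEY = "Shooter Angle";
    private static final String CLIMBER_SETPOINT_KEY = "Climber Setpoint";

    /** Publish robot telemetry for the Dashboard to display. */
    public static void publish(double shooterAngleDegrees) {
        SmartDashboard.putNumber(SHOOTER_ANGLE_KEY, shooterAngleDegrees);
    }

    /** Read a tuning value back from the Dashboard, falling back to a safe default. */
    public static double readClimberSetpoint() {
        return SmartDashboard.getNumber(CLIMBER_SETPOINT_KEY, 0.0);
    }
}
```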

The reason I suspect the data transfer is that, in our attempt to fix the problem, I prevented all of the code in Periodic Tasks from executing during autonomous mode except for the SD comms. The problem still occurred immediately in autonomous mode. At one point we even had our autonomous code completely disabled, so the ONLY thing running during autonomous mode (besides all of the FRC background code) was the SD comms in Periodic Tasks.

I’m curious whether Greg McKaskle or Mark McLeod (or any other Labview gurus) have any thoughts on this. The main difficulty is how inconsistently the problem occurs: it happens much more frequently when connected to the FMS during a match, but we obviously can’t test and develop in that environment.

Dear Rustree,
I had field communication issues when using the KOP 09 laptop as the field driver station while sending some Dashboard parameters that were not critical for field operation. I eliminated as many Dashboard parameters as possible and set up a newer laptop as the driver station, and my intermittent (slow-response) robot operation on the playing field ceased.

There is an FRC Driver Station Log Viewer, already on your laptop as part of the Labview installation package (with FRC updates), that provides valuable information about your robot’s operation. I recommend that you find it and take a look at your CPU utilization and trip-time latency. When I had “issues,” the trip time averaged about 10 ms, with spikes above that whenever we were having field issues. After the above changes were made (new software, new laptop), the field communication issues (slow robot operation) went away and the trip latency was consistently 4 or 5 ms.

Hope this helps,

Marc,

Thanks for the reply. We were using a reasonably powerful laptop (something much faster than the KOP Classmate), so I doubt it was a straight-out processor-speed problem. That said, we never had a chance to try a different laptop during the competition, so the one we used could still be a culprit. On the other hand, the FTA at our event (Mark McLeod) did spend a few minutes with us after one of our matches reviewing our driver station log for any abnormalities. Nothing caught his eye; in fact, everything looked fairly normal.

I’m suspicious of deeper-level corruption because some basic controls on the robot, ones that don’t depend on successful Dashboard-to-robot communication, became disabled. It was rather inexplicable.

For our last two matches, I removed all of the SD VIs and just hard-coded the values. For those matches, at least, the problem seemed to go away. I just wish there were some way to test everything in an FMS environment.

-Dan

Dear Rustree,

Let me go one step deeper… it may or may not be related. During our Michigan District Competition event at Waterford, we lost connectivity to 3 CIMs on our drivetrain. I troubleshot it down to a bad Digital Sidecar (or so I thought). In teleop I could see the joystick commands via the dashboard, but there was no response. I connected each CIM in turn to the sidecar’s PWM3 output and they all worked there, but not on their own PWM4, PWM5, and PWM6 outputs. Note that the software code was quite stable. Assuming a bad Digital Sidecar, I checked one out from spare parts. Upon replacement I had the same behavior.

Having little hair to spare, and 30 years of embedded microprocessor experience, I was at a loss… so I redeployed my code to the cRIO and the problems went away. I replaced the spare-parts Digital Sidecar with my original and everything was fine again (PWM3, PWM4, PWM5, and PWM6 all worked). We had no further problems for the remaining 8 or 9 matches.

Moral of the story… sometimes redeploying your known-good code to a cRIO with very unusual problems may solve your dilemma. Sometimes! This advice and 50 cents gets you a coffee at the 4th-floor coffee station in Pontiac, MI.

Funny you should mention that.

After six matches on Friday with all kinds of bizarre behavior, I spent several hours Friday night modifying the code to incorporate suggestions given to me by some Labview experts. Sure enough, on Saturday morning, everything seemed to work perfectly with the new code! We did well in our last three qualification matches and an extremely generous team 1492 picked us to be in the eliminations with them.

To prepare for the quarter-finals, we adjusted the angle of our autonomous shooter a little by changing exactly one global variable. I redeployed the code annnnnd guess what: in the next two matches the problem came back.

Was it because I redeployed with virtually identical code, or would it have happened anyway? I’m also an embedded controls guy… but I still have some hair left. I’m afraid I might lose it all before we get to the bottom of this problem.

What sort of CPU utilization are you seeing on your cRIO when the issue is occurring?

This year, while I was away, some of our students ran into an issue where one of our periodic loops was overrunning the amount of time we had allotted for it, just slightly. They lost a lot of hair trying to figure out why ghost problems seemed to appear and disappear uncorrelated with the changes they were making. It turned out that after about six minutes of on time, the overrun left our cRIO with no CPU time to perform its other tasks, and seemingly ghost-like issues would appear: solenoids moving in and out, lag, and inconsistent behavior on multiple subsystems.

Correcting the loop naturally resolved the issue. For the record, the acquire or create semaphore reference VI is not the fastest thing in the world, and you should save semaphore references, rather than fetch them with it, if they’re going to be used heavily. :stuck_out_tongue:
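
For anyone who wants to catch this kind of creeping overrun early, here is a minimal sketch of the idea in Java (Labview diagrams don’t translate to text, and this is not our actual code; the period and names are assumptions): a fixed-rate loop that measures its own execution time each iteration and complains as soon as it blows its budget, instead of quietly eating into the CPU.

```java
/**
 * Minimal sketch (illustrative only): a fixed-rate task that measures its own
 * execution time each iteration and reports overruns, so a creeping overrun
 * shows up in the logs long before it starves the CPU.
 */
public class PeriodicTask implements Runnable {
    private static final long PERIOD_MS = 50; // 20 Hz budget, assumed for illustration

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.currentTimeMillis();

            doWork(); // the real periodic work goes here

            long elapsed = System.currentTimeMillis() - start;
            if (elapsed > PERIOD_MS) {
                // The loop took longer than its period: this is the creeping overrun.
                System.err.println("Loop overrun: " + elapsed + " ms (budget " + PERIOD_MS + " ms)");
            } else {
                try {
                    Thread.sleep(PERIOD_MS - elapsed); // wait out the rest of the period
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }

    private void doWork() {
        // Placeholder for subsystem updates.
    }
}
```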

For the record, the acquire or create semaphore reference VI is not the fastest thing

Are you talking about the actual semaphore VIs in the palette? If so, the issue may be more closely related to leaking the refnums. All I/O handles in LV are shadowed so that aborting or completing will do the close/free operations. And if you leak, that is a lot of resources to shadow. If you probe the refnum and see a unique one each time, you definitely need to close it.

If it is the WPILib Get VIs, which one is it you are commenting on?

Greg McKaskle

Not to derail the original thread (my comment about the semaphore VIs was not related in any known way to the issue the OP was describing), but we were using the ‘Obtain Semaphore Reference’ VI from the Semaphore VIs palette inside a 20 Hz loop. The semaphore seemed to be behaving properly, protecting the shared global it was in place to guard, and moving the obtain-reference call outside the 20 Hz loop and passing the reference in through a tunnel demonstrably corrected the issue we were observing.
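
For readers who don’t speak Labview, the change was roughly equivalent to this Java sketch. The NamedSemaphore class below is a made-up stand-in for the named-semaphore VIs (it is not a real WPILib or LabVIEW API), and the loop count, period, and name are all invented for illustration; the point is simply “obtain once outside the loop and reuse the reference.”

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/**
 * Made-up stand-in for a named-semaphore lookup: obtain() finds (or creates)
 * the semaphore registered under a name and hands back a reference to it.
 */
class NamedSemaphore {
    private static final ConcurrentHashMap<String, Semaphore> REGISTRY = new ConcurrentHashMap<>();

    static Semaphore obtain(String name) {
        // In Labview, the Obtain Semaphore Reference VI does this lookup/creation;
        // that call was the part that proved too expensive to repeat at 20 Hz.
        return REGISTRY.computeIfAbsent(name, k -> new Semaphore(1));
    }
}

class TwentyHzLoop {
    // Before: the reference was fetched by name every iteration, inside the loop.
    void slowVersion() throws InterruptedException {
        for (int i = 0; i < 200; i++) {
            Semaphore guard = NamedSemaphore.obtain("shared-global-guard"); // per-iteration lookup
            guard.acquire();
            try { /* read or write the shared global */ } finally { guard.release(); }
            Thread.sleep(50); // 20 Hz
        }
    }

    // After: obtain the reference once, outside the loop, and pass it in
    // (the equivalent of wiring it through a tunnel in Labview).
    void fixedVersion() throws InterruptedException {
        Semaphore guard = NamedSemaphore.obtain("shared-global-guard");
        for (int i = 0; i < 200; i++) {
            guard.acquire();
            try { /* read or write the shared global */ } finally { guard.release(); }
            Thread.sleep(50); // 20 Hz
        }
    }
}
```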

I was asking for several reasons. From the manual: “If you use this VI to obtain multiple references to the same named semaphore, each reference number is unique.”

So it is saying that each obtain should be balanced with a close. If it isn’t, you wind up with quite a few refnums in short order.

I also ask because the semaphore is often not the right tool for the job. It is a rather dangerous way to protect data compared to the critical section that is built into every subVI. So the recommended way to protect a global is simply to wrap it in a function that is not reentrant. This critical section is considerably safer because it cannot be mismatched. The semaphore is only needed when the acquire and release are far removed from each other.
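
In Java terms the same recommendation looks roughly like this (an illustrative sketch with made-up names, not anyone’s actual code): the shared value lives behind accessors that are themselves the critical section, so the lock can never be mismatched or forgotten.

```java
/**
 * Illustrative sketch: the shared "global" lives inside one object and every
 * access goes through synchronized methods. The method body itself is the
 * critical section, so acquire and release can never be mismatched.
 */
public class SharedAngle {
    private double angleDegrees; // the value being protected

    public synchronized double get() {
        return angleDegrees;
    }

    public synchronized void set(double value) {
        angleDegrees = value;
    }

    /** A read-modify-write stays atomic because the whole method is the critical section. */
    public synchronized void nudge(double delta) {
        angleDegrees += delta;
    }
}
```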

Greg McKaskle

I see what you’re saying; I hadn’t picked up on that nuance of the semaphore functionality. That entirely explains the issue we were observing. Since we scaled back to only the semaphore acquire on loop entry, we see no issue any longer anyway, but it’s certainly better to understand exactly what’s going on than to leave any ambiguity.

On the topic of methods for handling critical sections, I agree. More recently we’ve been using functional global variables to implement class-like structures, but I find the concepts involved more difficult to explain to students than semaphores. Tasking is a deep enough topic as it is to explain to high-school-aged kids, and I find semaphores a better introduction to the concept of mutual exclusion (especially beyond just Labview) than diving straight into non-reentrant functions.

Hmm. They certainly need to learn about semaphores too, but the baker’s algorithm or semaphore tied to a function is just so simple and safe. Anyway, glad to have helped. This was fresh on my mind because … guess what I did not too long ago. It really messes with the timing after thirty seconds or so.

Greg McKaskle