RC Error: Bad Download!?

On Saturday, April 12, at Nationals, our robot experienced a strange Robot Controller error which crippled it during the User Control Mode portion of both matches we had that day. To view these matches, check out arc_108 and arc_123 at http://www.soap108.com/2003/movies/arc/.

Early on Saturday, we perfected an autonomous mode program designed to take out an HP stack and then go up the center of the ramp (showcased in Match #123). We downloaded the program, and all 4 pages downloaded successfully (at least, that is what it said). We proceeded to our first match of the day (#108). In that match we ran an old autonomous routine (straight up the center). However, the pitch gyro on the robot did not stop the robot after it went over the ramp (as it should have), so as a result the robot hit the wall. There the robot sat for most of the match, except for a short period at the end where it crawled with drastically reduced power.

Once the robot got back to the pit, we frantically tried to troubleshoot the problem. Nothing was disconnected or broken. We tethered the robot and ran User Mode (we did not run Auto Mode first, but in hindsight we should have). Everything ran fine; the motors were giving full power. With the pit announcer cueing us for about the third time, we rushed the robot back to the field hoping the error would not happen again. The only explanation I could come up with for the last match's performance was: the gyro malfunctioned, causing the drive motors to stall out against the wall, causing the voltage to drop below critical, causing the controller to enter Safe Mode (a problem we had last year), causing the motor power to be reduced.

In Match 123, we ran the new autonomous routine. We successfully knocked over an HP 5-stack and went up the center of the ramp. At the end of auto mode, the robot rolled back down the ramp, where it remained immobile for the rest of the match. As in the last match, functions other than the drive train seemed to be working fine. I knew then that Safe Mode was not the problem.

We got the robot back to the pit and ran new tests. We tethered the bot and ran User Mode; it was fine. We ran Auto Mode (with the dongle) and then User Mode, and the problem occurred again. We figured there must be something wrong with the variable declarations across pages, so we loaded (successfully) the program we used coming into Nationals (although it had identical variable declarations). The robot worked flawlessly. Curious, we reloaded the new program from that morning (the exact same one that was having problems), and all 4 pages downloaded successfully. The problem was gone! We even ran a practice match on 118's ramp to confirm that the robot was working correctly.

The best explanation I can give for this strange occurrence is that a communication error during the download (which reported success) caused a pointer to be changed. I assume the RC uses a CRC (Cyclic Redundancy Check) to verify data integrity over the download. A CRC is a very good measure, but there is still a slight chance the data was corrupted somehow. All that would have to happen for a motor problem like this to occur in User Mode would be a bad reference for p1_wheel. We don't declare p1_wheel in the Autonomous pages of the code (since it would be 127), but we use it as a throttle in the User Mode page (it scales the joystick; a low value would nearly disable it). Maybe it got crossed with the pitch gyro integration variable (which didn't work in the first match and wasn't used in the second), which is used only in the Autonomous pages. The throttle (p1_wheel) was turned all the way up while the error occurred, so that was not the problem. The throttle behaved normally after the program was reloaded.
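As a rough illustration of how a single bad throttle reference could produce exactly this symptom, here is a sketch in Python (this is not our PBASIC code; the 0-255 range with 127 as neutral and the exact scaling formula are assumptions based on how the IFI-style control system works):

```python
# Sketch of joystick scaling by a throttle value like p1_wheel.
# Assumption: joystick and wheel are 0-255 with 127 as neutral,
# and output is a PWM value in the same range.

def scale_drive(joystick: int, p1_wheel: int) -> int:
    """Scale joystick deflection from neutral by the throttle.

    p1_wheel = 255 means full power; a corrupted reference that
    reads a tiny value would nearly disable the drive, while
    everything else on the robot keeps working normally.
    """
    deflection = joystick - 127            # -127 .. +128
    scaled = deflection * p1_wheel // 255  # attenuate by throttle
    return 127 + scaled                    # back to 0-255 PWM range

# Healthy throttle: full joystick gives full power.
print(scale_drive(255, 255))  # 255
# Corrupted throttle reference reading a tiny value: drive nearly dead.
print(scale_drive(255, 4))    # 129, barely above neutral
```

Notice that no code "breaks" in this scenario; the robot just crawls, which matches what we saw at the end of Match 108.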

I think this error occurred because a fluke hardware miscommunication corrupted the software in the EEPROM banks of the RC. I couldn't put debug statements in the code, and I cannot reproduce the error, even with the same code, so I will never know for sure. If anyone has any insight into this occurrence or similar experiences (however unlikely that is), please post about them to help better understand this problem. I will answer any questions you post or PM about the incident.

P.S. Sorry for the length of this post, but I thought I should tell my whole story.

We had a related RC experience at Nationals, but luckily the driver noticed it before going out on the field.
Either the slot we use for initialization had a corrupted communications statement or the Main driving program slot had a corrupted Serin statement.
Re-downloading the program without any changes corrected the problem and it didn’t recur.

The same thing happened at least once in January during program development. Otherwise, our downloads have been very reliable, but unfortunately, you can't depend on the PBASIC editor verification checks. I kind of doubt it does a verification check at all, but if it does, it's probably limited to checksums, where two wrongs can make a right answer.

*Originally posted by The Lucas *
I assume the RC uses a CRC (Cyclic Redundancy Check) to verify data integrity over the download. A CRC is a very good measure, but there is still a slight chance the data was corrupted somehow.

I took a look at some of the data packets produced by Parallax’s tokenizer. (I’ve been playing around with adding it to my BASIC Stamp Preprocessor.) They actually use a simple checksum rather than a CRC. This dramatically increases the probability that a communication problem will slip through undetected.
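To make the "two wrongs can make a right" failure concrete, here is a small Python sketch (a generic illustration, not Parallax's actual packet format) showing a two-byte corruption that an additive checksum cannot see but a CRC-32 catches:

```python
# A simple additive checksum vs. CRC-32 on a corrupted packet.
# binascii.crc32 is Python's standard-library CRC-32 implementation.
import binascii

def simple_checksum(data: bytes) -> int:
    """Sum-of-bytes checksum truncated to 8 bits."""
    return sum(data) & 0xFF

good = bytes([0x10, 0x20, 0x30, 0x40])
# Two bytes corrupted so the errors cancel: +1 on one, -1 on another.
bad = bytes([0x11, 0x20, 0x2F, 0x40])

# The additive checksum cannot tell the packets apart...
print(simple_checksum(good) == simple_checksum(bad))  # True
# ...but a CRC-32 distinguishes them.
print(binascii.crc32(good) != binascii.crc32(bad))    # True
```

Any pair of byte errors that sum to zero slips past a checksum like this, which is exactly the kind of corruption a noisy serial cable can produce.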

We are looking for input on any electrical problems, especially with the controllers. We have been finding some interesting results from our many conversations with people. As Greg Ross has related in his post, the error checking on the download is not that great, but everyone is using it and it has very few problems.
What we have experienced most is the substandard serial cables that many teams pick up. Frequently we were asked to come to a pit and help diagnose a problem, only to find a bad cable, or three or four cables strung together, or most often, a damaged cable (i.e. nicks, sharp bends, electrical tape over a cut, etc.). Once the suspect cables were replaced (or borrowed), most problems went away.
Other problems that occurred were just dumb mistakes caused by the emotion of the moment. During one match, our robot just sat still doing nothing during auto mode. Everyone thought that was our strategy, but in reality an official thought something was wrong at the player station and hit reset on the field interface; that unset auto for our station, and we did not reinitialize after that. We now recheck that the OI is reset properly and ready to initiate auto mode at the start of every match. Problem solved.
We also found that some teams (especially during practice) are plugging in their controllers on the next field before the match starts. This turns their modems on, and they try to handshake with the robots that are next to play. The result apparently is bad data packets. Apparently, if there are enough bad data packets, the interfaces shut down (failsafe), and the only way to correct it is to reset or repower. (I'll let someone else who knows elaborate on that process.)

During the Midwest Regional we were finding many instances of bad data throughout the practice day. That pretty much vanished by game day, thankfully. We were able to find many teams who were powering up in the pits without tether, and during practice, powering up the player side while another team was practicing. (Although we approached the teams and asked that they not run untethered in the pits, some continued throughout the weekend.)

We were not able to put all the evidence together until recently. After discussions with Innovation First and all the teams that have had problems with communications (including us), we believe that most of the problems are due to multiple modems coming on. At Nationals, with four fields running four robots at the same time, it is a tribute to the control system (and Innovation First) that we did not have more problems.