[FTC]: Playing Forensic Roboticist.

I love trying to find out why things work (or don’t work).

Most teams at the MD Regional (the only one I attended) were frustrated at least once by having dead controls at the start of Teleop. There didn’t seem to be any reason for this, as the FMS indicated that the connection was good and the programs were running.

I’ve been doing a lot of code tweaking over the last day, and I’ve seen something which may shed some light on this.

I was running mock matches with the bot up on blocks, and I was just running and re-running the same code looking for a problem.

At least 3 times in a sequence of 10 runs, the program ran Autonomous, and then when it was meant to transition to Teleop, it stopped the Auto program and then failed to restart the NXT in Teleop. It just sat there with the “ready to run” screen up.

What was more disturbing was that the FMS system indicated that the NXT was “running”, whatever that means.

So to all intents and purposes it “appeared” that everything was fine, but there was no way to drive the robot.

At the end of teleop, the match was stopped but if you looked at the NXT then, everything seemed OK (since the program SHOULD have been stopped at that point).

So I suspect that if the real FMS software works the same way, there may have been many instances where the USB controllers and Bluetooth were just fine, but there was no Teleop program running. (Just frustrated kids wiggling controls)

I wonder if this problem could be mitigated by sending several “RUN TELEOP” commands to the NXT. One is bound to catch.
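A minimal sketch of the resend idea, in Python rather than anything that actually runs on the FMS. The names `start_with_retries`, `send_start`, `is_running` and the `FlakyNxt` simulation are all hypothetical, and the sketch assumes some trustworthy way to confirm the program started, which the FMS “running” indicator may not be.

```python
from typing import Callable


def start_with_retries(send_start: Callable[[], None],
                       is_running: Callable[[], bool],
                       attempts: int = 3) -> int:
    """Send the start command up to `attempts` times.

    Returns how many commands were sent before the program was seen
    running, or 0 if it never started.
    """
    for sent in range(1, attempts + 1):
        send_start()
        if is_running():
            return sent
    return 0


class FlakyNxt:
    """Hypothetical stand-in for an NXT that ignores the first few starts."""

    def __init__(self, ignore_first: int) -> None:
        self.ignore = ignore_first
        self.running = False

    def start(self) -> None:
        if self.ignore > 0:
            self.ignore -= 1      # command lost, program stays stopped
        else:
            self.running = True


nxt = FlakyNxt(ignore_first=1)
print(start_with_retries(nxt.start, lambda: nxt.running))  # prints 2
```

Without a reliable status check, the same loop could simply send every attempt blindly.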

Now I’m thinking of a way to indicate that the TELEOP program is running on our NXT, so we can alert the field crew that we have a problem that’s not code related. Maybe a flashing red light sensor???

Anyone else see this while testing?

Interesting… Were you running a full field (i.e. 4 robots)? I have not yet seen this problem in our team’s test field, but we have only been running our single robot with the FMS. We did experience this at the Ontario tournament along with several Bluetooth issues.

I’m going to try to set up a similar test bed as yours, but with just 4 NXT controllers and the FMS to see if I can recreate the failure rate (3 out of 10) that you reported. Can you tell me if you are using Windows XP or Vista? And what service pack level are you using?

Like yourself, I’m also interested in getting down to the bottom of these mysteries.

I was running a minimal system, with 1 Robot and one Controller.

In this instance I’m running Windows XP Professional 2003 SP3

Phil,
We have never seen what you describe, and we’ve run the FMS hundreds of times trying to get the autonomous to do what we tell it. Teleop always runs fine after autonomous. We see the opposite a lot, where autonomous doesn’t run on the first try. Removing the battery etc. seems to make autonomous more reliable, but is a big pain when you are running multiple times to tweak your autonomous.

Also, we’ve been in three competitions (VA, DE, and MD) with two robots and haven’t seen it in competition. At Delaware, many other teams had issues which they now claim were due to faulty USB hubs. We did lose control once near the end of a run when a robot ran into the wall, which supports their faulty USB hub theory.

Does the NXT look like it’s running? Have you tried adding a LEGO motor to see if it responds? Have you tried with the Logitech controllers plugged directly into the computer rather than into a USB hub, or vice versa? We seem to be having an issue on our laptop with the USB ports getting loose due to all the plugging and unplugging.

Jon

In the example I just described, the NXT was clearly not running.

It’s the old story though: the more debugging you put in, the harder it is to know whether you are affecting the operation of the program.

To make it obvious that the program was running, I added a light sensor, and turned on/off the “illumination” to make it act as a signal light. One flash per second for Auto, and two flashes per second for Teleop.

It made it pretty clear when the program was running/stopping.
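The flash timing can be sketched roughly like this, in Python for illustration only (the real thing toggles the light sensor’s illumination from the NXT program). The names `light_on` and `count_flashes` are made up for this sketch, and it assumes an even on/off split within each flash period.

```python
# Flash rates taken from the description above: 1/sec for Auto, 2/sec for Teleop.
FLASHES_PER_SECOND = {"auto": 1, "teleop": 2}


def light_on(mode: str, elapsed_ms: int) -> bool:
    """Return True when the signal light should be lit.

    Each flash period is split evenly: lit for the first half, dark for
    the second half (an assumption of this sketch).
    """
    period_ms = 1000 // FLASHES_PER_SECOND[mode]
    return (elapsed_ms % period_ms) < period_ms // 2


def count_flashes(mode: str, duration_ms: int, step_ms: int = 10) -> int:
    """Count off-to-on transitions seen while sampling every `step_ms`."""
    flashes = 0
    previous = False
    for t in range(0, duration_ms, step_ms):
        current = light_on(mode, t)
        if current and not previous:
            flashes += 1
        previous = current
    return flashes


print(count_flashes("auto", 1000), count_flashes("teleop", 1000))  # prints 1 2
```

If the light stops changing altogether, the program has stopped, whatever the FMS says.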

Just this morning I caught the program terminating when it wasn’t meant to.
I thought I saw a message go by, so I hooked up a video cam to watch the sequence. After running perfectly for 10 times, I finally caught the glitch.

http://www.ourcoolhouse.com/images/NXT_FileError.wmv

If you play the linked video, you’ll see the program shut down at T=14 seconds. If you freeze frame it, you’ll notice an interesting error message just as it dies. See attached pic.

  • File Error - Wassup ???

OK, I guess I have some more googling to do.

Note: I have checked my NXT memory and I’m only about 1/2 full.
57K out of 115K bytes used???

FileError_0001.jpg



All:

Welcome to another episode of CSI: FTC. Excellent posts and great detective work.

Phil: Listening to the sounds of the movie you posted, it sounds as if you are manipulating your arm at the time of the “File Error”. Maybe there is something that is causing an electrical surge on the NXT cable from the motor to the controller to the NXT. Are all your cables okay? Any static issues possible? (I know that’s an FRC issue this year). The observed behavior is that something is causing the NXT to abnormally terminate your teleop program. This kind of thing needs to get sent to PITSCO and HiTechnic (and maybe even NI) as this appears to be a problem with hardware causing the fault.

Mannie

Hi Mannie.

You’ve got a good ear. Yes, in this instance it happened just when the arm was moving. It is possible that this is a common occurrence, although it’s hard to say because the fault is only really apparent when a motor screws up. The previous sound also indicates that the wheel drive had reached its endpoint but had not been taken out of “servo” mode, so it sat there humming until the NXT timed it out. So it’s likely that the actual problem had occurred earlier, and it wasn’t until the NXT commanded the next wheel movement (after the arm move) that the File Error occurred.

Our wiring is pretty good (all tied in place, and protected from vibration etc.) but I know that these motors can surge a lot. One problem with the very strict FTC rules is that there is no opportunity to use good electrical practices to minimize possible problems on a larger robot. For example, it’s required to daisy chain the power through all the controllers. The rules prevent using heavier gauge wire, or making any soldered distribution bus to give each controller its own dedicated power lines. I cringe at the current surge that must be going through the first set of screw terminals. I also wish that I could use a resettable fuse (like FRC) on the motors to prevent that horrible burned wire smell from being set free.

My problem is that the fault is random, and unlikely to be replicated by Pitsco or NI. When doing my earlier encoder tests, I was able to generate some good plots showing the problems, and these were generally addressed by NI, but now the problems are less reproducible.

As you imply, I do think that the robot structure and code complexity is causing a problem… just HOW is the question.

One problem with the NI code is that there are lots of places where there is an “error” output from a particular VI. The issue is: what does this mean, and what should we do about it? If a move command generates an error output… is there really any way to recover? Probably not. I should probably just abort Auto and wait for Teleop.

Both NI and HiTechnic have been helpful, but both imply the problem lies with the other’s equipment. No one is eager to step up to be the systems integrator and really address the problem. It appears to be falling on the end user to do this.

Two ideas gelled for me today, and made me go Hmmmmm.

After a bunch of perfect test auto runs, I had my File Error occur on two runs in a row. It occurred to me that I hadn’t powered down the NXT for a while, so I did, and things started working again.

It’s amazing how we assume that computers “just run” these days, because it took me this long to suddenly think “memory leak”.

All of a sudden I wondered if there is a memory leak in the LabVIEW NXT firmware, and perhaps these problems occur based on how long the NXT has been running, or how many times a certain type of operation is performed.

I did some more research on the “File Error” message and it seems like it may be related to 1) An invalid array index or 2) failure to allocate dynamic memory.

I could see how a memory leak could cause #2.

I also wonder if just sitting on the playing field, running the code template for 2-3 minutes could cause a memory leak to grow to the point where it shuts down the NXT program when you start really doing stuff.

On a related note, during FLL season, I casually used an FTC NXT in one of our FLL robots (that was programmed in NXT-G). This NXT had been upgraded to firmware version 1.12. After several instances of the NXT program just locking up while it was running, I went back to the “release” 1.05 firmware. Never had the lockups again… Hmmmmm.

Does anyone know if there is a way to read the amount of “free memory” while running a program on the NXT… it might be enlightening…

Phil.

Hmmmm… I wonder if ROBOTC users see this as well.

NI/PITSCO - please notice this and investigate.

Phil,

I can’t help you with checking available memory while running, but we have seen file errors before. However, all of these could be traced back to coding errors. At the moment, I cannot remember the exact problem, but I think it had to do with attempting to talk to a motor or servo controller on the wrong port (or wrong place in the daisy chain). At the time we were probably using NXT-G (but it could have been LabVIEW; we went through all 3 packages before settling on RobotC). Is it possible that you have a coding problem in a section of code that is only run on rare occasions, possibly due to the position of the robot at the end of autonomous/beginning of Teleop?

As for the frustration of the robot not responding to controls, our team managed to experience that twice at the MD competition. The first time it was caused by a team member forgetting to plug the external battery in when we swapped for a fresh one. Fortunately our team mate won that qualifying match for us. The second time was in the semi-finals when the RobotC software was not detecting the external battery and refused to run. After the match, we unplugged and re-plugged the battery and it registered normal!?!

We competed 1 week later at the PA tournament and had no control or connection problems with our robot (we also reset Bluetooth between all matches). However, in our team’s second semi-final match at least 3 robots locked up (though one may have been broken) just as our team mate was about to score 2 racks. This would almost have to have been an FMS or USB problem.

David

PS. Good luck at Atlanta

Testing on RobotC:

So far I have not been able to recreate Phil’s FMS problem, nor have I seen the File Error problem. We are using RobotC, so that may be a factor in the system. I have experienced Bluetooth problems when the NXTs auto power down.

I will keep on testing to see if I can recreate the problem.

Paul Tan.

Mysteriously, software errors have not been our biggest problem, although we did lose a pair of drive motors when a robot went into “drive full speed forever” in autonomous and the referees wouldn’t let our drive team retrieve the robot before the motors released their magic smoke. This affected all four robots in the match, although only two smoked their motors. Apart from one other incident, where it took the FMS about 15 minutes to talk to one of our robots (which had happily talked to the same controller an hour before), our on-field software incidents have been very few. We use RobotC.

Your team name seems well earned :slight_smile:

… Exothermic Robotics Club

Well, as the students say, it was on fire when we got there.

That’s good news because it probably means that my “fault” is not systemic. It may be localized just to complex LabVIEW code.

With only my “gut” to go on, I’m leaning towards the conclusion that the NXT Firmware that’s required for LabVIEW has some sort of memory leak or other error that causes the program to terminate if it’s been run multiple times without being re-booted.

It also seems that downloading a new program can cause similar issues unless the NXT is rebooted before running. I’ve seen this a few times.

It’s quite possible that this problem is aggravated by the complexity of our code in general, but my experience from testing today is that if the NXT is rebooted EVERY TIME before a match, I never see the Program Abort (which seems to be due to the “File Error” problem).

My only concern is that if we have to wait too long before actually beginning Auto, the same problem may materialize. I only hope that the memory loss (or whatever) is due to something like memory not being freed up when the program ends… Thus the first run should always be safe.

… Some other stuff I’ve determined… that may only apply to LabVIEW

It seems like the generic driving “glitches” that we’ve been experiencing may come from incomplete motor commands reaching the controller. The HiTechnic motor controller is not able to tell when a full message has been received, so if only some of the bytes arrive, the controller acts on them anyway. If these bytes are part of a “Specify Position” type command, then the command is only sent once, and consequently if it’s incomplete, the robot will go to an incorrect encoder destination.

Instead, by continually resending the desired command, I’ve been able to completely eliminate any erratic drive actions. I split the “Specify Position” VI into two parts: one part to set the destination, and the other part to check if it’s reached it yet. I can now call them in a loop without “blocking”, so that any incomplete command transfer is overwritten by the next resend. By intermixing both Wheel and Arm commands I can command both systems to move simultaneously, with them both stopping automatically when they reach their target position.
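Here is a rough Python sketch of that “set destination, then poll” split. It is a simulation only, not LabVIEW and not the HiTechnic protocol: `SimulatedController`, `garble_first` and `move_all` are hypothetical names, and a garbled command is modeled as simply storing half the requested target.

```python
from typing import List, Tuple


class SimulatedController:
    """Hypothetical motor controller that garbles its first few commands."""

    def __init__(self, garble_first: int = 0) -> None:
        self.garble = garble_first
        self.target = 0
        self.position = 0

    def set_target(self, target: int) -> None:
        if self.garble > 0:
            self.garble -= 1
            self.target = target // 2   # partial message: wrong destination
        else:
            self.target = target

    def step(self) -> None:
        """Advance one encoder count toward whatever target is stored."""
        if self.position < self.target:
            self.position += 1
        elif self.position > self.target:
            self.position -= 1


def move_all(moves: List[Tuple[SimulatedController, int]],
             max_cycles: int = 1000) -> bool:
    """Resend every target each cycle, then check positions, without blocking."""
    for _ in range(max_cycles):
        for ctrl, target in moves:
            ctrl.set_target(target)     # a resend overwrites any garbled target
        if all(ctrl.position == target for ctrl, target in moves):
            return True
        for ctrl, _target in moves:
            ctrl.step()
    return False


wheels = SimulatedController(garble_first=1)
arm = SimulatedController()
print(move_all([(wheels, 100), (arm, 40)]), wheels.position, arm.position)
```

With a single send, the simulated wheels would have stopped at the garbled target; with the resend in the loop, both mechanisms end up where they were told to go.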

A sidebar to this discovery is that anyone who has been doing their own “drive and check encoder” loop probably hasn’t seen quite as many driving glitches, because they have been overwriting any partial commands.

With these latest code changes, and some other timeout tweaking, we’re feeling a LOT better about our BOT’s Auto performance. We did several tests of each of our 8 Auto modes (red/blue, floor/ramp, score/spoil), and never saw a single glitch.

Phew… Things are looking up :slight_smile:

I have done lots of testing with RobotC on two NXTs, and I haven’t had any problems yet. At the competition, though, this happened to me, my alliance partners, and opposing alliances many, many times. This needs to be solved, hopefully before the championship, as it’s probably the most backlogged tournament already.

I’ve barely been able to recreate what actually happened to us at the tournament – basically at the Ontario FTC Championship, we saw our robot keep moving, even though we had lost all joystick control.

We’ve been at this for the past week, so this is the first time that we actually saw it happen again.

On the screen of the NXT, it shows that the bluetooth was disconnected (the diamond beside the bluetooth symbol became a less than sign), but the teleop program was still running. The motors were still doing the last thing we told it to do before the bluetooth connection got lost – AND THE ROBOT DID NOT SHUT DOWN THE MOTORS after the connection was lost.
The NXT screen was still showing “Teleop RUNNING”

I don’t know if we can recreate this problem again or not as it was the first time we saw it in the past week of testing and practicing. The FMS showed that the robot was still connected on bluetooth.

This is a brand new NXT (out of an unopened box, as we had replaced the original NXT thinking it might have been flaky), RobotC 1.46 with the latest firmware, running the competition templates.

This happened to pretty much every team at least once at the Ontario regional (IIRC). Specifically HiTechnic needs to work on this…

Not sure if this is a HiTechnic problem or not. The NXT was still running the Competition template code when it lost the bluetooth connection – shouldn’t that have stopped the robot by terminating the teleop program? That’s what I would have coded for safety (i.e. an out of control robot still running in teleop… think of the burned out motors, let alone the refs running away! – remember, the NXT will never know when to stop as the bluetooth link disappeared)
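The kind of safety check I have in mind could look something like the following. It is a Python sketch rather than RobotC, and it is not how the competition template actually works. `LinkWatchdog`, `motor_power` and the 500 ms timeout are all assumptions made for illustration.

```python
class LinkWatchdog:
    """Tracks when the last joystick packet arrived (hypothetical helper)."""

    def __init__(self, timeout_ms: int) -> None:
        self.timeout_ms = timeout_ms
        self.last_packet_ms = 0

    def packet_received(self, now_ms: int) -> None:
        self.last_packet_ms = now_ms

    def link_alive(self, now_ms: int) -> bool:
        return now_ms - self.last_packet_ms <= self.timeout_ms


def motor_power(requested: int, watchdog: LinkWatchdog, now_ms: int) -> int:
    """Pass the driver's request through only while packets keep arriving."""
    return requested if watchdog.link_alive(now_ms) else 0


watchdog = LinkWatchdog(timeout_ms=500)    # timeout value is an assumption
watchdog.packet_received(now_ms=1000)
print(motor_power(80, watchdog, now_ms=1200))  # prints 80, link still fresh
print(motor_power(80, watchdog, now_ms=1600))  # prints 0, link considered lost
```

The point is that the motors fall back to zero by themselves, without needing a “stop” message that can never arrive once the link is gone.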

How is “move a specified distance” implemented in RobotC?

In LabVIEW, the VI disables the MC’s 2.5 sec timeout and then issues a “run to position” command to the MC. LV then sits in a blocked loop waiting for a “done” from the MC. LV uses its own 10 second timer to abort if the target is not reached.

This strategy has many downsides: the NXT will not respond to a disable, and even if the program is halted, the MC will continue to run until it reaches its target.

You need a separate loop checking enable status to be able to break the motion.
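As a rough illustration (Python, simulated, not the actual LabVIEW VI), a motion loop that polls the enable status on every pass might look like this. `run_to_position`, `is_enabled` and the tick-based timing are hypothetical, and the motor is modeled as moving one encoder count per tick.

```python
from typing import Callable, Tuple


def run_to_position(target: int,
                    is_enabled: Callable[[int], bool],
                    timeout_ticks: int) -> Tuple[str, int]:
    """Drive toward `target`, checking the enable status on every tick.

    Returns (reason, final_position), where reason is "done", "disabled"
    or "timeout". In a real robot each early return would also have to
    command the motor controller to stop.
    """
    position = 0
    for tick in range(timeout_ticks):
        if not is_enabled(tick):
            return ("disabled", position)   # break the motion immediately
        if position == target:
            return ("done", position)
        position += 1                       # simulated one count per tick
    return ("timeout", position)


print(run_to_position(5, lambda tick: True, 100))        # ('done', 5)
print(run_to_position(50, lambda tick: tick < 10, 100))  # ('disabled', 10)
print(run_to_position(50, lambda tick: True, 20))        # ('timeout', 20)
```

Compared with a blocked wait for “done”, the disable case is handled in the same loop as the timeout, so the move can be interrupted partway.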

As for the “connected status”, this must have a longish timeout. I can turn off my NXT and the status shows Connected until I turn it back on again.