FMS Differences with motor safety?

I’m trying to puzzle out why our robot failed to operate under FMS control today with our vision code active. We can run it tethered, and previously ran it via Wifi, without any sort of failures or odd problems. But put it on the field, under FMS, and it crashes. (We turned vision off once, and it ran fine that time).

We did find one problem: a tight motor safety value (0.1) coupled with a misplaced Wait(0.04) meant that we could (rarely) generate a motor safety warning.
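
To make that concrete, the shape of the problem is roughly this (a simplified sketch with made-up channel numbers, not our actual code):

    #include "WPILib.h"

    // Sketch only: a RobotDrive with a 0.1 s safety expiration, plus a teleop
    // loop that can occasionally take longer than that between drive updates.
    class SafetySketch : public SimpleRobot
    {
        RobotDrive drive;
        Joystick stick;
    public:
        SafetySketch() : drive(1, 2), stick(1)
        {
            drive.SetSafetyEnabled(true);
            drive.SetExpiration(0.1);      // 100 ms motor safety timeout
        }

        void OperatorControl()
        {
            while (IsOperatorControl() && IsEnabled())
            {
                drive.ArcadeDrive(stick);  // each call resets the drive's safety timer
                Wait(0.04);                // stands in for the misplaced wait in our shooting code
                Wait(0.005);
                // If the work between ArcadeDrive() calls ever adds up to more than
                // 0.1 s, the motor safety watchdog fires a warning and stops the drive.
            }
        }
    };

    START_ROBOT_CLASS(SafetySketch);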

And in two of our crashes today, our robot was dead with a string of motor safety warnings.

When we run in the pit, if you get a motor safety warning, your robot skips a beat, but once you go back to updating the motors, it goes right back to working.

Does FMS control imply a different error mode? In other words, if you miss one motor safety check, does your 'bot get shot down?

Right now we’re trying to decide whether to run without our nifty auto target code (which requires vision), or to gamble with a motor safety value of 0.5 and risk losing our first match because of it.

Anyone have any experience with differences between pit and FMS?

Cheers,

Jeremy

It sounds like you were already operating on the razor's edge of updating the motors often enough (“a tight motor safety value (0.1) coupled with a misplaced Wait(0.04) meant that we could (rarely) generate a motor safety warning”). Operation on the field is likely adding additional delay/jitter to your control packets, which may be causing an issue.

Maybe try turning the vision code on and off based on a button. Or only processing an image (or however many you need to target) when a button is pressed.

Your robot will not get shut down for the full match because you failed to update one motor in time. The timer counts from the last Set() call on the motor object. The motor safety watchdog is implemented solely on the cRIO and depends on how often your code updates the motor.

The other watchdog on the cRIO is dependent on how often packets are received from the Driver Station. If packets take too long to reach the robot, the robot will switch to disabled mode until it begins receiving packets again. I'm not 100% sure, but I think the motor safety timer will still count in disabled mode. If you are lagging on the field, the robot being disabled could be causing your motor timeout messages.
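
In rough code terms, the safety helper just wants something to set each motor's output at least once per expiration period. A minimal sketch, reusing your 0.1 s number and hypothetical object names (not your code):

    // drive, stick, shooter, and shooterSpeed stand in for whatever objects the
    // robot actually uses; the expiration is the 0.1 s you mentioned.
    while (IsOperatorControl() && IsEnabled())
    {
        drive.ArcadeDrive(stick);   // setting the outputs feeds the drive's safety helper
        shooter.Set(shooterSpeed);  // same for the shooter, if its safety is enabled
        Wait(0.02);                 // ~50 Hz, comfortably inside a 0.1 s expiration
    }
    // As noted above, the safety timers may keep counting even while the robot
    // sits in disabled mode waiting for packets to resume.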

My other thought is that your vision code may be taking too much time to execute. Are you doing your vision processing on the cRIO in the same thread as your motor code?

It might also be beneficial to have NetConsole open while you are on the field and watch the output from the cRIO to see why the code is crashing.

Yeah, the drivers were supposed to run with netconsole on, but it was among a variety of things that didn’t get done :-/. That’ll be something we hopefully remember tomorrow.

The vision code is all in its own thread, and only executes four times per second; the FRC log viewer doesn't show extreme CPU usage (in fact, CPU usage looks quite reasonable throughout).
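
For reference, the vision thread's loop looks roughly like this (paraphrased from memory; ProcessTargets() is a stand-in for our actual detection code, and the camera calls are approximate):

    // Runs in its own task, separate from our main loop.
    void VisionLoop()
    {
        AxisCamera &camera = AxisCamera::GetInstance();
        while (true)
        {
            HSLImage *image = camera.GetImage();  // grab the latest frame
            if (image != NULL)
            {
                ProcessTargets(image);            // find targets, publish via NetworkTables
                delete image;
            }
            Wait(0.25);                           // roughly 4 iterations per second
        }
    }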

Again, it’s puzzling - this all works fine in the pit. And while we’re arguably ‘wrong’ on the 0.1/0.04 thing, we have to work really hard to get it to trip just once in the pit, and from what you’ve said (and my reading of WPILib source code), that wouldn’t explain what we’re seeing.

What's tough is that at this point, the safe thing to do is just run without vision. Risking a zero-point match during quals is just too high a risk. It's really a shame we have such limited access to the 'real' conditions - it really makes this hard.

Thanks for the suggestions; others gratefully taken. Puzzlers like this are frustrating!

Cheers,

Jeremy

The NetConsole port isn't on the list of ports passed through by the field, so I don't think NetConsole will work on the field.

What does the log viewer show as your trip time and missed packets? Can you post the logs? 0.04 is two missed packets at 50 Hz.

Have you tried running the robot in the pit using practice mode? If you’re only testing teleop in the pit, you won’t be affected if your robot has a bug in autonomous, while the field will go through both states.

Also, while I’m not sure what difference this would make, the DS won’t necessarily connect to the robot immediately after boot. You need to wait for the radio to boot and the field to be configured for your team #. To emulate this in the pit, turn on the robot with the DS unplugged, wait for your camera and cRIO to boot, then connect the DS and run in practice mode.

No missed packets in the log at the point of failure (there are some, but 30 seconds earlier). Trip times are 3 ms, so quite good as well.

But if we were disabled for a network drop out, wouldn’t the log viewer show that failure first, and then the motor safety failure?

Yeah, we spent an hour running it in practice mode, and we'd run it through practice mode several times previously. We forcibly sent more NetworkTables updates (we have two things: the vision code, and then we use NetworkTables to send the target info up to a SmartDashboard Java plugin). We built fake vision targets to see if we could swamp the vision processing code (that was how we managed to push the loop time past the 0.1 s timeout, to about 0.12 s, and get a motor safety warning to fire... once). We desperately tried to get it to fail off the field and couldn't :-(.

Cheers,

Jeremy

Attached are the logs; forgive me if they are not in the right format.

fail1 and fail3 are similar failures; they show the motor timeout error. There was some uncertainty about fail2: the programmer thought the driver unplugged the joystick, the driver claimed otherwise. good1 is a run without any vision or NetworkTables code (and there were no issues, hence the title 'good1').

The vision errors at start up are, so far as I know, ‘normal’; we always get them during vision acquisition processing.

Cheers,

Jeremy

logs2823.zip (67 KB)

The first log shows a number of vision errors at the very beginning; presumably the camera has not finished booting, and in the timeout case the image processing fails because the image isn't fully initialized.

About 38 seconds into the teleop period, the code doesn't crash, but it does start producing safety errors every 100 ms. I'm not as familiar with the C++ implementation of the safety system, so I can't tell whether the errors are sent repeatedly when the motors are never updated. Based on the pattern of the CPU, the image processing thread was operating normally, but it appears that the thread that calls the RobotDrive or motor functions for driving may have crashed, hung, or gone down a conditional code path where it fails to ever update the motors.

Log 2 does contain an error message at 17 seconds into teleop indicating that a joystick disappeared or started returning errors. Typically this is due to being unplugged. Make sure the drivers know that if you plug in a joystick during a match, you must press F1 for it to be recognized. While the robot is being driven, it isn't possible to poll for joysticks automatically, as this would interfere with driving.

Log 3 shows a similar stream of safety error messages, but this time starting at 28 seconds into teleop. Again, based on CPU pattern, it seems that vision is still running and processing images.

Log 4 indicates that vision wasn’t used. Nothing interesting here.

I'm afraid there is nothing else in the logs to narrow down what is causing the issue other than the time at which the errors started. If the drivers or coach remember initiating a certain action on the robot, for example pressing button 7 on the joystick, then I'd look into that code, or do a test with vision enabled and go through all of the special modes, trying to exercise all code paths.

As for the FMS interaction, the FMS whitepaper goes into more detail, but the FMS never directly communicates with the robot. Your robot had great comms for the entirety of each match; I do not see any indication the robot was ever disabled except by the safety mechanism. Also, the tracing code, which produces one of the blue traces at the top of the Log File Viewer, indicates that the teleop packets were still being processed and that wherever that trace call is in your code, it was still being called at 20 ms intervals. You can zoom in and see that those traces are in fact many small dots, each indicating what packet was sent and which code traces were executed in response.

Greg McKaskle

Thanks for the detailed and careful analysis. Your conclusions match mine. Occam’s razor would suggest that our main loop code either crashes or hangs up at the failure mark of those two logs.

However, we could not reproduce this unless we were on the field. It may be the elements, it may be the subtle additional latency introduced by FMS; we really can't figure it out.

A Team 525 mentor told me that they ripped out all of the NetworkTables and SmartDashboard code they had on their C++ 'bot, because they experienced similar problems in Duluth. And, nicely, after we removed all of that code, we had no such issues today.

Steve Peterson has graciously offered to review our code and see if they can reproduce the issue at a less stressful time. And I have to say the field guys were very helpful and supportive. As a rookie mentor, I couldn’t have been happier with their response to our woes.

Sadly, we spent today ironing out the other problems that would have been nice to find yesterday, but hopefully tomorrow will be an even better day, and we'll get to play in elims :-)

Cheers,

Jeremy

Please identify the loop/thread that is calling the teleop trace function. It is clear from the logs that that thread was active, and it is pretty clear that the vision thread was active. This may help reconstruct what took place.

As for whether the NetworkTables communications or the field have anything to do with it, be sure to keep your eyes open to other things as well.

Greg McKaskle

Attached is our code, along with the patch that 'fixed' everything (by removing vision and NetworkTables). Nicely, we did go on to compete well, even putting up our best performance in a 106-point QF match (sadly, we lost the QF due to non-software glitches, but it was still a great weekend!).

I spent a bit more time looking at the code, and found one other interesting lead. We sent the current speed of our wheel (if it changes) using NetworkTables every 10 ms. That seems like the sort of thing that could trigger a problem.

So, the summary as I understand it is as follows:
We have 4 or 5 running threads: 2 for vision, 1 for the Driver Station (actively looping inside the Run() method of the DS code, hence the log updates), and a main thread running inside our OperatorConsole() method. We may also have a thread for our shooter wheel, which is a PID controller thread.

The vision and DS threads all appear to continue working the whole time; the log suggests that. At the failure point, the main thread appears to die. Probably as a side effect of that, the DS thread then goes on to trigger the motor safety code and shut down the motors.

There is one known bug: in a target rich environment, the vision code consumes enough CPU that a mistaken Wait(0.04) in the shooting code can cause us to exceed the 0.10 timeout and generate a motor safety error.

My initial guess (and hence this thread title) was that the FMS system would shut down our robot on detecting a motor safety error; that guess appears to be quite wrong.

Occam's razor suggests that the logical explanation is that there is a bug in our OperatorConsole() code that causes our main thread to crash. I cannot for the life of me find any such bug.

If I range further afield, my next hunch is that this line of code:
distanceTable->PutNumber("speed", shootEncoder.GetRate());
if called every 10 ms can lead to a crash of some kind.
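
For illustration, a throttled version of that update might look something like this (a sketch only; the change threshold and 100 ms minimum interval are arbitrary, fabs comes from <cmath>, and it sits inside the existing shooter loop):

    // Only publish the shooter speed when it has changed appreciably,
    // and no more than about ten times per second.
    static double lastSentRate = 0.0;
    static double lastSentTime = 0.0;

    double now = GetClock();                   // WPILib clock, in seconds
    double rate = shootEncoder.GetRate();
    if (fabs(rate - lastSentRate) > 1.0 && (now - lastSentTime) > 0.1)
    {
        distanceTable->PutNumber("speed", rate);
        lastSentRate = rate;
        lastSentTime = now;
    }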

Perhaps we couldn’t reproduce it because the FMS network conditions are subtly different. Perhaps we couldn’t reproduce it because we tended not to run the shooter wheel all the time during our testing (the put only happens if the shooter wheel is running). Perhaps it only happens if we’re doing the PutNumber calls and a call to setErrorData is happening in a different thread. (I don’t have the source code for setErrorData, so I can’t examine that possibility).

The only other faint leads I have are these. We did experience a bug such that if you first did a PutNumber of a variable, and then later did a PutBoolean of that same name, you would get a crash. We also had a superstition that doing too many PutNumbers in a row would lead to a crash (although we were never able to confirm that superstition, and it was muddled with the Boolean/Number crash and the mandatory C++ update that was supposed to fix all kinds of network table crashes).
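
In code terms, the type-change bug looked like this (the key name here is made up):

    // The pattern that crashed for us: the same key written as two different types.
    table->PutNumber("haveTarget", 0.0);    // key first created as a number...
    // ...then, later and elsewhere in the code...
    table->PutBoolean("haveTarget", true);  // ...written as a boolean -> crash

    // The workaround was to keep one type per key:
    table->PutBoolean("haveTarget", true);  // boolean everywhere, or
    table->PutNumber("targetCount", 1.0);   // use a separate numeric key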

At any rate, thanks for listening. And thanks to Google for archiving all of these thoughts, in the off chance that it might help some other poor soul <grin>.

Cheers,

Jeremy

team2823-code-2013.zip (25.3 KB)

Network Tables doesn’t really define what the outcome is supposed to be when you change datatypes for a given variable name. I wouldn’t expect it to crash, but it is almost certainly a bug when you do this and the various network table implementations are free to return whatever value they feel like.

I still fail to see how the FMS could provoke this. It is more likely that the shiny field or the heat of the match is contributing. Your logs show your lag, and it is not bouncing around or causing timing glitches.

If it was your job, or your programming test, to find this bug and explain the failure, what would you do?

Greg McKaskle

Sorry if I wasn't clear. I see no reason to think that FMS or the field are the problem; I think the problem is somewhere on the robot. However, all of our testing so far has only triggered the failure while on the field. So a forensic analysis tries to understand what possible difference there could be on the field that would trigger the problem. And I agree, the best suspect is the target-rich environment. But as you noticed, the camera threads all appear to stay alive, and the main loop code has very little overlap with camera information; it's hard to connect those dots. The difference in latency or bandwidth would appear to be an unlikely trigger (you have to believe that 2 ms versus 1 ms makes a difference), but I have no other theories, so I mention it.

I hope to run a test to see if generating error messages at the same time Network Tables traffic is being sent can cause a crash; that would test my first hunch. The other step I would take is to put wireshark on while it operates to see if that generates any further clues.

If I had infinite time, I'd take the whole WPILib stack and run it under a tool like Valgrind while also operating our robot, to see if I could shake free any bugs in that code. The NetworkTables code was already the culprit in a nasty set of crashes once this season, and it is not trusted by other mentors I've spoken with. It's a sufficiently complex body of code that simple inspection isn't enough to identify any problems.

I don't know enough about the VxWorks environment to know whether there is a Valgrind-like tool available on the robot.

Beyond that, though, without the ability to reproduce the problem, it becomes difficult to solve the issue.

Also, note that our robot is done and has no further need for this capability. I’m now just trying to solve the puzzle for my own edification, and for the possibility that it might help another team.

Cheers,

Jeremy

I think the output of the motor safety warnings is a symptom, not the problem. It is my belief that WPILib has some issues, I'm thinking in the NetworkTables implementation, where it will stall a thread, keeping it from being able to update the motor controls.

In our “lab” using a practice robot, I would notice that at times the robot would become unresponsive. A few seconds later, the motor safety helper message would start on the console. I could close the SmartDashboard application, an error would appear on NetConsole and immediately the robot would recover and the messages would stop. Without SmartDashboard there was no such problem.

At our last regional (Knoxville), it was STRONGLY SUGGESTED that we install the latest version of the Driver Station software on our laptop. As soon as we did so, we started having control issues and motor safety helper warnings, even with no SmartDashboard application running.

By this time, I had put #if/#endif pairs around all of our SmartDashboard code, and we decided to disable it at the competition. We also switched to another laptop still running the slightly older version of the Driver Station, as we were unable to get back on the practice field to test the changes with the new version.
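
The guards themselves are nothing fancy, just a compile-time switch around every dashboard call, roughly like this (variable names made up):

    // Set to 0 to strip all SmartDashboard traffic out of the competition build.
    #define USE_SMARTDASHBOARD 0

    #if USE_SMARTDASHBOARD
        SmartDashboard::PutNumber("shooterRPM", shooterRPM);
        SmartDashboard::PutBoolean("haveTarget", haveTarget);
    #endif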

The differences between the earlier 2013 and the latest 2013 DS releases are very minor. They relate to logging of robot data to your DS computer; none of it interacts with SmartDashboard.

As with the FMS interaction, I’m providing this info so that you can look for the issue where it is most likely occurring. I cannot guarantee anything 100%, but I can tell you that in my opinion, the new DS has nothing to do with your issue and neither does the FMS.

Greg McKaskle

Yeah, there seem to be a lot of suspicious reports involving the SmartDashboard; I heard from a team or two at our regional that had to remove it to get stable operation, and I recall seeing threads here that pointed suspicious fingers at it as well.

Cheers,

Jeremy

Just for completeness, we did run this test, and got no crashes. I also pulled a wireshark trace of our exaggerated test (which sent a lot of Network table updates), and didn’t find anything that jumped out at me. Ah well, color me stumped.

Cheers,

Jeremy