#1
28-03-2013, 21:49
jwhite
Registered User
FRC #2823
Team Role: Mentor
Join Date: Feb 2013
Rookie Year: 2013
Location: Saint Paul, MN
Posts: 69
FMS Differences with motor safety?

I'm trying to puzzle out why our robot failed to operate under FMS control today with our vision code active. We can run it tethered, and previously ran it via Wi-Fi, without any sort of failures or odd problems. But put it on the field, under FMS, and it crashes. (We turned vision off once, and it ran fine that time.)

We did find one problem: a tight motor safety value (0.1) coupled with a misplaced Wait(0.04) meant that we could (rarely) generate a motor safety warning.
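
To make the timing budget concrete, here is a minimal sketch of the arithmetic. It assumes the main loop otherwise runs once per 20 ms control packet (an assumption, not something measured on this robot), and the function names are hypothetical helpers, not WPILib calls.

```cpp
#include <iostream>

// Hypothetical helpers for illustration; none of these are WPILib functions.

// Worst-case gap between two motor updates if every delay in one loop pass
// lands back to back.
int WorstCaseGapMs(int loopPeriodMs, int extraWaitMs, int jitterMs) {
    return loopPeriodMs + extraWaitMs + jitterMs;
}

// Headroom before the safety timeout expires; negative means a trip.
int HeadroomMs(int timeoutMs, int gapMs) {
    return timeoutMs - gapMs;
}

// Prints one line of the budget for a given amount of extra delay.
// The 20 ms loop period and 40 ms wait are the assumed values from above.
void PrintBudget(int timeoutMs, int jitterMs) {
    const int gap = WorstCaseGapMs(20, 40, jitterMs);
    std::cout << "jitter " << jitterMs << " ms -> gap " << gap
              << " ms, headroom " << HeadroomMs(timeoutMs, gap) << " ms\n";
}
```

Under these assumptions a 0.1 s timeout leaves 40 ms of headroom with no jitter at all, so any additional stall of more than 40 ms in the same pass would trip the check.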

And in two of our crashes today, our robot was dead with a string of motor safety warnings.

When we run in the pit, if you get a motor safety warning, your robot skips a beat, but once you go back to refreshing, it goes right back to working.

Does FMS control imply a different error mode? In other words, if you miss one motor safety check, does your 'bot get shot down?

Right now we're trying to decide whether to run without our nifty auto target code (which requires vision), or to gamble with a motor safety value of 0.5 and risk losing our first match because of it.

Anyone have any experience with differences between pit and FMS?

Cheers,

Jeremy

#2
28-03-2013, 22:01
RufflesRidge
Registered User
no team
Join Date: Jan 2012
Location: USA
Posts: 989

It sounds like you were already operating on the razor's edge of updating the motors often enough ("a tight motor safety value (0.1) coupled with a misplaced Wait(0.04) meant that we could (rarely) generate a motor safety warning."). Operation on the field is likely adding delay or jitter to your control packets, which may be causing an issue.

Maybe try turning the vision code on and off based on a button. Or only processing an image (or however many you need to target) when a button is pressed.
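
One way the button idea could look, as a rough sketch only: `VisionGate` is a hypothetical class (not part of WPILib), and it assumes a C++11 compiler with `std::atomic`, which the 2013 cRIO toolchain may not provide. The driver loop reports the button state each pass, and the vision loop asks permission before processing each frame.

```cpp
#include <atomic>

// Hypothetical gate, not a WPILib class. A button press arms the gate for a
// fixed number of frames; the vision loop consumes them one at a time.
class VisionGate {
public:
    explicit VisionGate(int framesPerPress)
        : framesPerPress_(framesPerPress), framesLeft_(0), lastPressed_(false) {}

    // Call from the driver loop only. A rising edge on the button arms the gate;
    // holding the button down does not re-arm it.
    void Update(bool pressed) {
        if (pressed && !lastPressed_) {
            framesLeft_ = framesPerPress_;
        }
        lastPressed_ = pressed;
    }

    // Call from the vision loop. Returns true while frames remain and consumes one.
    bool ShouldProcess() {
        int left = framesLeft_.load();
        while (left > 0) {
            // On failure, compare_exchange_weak reloads `left` and we retry.
            if (framesLeft_.compare_exchange_weak(left, left - 1)) {
                return true;
            }
        }
        return false;
    }

private:
    const int framesPerPress_;
    std::atomic<int> framesLeft_;
    bool lastPressed_;  // touched only by the driver loop
};
```

With `VisionGate gate(2);`, the vision loop would process two images per press and otherwise skip its work.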

#3
28-03-2013, 22:41
Andrew Lobos
Registered User
FRC #0225 (TechFire)
Team Role: Mentor
Join Date: Feb 2011
Rookie Year: 2011
Location: Lancaster, PA
Posts: 61

Quote:
Originally Posted by jwhite View Post
Does FMS control imply a different error mode? In other words, if you miss one motor safety check, does your 'bot get shot down?
Your robot will not get shut down for the full match because you failed to update one motor in time. The timer counts from the last Set() call on the motor object. The motor safety watchdog is implemented solely on the cRIO and is dependent on how often your code updates the motor.
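
The behavior described here (a timer that restarts on every motor update) can be modeled off the robot. The sketch below is an illustration under stated assumptions, not WPILib's implementation: `SafetyTimerModel` and `CountTrips` are hypothetical names, times are in integer milliseconds, and the periodic check interval is a parameter chosen by the caller.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of a motor safety timer; not the WPILib class.
class SafetyTimerModel {
public:
    explicit SafetyTimerModel(long timeoutMs) : timeoutMs_(timeoutMs), lastFeedMs_(0) {}

    // What a motor update would do: restart the countdown.
    void Feed(long nowMs) { lastFeedMs_ = nowMs; }

    // True once more than timeoutMs has passed since the last update.
    bool IsExpired(long nowMs) const { return nowMs - lastFeedMs_ > timeoutMs_; }

private:
    long timeoutMs_;
    long lastFeedMs_;
};

// Replays a sorted list of update times and counts how many periodic checks
// find the timer expired.
int CountTrips(const std::vector<long>& feedTimesMs, long timeoutMs,
               long checkPeriodMs, long endMs) {
    SafetyTimerModel timer(timeoutMs);
    std::size_t next = 0;
    int trips = 0;
    for (long t = checkPeriodMs; t <= endMs; t += checkPeriodMs) {
        // Apply every update that happened up to this check.
        while (next < feedTimesMs.size() && feedTimesMs[next] <= t) {
            timer.Feed(feedTimesMs[next]);
            ++next;
        }
        if (timer.IsExpired(t)) {
            ++trips;
        }
    }
    return trips;
}
```

In this model a single late update produces one trip followed by recovery, while a loop that stops updating produces a trip at every later check.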

The other watchdog on the cRIO is dependent on how often packets are received from the driverstation. If packets take too long to reach the robot, the robot will switch to disabled mode until it begins receiving packets again. I'm not 100% sure, but I think the motor safety timer will still count in disabled mode. If you are lagging on the field, the robot being disabled could be causing your motor timeout messages.

My other thought is that your vision code may be taking too much time to execute. Are you doing your vision processing on the cRIO in the same thread as your motor code?

It might also be beneficial to open NetConsole while you are on the field and watch the output from the cRIO to see why the code is crashing.

#4
28-03-2013, 23:10
jwhite

Yeah, the drivers were supposed to run with netconsole on, but it was among a variety of things that didn't get done :-/. That'll be something we hopefully remember tomorrow.

The vision code is all in its own thread, and only executing 4 times / second; the FRC log viewer doesn't show extreme CPU usage (in fact, CPU usage looks quite reasonable right along).

Again, it's puzzling - this all works fine in the pit. And while we're arguably 'wrong' on the 0.1/0.04 thing, we have to work really hard to get it to trip just once in the pit, and from what you've said (and my reading of WPILib source code), that wouldn't explain what we're seeing.

What's tough is that at this point, the safe thing to do is just run without vision. Risking a 0 point match during quals is just too high risk. It's really a shame we have such limited access to the 'real' conditions - it really makes this hard.

Thanks for the suggestions; others gratefully taken. Puzzlers like this are frustrating!

Cheers,

Jeremy

#5
28-03-2013, 23:23
Joe Ross
Registered User
FRC #0330 (Beachbots)
Team Role: Engineer
Join Date: Jun 2001
Rookie Year: 1997
Location: Los Angeles, CA
Posts: 8,576

Quote:
Originally Posted by jwhite View Post
Yeah, the drivers were supposed to run with netconsole on, but it was among a variety of things that didn't get done :-/. That'll be something we hopefully remember tomorrow.
The netconsole port isn't on the list of ports passed by the field, so I don't think netconsole will work on the field.

Quote:
Originally Posted by jwhite View Post
The vision code is all in its own thread, and only executing 4 times / second; the FRC log viewer doesn't show extreme CPU usage (in fact, CPU usage looks quite reasonable right along).

Again, it's puzzling - this all works fine in the pit. And while we're arguably 'wrong' on the 0.1/0.04 thing, we have to work really hard to get it to trip just once in the pit, and from what you've said (and my reading of WPILib source code), that wouldn't explain what we're seeing.
What does the log viewer show as your trip time and missed packets? Can you post the logs? 0.04 is two missed packets at 50 Hz.

#6
28-03-2013, 23:30
Andrew Lobos

Have you tried running the robot in the pit using practice mode? If you're only testing teleop in the pit, you won't be affected if your robot has a bug in autonomous, while the field will go through both states.

Also, while I'm not sure what difference this would make, the DS won't necessarily connect to the robot immediately after boot. You need to wait for the radio to boot and the field to be configured for your team #. To emulate this in the pit, turn on the robot with the DS unplugged, wait for your camera and cRIO to boot, then connect the DS and run in practice mode.

#7
29-03-2013, 00:24
jwhite

Quote:
Originally Posted by Joe Ross View Post
The netconsole port isn't on the list of ports passed by the field, so I don't think netconsole will work on the field.

Ah, rats.


Quote:
Originally Posted by Joe Ross View Post
What does the log viewer show as your trip time and missed packets? Can you post the logs? 0.04 is two missed packets at 50 Hz.
No missed packets in the log at the point of failure (some, but 30 seconds previous). Trip times are 3 ms, so quite good as well.

But if we were disabled for a network drop out, wouldn't the log viewer show that failure first, and then the motor safety failure?

#8
29-03-2013, 00:29
jwhite

Quote:
Originally Posted by 4ndr3wl View Post
Have you tried running the robot in the pit using practice mode? If you're only testing teleop in the pit, you won't be affected if your robot has a bug in autonomous, while the field will go through both states.

Also, while I'm not sure what difference this would make, the DS won't necessarily connect to the robot immediately after boot. You need to wait for the radio to boot and the field to be configured for your team #. To emulate this in the pit, turn on the robot with the DS unplugged, wait for your camera and cRIO to boot, then connect the DS and run in practice mode.
Yeah, we spent an hour running it in practice mode, and we'd run it through practice mode several times previously. We forcibly sent more network table updates (we have two things: vision code, and then we use network tables to send that target info up to a smart dashboard Java plugin). We built fake vision targets to see if we could swamp the vision processing code (that was how we managed to push past the timeout to 0.12 and get a motor safety error to fire... once). We desperately tried to get it to fail off the field and couldn't :-(.

Cheers,

Jeremy

#9
29-03-2013, 01:35
jwhite

Attached are the logs; forgive me if they are not in the right format.

fail1 and fail3 are similar failures; they show the motor timeout error. fail2 there was some uncertainty on; the programmer thought the driver unplugged the joystick, the driver claimed otherwise. good1 is a run without any vision or network table code (and there were no issues, hence the title 'good1').

The vision errors at start up are, so far as I know, 'normal'; we always get them during vision acquisition processing.

Cheers,

Jeremy
Attached Files
File Type: zip logs2823.zip (67.0 KB, 10 views)

#10
29-03-2013, 10:02
Greg McKaskle
Registered User
FRC #2468 (Team NI & Appreciate)
Join Date: Apr 2008
Rookie Year: 2008
Location: Austin, TX
Posts: 4,752

The first log shows a number of vision errors at the very beginning, presumably because the camera has not finished booting; in the timeout case, the image processing fails because the image isn't fully initialized.

About 38 seconds into the teleop period, the code doesn't crash, but it does start producing Safety errors every 100 ms. I'm not as familiar with the C++ implementation of the safety system, so I can't tell whether the errors would be sent repeatedly if the motors are never updated. Based on the pattern of the CPU usage, the image processing thread was operating normally, but it appears that the thread that calls RobotDrive or Motor functions for drive may have crashed, hung, or gone through a conditional code path where it never updated the motors.

Log 2 does contain an error message at 17 seconds into teleop indicating that a joystick disappeared or started returning errors. Typically this is due to being unplugged. Make sure the drivers know that if you plug in a joystick during a match, you must press F1 for it to be recognized. While driving the robot, it isn't possible to poll for joysticks automatically or this would interfere with driving.

Log 3 shows a similar stream of safety error messages, but this time starting at 28 seconds into teleop. Again, based on CPU pattern, it seems that vision is still running and processing images.

Log 4 indicates that vision wasn't used. Nothing interesting here.

I'm afraid there is nothing else in the logs to narrow what is causing the issues other than the time at which the error started. If the drivers or coach remember initiating a certain action on the robot -- for example, when they hit button 7 on the joystick, then I'd look into that code or do a test with vision enabled and go through all of the special modes trying to exercise all code paths.

As for the FMS interaction, the FMS whitepaper goes into more detail, but the FMS never directly communicates with the robot. Your robot has great comms the entirety of each match, I do not see any indication the robot was ever disabled except by the safety mechanism. Also, the tracing code, which produces one of the blue traces at the top of the Log file viewer indicates that the teleop packets were still being processed and wherever that trace call is in your code was still being called at 20ms intervals. You can zoom in and see that those traces are in fact many small dots, each indicating what packet was sent and which code traces were executed in response.

Greg McKaskle

#11
29-03-2013, 23:23
jwhite

Quote:
Originally Posted by Greg McKaskle View Post
As for the FMS interaction, the FMS whitepaper goes into more detail, but the FMS never directly communicates with the robot. Your robot has great comms the entirety of each match, I do not see any indication the robot was ever disabled except by the safety mechanism. Also, the tracing code, which produces one of the blue traces at the top of the Log file viewer indicates that the teleop packets were still being processed and wherever that trace call is in your code was still being called at 20ms intervals. You can zoom in and see that those traces are in fact many small dots, each indicating what packet was sent and which code traces were executed in response.

Greg McKaskle
Thanks for the detailed and careful analysis. Your conclusions match mine. Occam's razor would suggest that our main loop code either crashes or hangs up at the failure mark of those two logs.

However, we could not reproduce this unless we were on the field. It may be the elements, it may be the subtle additional latency introduced by FMS; we really can't figure it out.

A 525 mentor told me that they ripped out all of the network table and smart dashboard code they had on their C++ 'bot, because they experienced similar problems in Duluth. And, nicely, after we removed all of that code, we had no such issues today.

Steve Peterson has graciously offered to review our code and see if they can reproduce the issue at a less stressful time. And I have to say the field guys were very helpful and supportive. As a rookie mentor, I couldn't have been happier with their response to our woes.

Sadly, we spent today ironing out the other problems that would have been nice to find yesterday, but hopefully tomorrow will be an even better day, and we'll get to play in elims :-)

Cheers,

Jeremy

#12
30-03-2013, 17:45
Greg McKaskle

Please identify the loop/thread that is calling the teleop trace function. It is clear from the logs that thread was active. It is pretty clear that the vision thread was active. This may help reconstruct what took place.
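
A simple way to tell which loops are still alive is a per-thread heartbeat counter that a monitor samples periodically. The following is only a sketch: `Heartbeat` and `FindStalled` are hypothetical names rather than WPILib facilities, and the code assumes C++11 `std::atomic`, which may not be available on the 2013 cRIO toolchain.

```cpp
#include <atomic>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical heartbeat: the owning thread calls Beat() once per loop pass,
// and a single monitor thread calls StalledSinceLastCheck().
class Heartbeat {
public:
    explicit Heartbeat(const std::string& name)
        : name_(name), count_(0), lastSeen_(0) {}

    void Beat() { ++count_; }

    // True if the counter has not moved since the previous check.
    bool StalledSinceLastCheck() {
        const unsigned long now = count_.load();
        const bool stalled = (now == lastSeen_);
        lastSeen_ = now;
        return stalled;
    }

    const std::string& Name() const { return name_; }

private:
    std::string name_;
    std::atomic<unsigned long> count_;
    unsigned long lastSeen_;  // touched only by the monitor
};

// Returns the names of every loop that made no progress since the last call.
std::vector<std::string> FindStalled(const std::vector<Heartbeat*>& beats) {
    std::vector<std::string> stalled;
    for (std::size_t i = 0; i < beats.size(); ++i) {
        if (beats[i]->StalledSinceLastCheck()) {
            stalled.push_back(beats[i]->Name());
        }
    }
    return stalled;
}
```

Logging the result of `FindStalled` from a separate low-rate loop would record which loop stopped first, which the existing trace alone cannot show.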

As for whether the Network Table communications or the field has anything to do with it, be sure to keep your eyes open to other things as well.

Greg McKaskle

#13
01-04-2013, 08:50
jwhite

Attached is our code, along with the patch that 'fixed' everything (by removing Vision and network tables). Nicely, we did go on to compete well, even putting up our best performance in a 106 point QF match (sadly, we lost the QF due to non software glitches, but it was still a great weekend!).

I spent a bit more time looking at the code, and found one other interesting lead. We sent the current speed of our wheel (if it changes) using NetworkTables every 10 ms. That seems like the sort of thing that could trigger a problem.

So, the summary as I understand it is as follows:
We have 4 or 5 running threads: 2 for vision, 1 for the Driver Station (actively looping inside the Run() method of the DS code - hence the log updates), and a main thread, running inside our OperatorConsole() method. We may also have a thread for our shooter wheel, which is a pid controller thread.

The vision and DS threads all appear to continue working the whole time; the log suggests that. At the failure point, the main thread appears to die. Probably as a side effect of that, the DS thread then goes on to trigger the motor safety code and shut down the motors.

There is one known bug: in a target rich environment, the vision code consumes enough CPU that a mistaken Wait(0.04) in the shooting code can cause us to exceed the 0.10 timeout and generate a motor safety error.

My initial guess (and hence this thread title) was that the FMS system would shut down our robot on detecting a motor safety error; that guess appears to be quite wrong.

Occam's razor suggests that the logical explanation is that there is a bug in our OperatorConsole() code that causes our main thread to crash. I cannot for the life of me find any such bug.

If I range further afield, my next hunch is that this line of code:
distanceTable->PutNumber("speed",shootEncoder.GetRate());
if called every 10 ms can lead to a crash of some kind.
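
Whether the 10 ms publish rate is actually the cause is unconfirmed, but the traffic is easy to reduce. Here is a minimal sketch of a throttle that lets a value through only when enough time has passed and the value has moved by more than a deadband. `PublishThrottle` is a hypothetical helper, not part of NetworkTables; the caller would still make the real PutNumber call when it returns true.

```cpp
#include <cmath>

// Hypothetical rate limiter for dashboard updates; not a NetworkTables class.
class PublishThrottle {
public:
    PublishThrottle(long minIntervalMs, double deadband)
        : minIntervalMs_(minIntervalMs), deadband_(deadband),
          hasPublished_(false), lastMs_(0), lastValue_(0.0) {}

    // Returns true when the caller should send `value` now, and records it as sent.
    bool ShouldPublish(long nowMs, double value) {
        if (hasPublished_) {
            if (nowMs - lastMs_ < minIntervalMs_) {
                return false;  // too soon after the previous send
            }
            if (std::fabs(value - lastValue_) <= deadband_) {
                return false;  // change too small to be worth sending
            }
        }
        hasPublished_ = true;
        lastMs_ = nowMs;
        lastValue_ = value;
        return true;
    }

private:
    long minIntervalMs_;
    double deadband_;
    bool hasPublished_;
    long lastMs_;
    double lastValue_;
};
```

With `PublishThrottle throttle(100, 0.5);`, a loop running every 10 ms would send the shooter speed at most ten times per second instead of up to a hundred.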

Perhaps we couldn't reproduce it because the FMS network conditions are subtly different. Perhaps we couldn't reproduce it because we tended not to run the shooter wheel all the time during our testing (the put only happens if the shooter wheel is running). Perhaps it only happens if we're doing the PutNumber calls *and* a call to setErrorData is happening in a different thread. (I don't have the source code for setErrorData, so I can't examine that possibility).

The only other faint leads I have are these. We did experience a bug such that if you first did a PutNumber of a variable, and then later did a PutBoolean of that same name, you would get a crash. We also had a superstition that doing too many PutNumbers in a row would lead to a crash (although we were never able to confirm that superstition, and it was muddled with the Boolean/Number crash and the mandatory C++ update that was supposed to fix all kinds of network table crashes).
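
The Number-then-Boolean crash could be guarded against in robot code by remembering the type first used for each key and refusing mismatched writes. This is a sketch under assumptions: `KeyTypeRegistry` and `ValueType` are hypothetical names, nothing here is a NetworkTables API, and the class is not thread-safe, so it would need a lock if several threads publish.

```cpp
#include <map>
#include <string>

// Hypothetical value categories for dashboard keys.
enum ValueType { kNumber, kBoolean, kString };

// Remembers the first type used for each key and rejects later mismatches.
class KeyTypeRegistry {
public:
    // Returns true if `key` is new or was previously used with the same type.
    bool Allow(const std::string& key, ValueType type) {
        std::map<std::string, ValueType>::iterator it = types_.find(key);
        if (it == types_.end()) {
            types_[key] = type;
            return true;
        }
        return it->second == type;
    }

private:
    std::map<std::string, ValueType> types_;
};
```

A wrapper around the real put calls could check `Allow("speed", kNumber)` first and log a message instead of writing when it returns false.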

At any rate, thanks for listening. And thanks to Google for archiving all of these thoughts, in the off chance that it might help some other poor soul <grin>.

Cheers,

Jeremy
Attached Files
File Type: zip team2823-code-2013.zip (25.3 KB, 8 views)

#14
02-04-2013, 07:43
Greg McKaskle

Network Tables doesn't really define what the outcome is supposed to be when you change datatypes for a given variable name. I wouldn't expect it to crash, but it is almost certainly a bug when you do this and the various network table implementations are free to return whatever value they feel like.

I still fail to see how the FMS could provoke this. It is more likely that the shiny field or heat of the match are contributing. Your logs show your lag and it is not bouncing around causing timing glitches.

If it was your job, or your programming test, to find this bug and explain the failure, what would you do?

Greg McKaskle

Last edited by Greg McKaskle : 02-04-2013 at 07:44. Reason: add test

#15
02-04-2013, 08:38
jwhite

Quote:
Originally Posted by Greg McKaskle View Post
I still fail to see how the FMS could provoke this. It is more likely that the shiny field or heat of the match are contributing. Your logs show your lag and it is not bouncing around causing timing glitches.

If it was your job, or your programming test, to find this bug and explain the failure, what would you do?
Greg McKaskle
Sorry if I wasn't clear. I see no reason to think that FMS or the field are the problem. I think the problem is somewhere on the robot. However, in all of our testing so far, the failure only triggers on the field. So a forensic analysis tries to understand what possible difference there could be on the field that would trigger a problem. And I agree, the best suspect is the target rich environment. But as you noticed, the camera threads all appear to stay alive, and the main loop code has very little overlap with camera information. It's hard to connect those dots. The difference in latency or bandwidth would appear to be an unlikely trigger (you have to imagine that 2 ms versus 1 ms makes a difference), but I have no other theories, so I mention it.

I hope to run a test to see if generating error messages at the same time Network Tables traffic is being sent can cause a crash; that would test my first hunch. The other step I would take is to put wireshark on while it operates to see if that generates any further clues.

If I had infinite time, I'd take the whole WPILib stack and run it under a tool like Valgrind while also operating our robot to see if I could shake free any bugs in that code. That Network tables code was already the culprit in a nasty set of crashes once this season, and it is not trusted by other mentors I've spoken with. It's a sufficiently complex body of code that simple inspection isn't enough to identify any problems.

I don't know enough about the VxWorks environment to know if there is a Valgrind-like tool available on the robot.

Beyond that, though, without the ability to reproduce the problem, it becomes difficult to solve the issue.

Also, note that our robot is done and has no further need for this capability. I'm now just trying to solve the puzzle for my own edification, and for the possibility that it might help another team.

Cheers,

Jeremy