Chief Delphi

Chief Delphi (http://www.chiefdelphi.com/forums/index.php)
-   CAN (http://www.chiefdelphi.com/forums/forumdisplay.php?f=185)
-   -   Unexplained intermittent CAN / 2CAN Jaguar problems at GSR (http://www.chiefdelphi.com/forums/showthread.php?t=93338)

John Heden 08-03-2011 20:51

Re: Unexplained intermittent CAN / 2CAN Jaguar problems at GSR
 
I would like to thank everybody for their thoughts on our CAN issue and all the suggestions you have offered. Our team crafted a custom dashboard that saves all of our data to disk, including any UDP NetConsole data that it happens to catch, and I’ve started to work my way through this information looking for possible clues. The freeze in our first quarterfinal match did yield some interesting results that I’m still trying to fully understand, but it certainly looks like a complete loss of CAN integrity through some mechanism. The following is part of the recorded error sequence (it goes on quite a bit longer), which shows a total of 16 InitCANJaguar() calls. The only call I can find to InitCANJaguar(), however, is in the constructor for CANJaguar(), so I am a bit perplexed at this point given that we have only 5 CANJaguars. After this initialization-like sequence there is a litany of getTransaction() errors before we eventually do a reset and regain control.

At this point I’m partial to a startup race condition theory of some type. I would also add that we launch a status monitoring thread, which reads information from the CANJaguars, at the end of our robot constructor, well before the autonomous loop is initiated. This seems to work, but I wonder if the 2CAN occasionally needs a bit more time to settle down before it is called upon for status and CAN transactions…
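
One way to test the "needs more time to settle" idea is to hold back the status monitoring thread until the bus has answered a few probes in a row. The following is only a sketch under assumptions: `WaitForBusReady` and the `probe` callback are hypothetical names, not WPILib or 2CAN API, and what counts as a successful probe (for example, a status read that returns without an error) is left to the caller.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not part of WPILib): polls `probe` until it has
// succeeded `required` times in a row, or until `maxAttempts` probes have
// been made. Returns true if the bus looked stable, false otherwise.
bool WaitForBusReady(const std::function<bool()>& probe,
                     int required,
                     int maxAttempts,
                     std::chrono::milliseconds delay) {
    assert(required > 0 && maxAttempts >= required);
    int streak = 0;  // consecutive successful probes so far
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        streak = probe() ? streak + 1 : 0;  // any failure resets the streak
        if (streak >= required) return true;
        std::this_thread::sleep_for(delay);
    }
    return false;
}
```

The robot constructor would call this before launching the monitoring thread and log a warning if it returns false. The streak length and delay are guesses to be tuned on a real robot.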

Thanks again,

John

<Code>-44087 ERROR: status == -44087 (0xFFFF53C9) in getTransaction() in C:/windriver/workspace/WPILib/CANJaguar.cpp at line 425

<Code>-63194 ERROR: status == -63194 (0xFFFF0926) in InitCANJaguar() in C:/windriver/workspace/WPILib/CANJaguar.cpp at line 47

<Code>-44087 ERROR: status == -44087 (0xFFFF53C9) in getTransaction() in C:/windriver/workspace/WPILib/CANJaguar.cpp at line 425

<Code>-63194 ERROR: status == -63194 (0xFFFF0926) in InitCANJaguar() in C:/windriver/workspace/WPILib/CANJaguar.cpp at line 47

<Code>-44087 ERROR: status == -44087 (0xFFFF53C9) in getTransaction() in C:/windriver/workspace/WPILib/CANJaguar.cpp at line 425

<Code>-44087 ERROR: status == -44087 (0xFFFF53C9) in setTransaction() in C:/windriver/workspace/WPILib/CANJaguar.cpp at line 392

<Code>-44087 ERROR: status == -44087 (0xFFFF53C9) in setTransaction() in C:/windriver/workspace/WPILib/CANJaguar.cpp at line 392

<Code>-44087 ERROR: status == -44087 (0xFFFF53C9) in getTransaction() in C:/windriver/workspace/WPILib/CANJaguar.cpp at line 425

<Code>-63194 ERROR: status == -63194 (0xFFFF0926) in InitCANJaguar() in C:/windriver/workspace/WPILib/CANJaguar.cpp at line 47
Etc. etc. etc...
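
For anyone working through a similar NetConsole capture, a small tally of errors per reporting function makes counts like "16 InitCANJaguar() calls" easy to reproduce. This is a minimal sketch that assumes every line has the same shape as the excerpt above; the struct and function names are made up for illustration. It also checks that the hex field is just the 32-bit two's complement of the signed status (for example, -44087 and 0xFFFF53C9 are the same value).

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// One parsed NetConsole error line (field names are illustrative).
struct CanError {
    std::int32_t status;   // signed status code, e.g. -44087
    std::uint32_t hex;     // the hex value printed next to it
    std::string function;  // e.g. "getTransaction"
};

// Parses a line shaped like the ones in the post. Returns false if the
// line does not match that shape.
bool ParseCanError(const std::string& line, CanError& out) {
    const std::size_t start = line.find("ERROR: status == ");
    if (start == std::string::npos) return false;

    int status = 0;
    unsigned int hex = 0;
    char func[64] = {0};
    const int matched = std::sscanf(line.c_str() + start,
                                    "ERROR: status == %d (0x%x) in %63[^(]",
                                    &status, &hex, func);
    if (matched != 3) return false;

    out.status = status;
    out.hex = hex;
    out.function = func;
    return true;
}

// True when the hex field is the 32-bit two's complement of the status.
bool HexMatchesStatus(const CanError& e) {
    return static_cast<std::uint32_t>(e.status) == e.hex;
}

// Counts parsed errors per reporting function; unparseable lines are skipped.
std::map<std::string, int> TallyByFunction(const std::vector<std::string>& lines) {
    std::map<std::string, int> counts;
    CanError e;
    for (const std::string& line : lines) {
        if (ParseCanError(line, e)) ++counts[e.function];
    }
    return counts;
}
```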

kamocat 08-03-2011 21:08

Re: Unexplained intermittent CAN / 2CAN Jaguar problems at GSR
 
Quote:

Originally Posted by Mark McLeod (Post 1036579)
Not quite true, although in principle I agree.

On the practice field teams must unplug their DLink and replace it with the Practice field DLink.

Are you saying that the D-link will broadcast a network, even in bridge mode?

Joe:
Was that issue with starving the Ack worked out? I haven't retested it.

Ken Streeter 08-03-2011 22:23

Re: Unexplained intermittent CAN / 2CAN Jaguar problems at GSR
 
Quote:

Originally Posted by Mark McLeod (Post 1036579)
Not quite true, although in principle I agree.

On the practice field teams must unplug their DLink and replace it with the Practice field DLink.

There is one other time that it is important for teams to unplug the cRIO from the DLink -- if you are in the finals and are tethering the robot to prepare it for the next match of the finals (for example, performing a system check or compressing air to have full tanks for the next match).

We ran into this problem at the Week Zero Scrimmage, so we were ready for the problem during the GSR finals. During the finals, the same teams are on the field in consecutive matches, so the field access point is still configured to communicate with the teams that were just on the field. Accordingly, as soon as the Driver Station is connected to the DLink, the DS enters the "FMS Connected" mode, forcing the robot into a disabled state and prohibiting "tethered" control.

If you find yourself in the final matches and need to tether the robot in between matches to add air to the tanks or perform any system checks, you'll want to connect the DS directly to the cRIO without going through the DLink, in order to avoid the FMS control.

jhersh 09-03-2011 03:45

Re: Unexplained intermittent CAN / 2CAN Jaguar problems at GSR
 
Quote:

Originally Posted by kamocat (Post 1036743)
Joe:
Was that issue with starving the Ack worked out? I haven't retested it.

Remind me what issue that was? Are you referring to the starved token resynchronization? If so, then yes, the v28 image (and several before that) include the workaround that restarts the token synchronously with the sendMessage call in which the token stream is detected to be expired.

-Joe

Mark McLeod 09-03-2011 07:30

Re: Unexplained intermittent CAN / 2CAN Jaguar problems at GSR
 
Quote:

Originally Posted by Ken Streeter (Post 1036794)
There is one other time that it is important for teams to unplug the cRIO from the DLink -- if you are in the finals and are tethering the robot to prepare it for the next match of the finals (for example, performing a system check or compressing air to have full tanks for the next match).

That makes sense.
The field staff must leave the previous match teams up in FMS until the scores and penalties have been debated and submitted. I sometimes borrow one of the unoccupied player stations to test laptop link-ups in those moments (setting the laptop to one of the absent team #s).

Mike Copioli 09-03-2011 12:10

Re: Unexplained intermittent CAN / 2CAN Jaguar problems at GSR
 
Quote:

Originally Posted by jhersh (Post 1036572)

I'm not sure how this is handled in the 2CAN. Mike or Omar, can you please comment?

I will give you a preliminary answer until Omar has a chance to provide a more detailed one. In short, the problem is caused by a lack of synchronization between the cRIO CAN transactions and the 2CAN dashboard transactions. This is a simplified explanation of the problem; it is actually a bit more involved, as Omar has explained it. Omar has written some management code that is intended to deal with this problem; however, the web dash's ability to interact with the CAN bus takes second chair to the user code. If the user code is sending CAN throttle requests too frequently, for example, the time the web dash has to interact with the bus is limited. This is not an issue with the Cross-link Control System because the 2CAN performs all synchronization and has more of a 'master' role. But again, this is Omar's area of expertise.
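
If user code really is sending throttle requests too often, one possible mitigation on the user side is to skip sends that carry no new information. The sketch below is an assumption-laden illustration, not anything from the 2CAN or WPILib code: `SetpointThrottle` is a hypothetical class, the periods are placeholders, and the periodic refresh exists only because a controller may expect to keep hearing from its host. Whether this actually gives the web dash enough bus time would have to be measured.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

// Hypothetical rate limiter for setpoint messages (not WPILib API).
// A value is sent when it is the first one, or when at least `minPeriod`
// has passed AND either the value moved by more than `deadband` or
// `refreshPeriod` has elapsed since the last send.
class SetpointThrottle {
public:
    using Clock = std::chrono::steady_clock;
    using Millis = std::chrono::milliseconds;

    SetpointThrottle(double deadband, Millis minPeriod, Millis refreshPeriod)
        : deadband_(deadband), minPeriod_(minPeriod), refreshPeriod_(refreshPeriod) {
        assert(deadband >= 0.0 && minPeriod <= refreshPeriod);
    }

    // Returns true if the caller should put `value` on the bus now.
    bool ShouldSend(double value, Clock::time_point now) {
        if (hasSent_) {
            const auto elapsed = now - lastTime_;
            if (elapsed < minPeriod_) return false;  // too soon after last send
            const bool changed = std::fabs(value - lastValue_) > deadband_;
            if (!changed && elapsed < refreshPeriod_) return false;  // nothing new
        }
        hasSent_ = true;
        lastValue_ = value;
        lastTime_ = now;
        return true;
    }

private:
    double deadband_;
    Millis minPeriod_;
    Millis refreshPeriod_;
    bool hasSent_ = false;
    double lastValue_ = 0.0;
    Clock::time_point lastTime_{};
};
```

In a control loop, the call that writes to the motor controller would be wrapped in `if (throttle.ShouldSend(value, SetpointThrottle::Clock::now()))`.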

jhersh 09-03-2011 12:35

Re: Unexplained intermittent CAN / 2CAN Jaguar problems at GSR
 
Quote:

Originally Posted by John Heden (Post 1036730)
At this point I’m partial to the startup race condition theory of some type.

Me too. I believe I've found and fixed the issue. We will be testing on a real field this evening and working on a plan for distributing the fix.

Please stay tuned.

-Joe

John Heden 09-03-2011 12:42

Re: Unexplained intermittent CAN / 2CAN Jaguar problems at GSR
 
It sounds like the experts are converging on the problem, and we remain hopeful for a robust CAN solution. For anybody following this CAN problem thread, there was an interesting statement in the March 8 (yesterday) Team Update #17:

Quote:

If a team is using a CAN network on the robot, they should check the messages in the “Diagnostic” tab of the Driver Station before a match starts to ensure that there aren’t any scrolling CAN timeouts. If there are such messages, give the MC a “thumbs down” to show you’re not ready and click on “Reboot Robot” to restart the cRIO and clear the errors. Teams will only see such timeout errors if it's properly handled in code, and they should take care to ensure that these exceptions are handled such that they can be seen on the field.
I’m going to try to convince our team that we should maintain our CAN implementation (not go back to PWM cables) but carefully monitor for this possible problem.

Thanks again,

john

Zme 09-03-2011 13:20

Re: Unexplained intermittent CAN / 2CAN Jaguar problems at GSR
 
Our team wrote some code so that the robot would reboot itself automatically if it did not have communication with the CAN bus on boot. Not the prettiest workaround, but it works for what we needed.
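
A reboot-on-boot-failure workaround like this benefits from a cap on retries, so a genuinely dead bus cannot trap the robot in an endless reboot loop. The post does not say how the team handled that, so the following is only a sketch of the decision logic. `DecideBootAction` is a hypothetical name, and it assumes the reboot count is kept somewhere that survives a reboot (how to store it is not covered here).

```cpp
#include <cassert>

// What the start-up code should do after probing the CAN bus.
enum class BootAction { Proceed, Reboot, GiveUp };

// Hypothetical decision helper: reboot while the bus is silent, but only
// up to `maxReboots` times. `rebootsSoFar` must come from storage that
// persists across reboots.
BootAction DecideBootAction(bool canResponding, int rebootsSoFar, int maxReboots) {
    assert(rebootsSoFar >= 0 && maxReboots >= 0);
    if (canResponding) return BootAction::Proceed;
    if (rebootsSoFar < maxReboots) return BootAction::Reboot;
    return BootAction::GiveUp;  // stop retrying and report the fault instead
}
```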

techhelpbb 09-03-2011 13:27

Re: Unexplained intermittent CAN / 2CAN Jaguar problems at GSR
 
Quote:

Originally Posted by jhersh (Post 1037056)
Me too. I believe I've found and fixed the issue. We will be testing on a real field this evening and working on a plan for distributing the fix.

Please stay tuned.

-Joe

Does this solution also address the problem when it doesn't happen at startup?

We get CAN issues even when the system manages to come fully online.

jhersh 09-03-2011 14:02

Re: Unexplained intermittent CAN / 2CAN Jaguar problems at GSR
 
Quote:

Originally Posted by techhelpbb (Post 1037082)
Does this solution also address the problem when it doesn't happen at startup?

We get CAN issues even when the system manages to come fully online.

No... this is a start-up issue only. Can you describe as much about your setup and the behavior you see?

Thanks,
-Joe

techhelpbb 09-03-2011 14:17

Re: Unexplained intermittent CAN / 2CAN Jaguar problems at GSR
 
Quote:

Originally Posted by jhersh (Post 1037098)
No... this is a start-up issue only. Can you describe as much about your setup and the behavior you see?

Thanks,
-Joe

We usually don't have connectivity issues at startup. We'll be driving and all of a sudden we'll lose one or more Jaguars off the bus with timeout or communications errors (I haven't looked at the debug information myself; I'm relying on student feedback for the specific details).

The odd part is that in many cases these Jaguars are performing similar tasks to other Jaguars in the system, so it's not something specific to their actuators.

The Jaguar will drop off the bus, it'll come back and won't respond to further adjustments until we soft boot it. Hence we detect when they fail like this and automatically force a soft boot at this point.

Our most specific issue with this has been on the drive system. We have 2 Jaguars per side connected to CIM motors in the CIMple gearboxes. We split the encoders, they are 100% isolated from each other, and we want to run PID to target a velocity setpoint (it works fine when we don't get timeouts). Even with potentiometers we've seen this (but those Jaguars are on high ratio, shallow pitch worm drives). It seems to happen less when we use CAN for Vbus and hence lose the external reference, but it still does happen.

Given that we'll be fine for protracted periods of time and then suddenly experience a timeout under hard driving conditions, I'm led to believe we have some sort of noise issue at work. Obviously, if there was a spike that reached logic level on the CAN bus it would cause issues, as the CAN bus is basically unmodulated single ended open collector digital. However, when using PWM, the worst you'd get is a shorter than expected pulse width at a frequency that is possibly wrong, unless you have a periodic source of interference (unlikely in this case at the normal center frequency this system uses). So basically PWM would be more noise immune, mostly because the Jaguars aren't fast enough or powerful enough electrically to instantly decompose it and overcome the load's inertia in response.

For the most part, the biggest problem we've had at startup with the cRIO using Java has been the bridge Jaguar just outright failing or the bus being improperly terminated. We had one Jaguar that just literally up and died once we turned it off. It was raining that day and we thought maybe water got into it somehow, but I inspected it and it was dry.

Zme 09-03-2011 14:50

Re: Unexplained intermittent CAN / 2CAN Jaguar problems at GSR
 
I can provide a little more information on something that seems to behave similarly to this problem.
We are using C++ and have had similar issues with Jaguars suddenly no longer responding to commands when using a closed-loop control mode.
We noticed that it generally happened when there was a fault on the Jaguar (current, voltage, etc.). When this happened, the Jag stopped working. We thought that perhaps the fault caused the heartbeat to time out, so the Jag would no longer respond to commands.
The fix we tried was running back through the initialization of the Jaguar (setting PIDs and enabling control) whenever we detected a fault. This seemed to alleviate the problem but didn't catch everything. We then put a button on the joystick that would run through the re-init, and while it didn't solve the problem, it made things bearable.

jhersh 09-03-2011 14:50

Re: Unexplained intermittent CAN / 2CAN Jaguar problems at GSR
 
Quote:

Originally Posted by techhelpbb (Post 1037101)
We usually don't have connectivity issues at startup. We'll be driving and all of a sudden we'll lose one or more Jaguars off the bus with timeout or communications errors (I haven't looked at the debug information myself; I'm relying on student feedback for the specific details).

The odd part is that in many cases these Jaguars are performing similar tasks to other Jaguars in the system, so it's not something specific to their actuators.

The Jaguar will drop off the bus, it'll come back and won't respond to further adjustments until we soft boot it. Hence we detect when they fail like this and automatically force a soft boot at this point.

Our most specific issue with this has been on the drive system. We have 2 Jaguars per side connected to CIM motors in the CIMple gearboxes. We split the encoders, they are 100% isolated from each other, and we want to run PID to target a velocity setpoint (it works fine when we don't get timeouts). Even with potentiometers we've seen this (but those Jaguars are on high ratio, shallow pitch worm drives). It seems to happen less when we use CAN for Vbus and hence lose the external reference, but it still does happen.

Given that we'll be fine for protracted periods of time and then suddenly experience a timeout under hard driving conditions, I'm led to believe we have some sort of noise issue at work. Obviously, if there was a spike that reached logic level on the CAN bus it would cause issues, as the CAN bus is basically unmodulated single ended open collector digital. However, when using PWM, the worst you'd get is a shorter than expected pulse width at a frequency that is possibly wrong, unless you have a periodic source of interference (unlikely in this case at the normal center frequency this system uses). So basically PWM would be more noise immune, mostly because the Jaguars aren't fast enough or powerful enough electrically to instantly decompose it and overcome the load's inertia in response.

For the most part, the biggest problem we've had at startup with the cRIO using Java has been the bridge Jaguar just outright failing or the bus being improperly terminated. We had one Jaguar that just literally up and died once we turned it off. It was raining that day and we thought maybe water got into it somehow, but I inspected it and it was dry.

Are you certain you aren't tripping a breaker or browning out the Jaguars under high load (due to poor wiring to the power input terminals of the Jaguar)? What you described are all symptoms of the Jag rebooting.

-Joe

jhersh 09-03-2011 14:53

Re: Unexplained intermittent CAN / 2CAN Jaguar problems at GSR
 
Quote:

Originally Posted by Zme (Post 1037115)
I can provide a little more information on something that seems to behave similarly to this problem.
We are using C++ and have had similar issues with Jaguars suddenly no longer responding to commands when using a closed-loop control mode.
We noticed that it generally happened when there was a fault on the Jaguar (current, voltage, etc.). When this happened, the Jag stopped working. We thought that perhaps the fault caused the heartbeat to time out, so the Jag would no longer respond to commands.
The fix we tried was running back through the initialization of the Jaguar (setting PIDs and enabling control) whenever we detected a fault. This seemed to alleviate the problem but didn't catch everything. We then put a button on the joystick that would run through the re-init, and while it didn't solve the problem, it made things bearable.

This also sounds like your Jags are rebooting due to a brown out of the input voltage. If you brown out the Jag or trip a breaker, the closed-loop settings are lost. You will have to re-initialize them (as you are) to recover the closed-loop control.

-Joe
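
The recovery described above (re-applying the closed-loop settings after a Jaguar reboots) can be folded into the normal setpoint path so it does not depend on a joystick button. This is a rough sketch under stated assumptions: the `Controller` interface, `ClosedLoopConfig`, and `SetWithRecovery` are hypothetical stand-ins rather than the WPILib CANJaguar class, and it assumes the controller can report that it has power cycled since it was last asked. `FakeController` is there only so the logic can be exercised on a bench without hardware.

```cpp
#include <cassert>

// Closed-loop gains to restore after a reboot (illustrative fields only).
struct ClosedLoopConfig {
    double p;
    double i;
    double d;
};

// Hypothetical minimal motor controller interface (not the WPILib API).
class Controller {
public:
    virtual ~Controller() = default;
    // True if the device rebooted since this was last called.
    virtual bool PowerCycled() = 0;
    virtual void Configure(const ClosedLoopConfig& cfg) = 0;
    virtual void EnableControl() = 0;
    virtual void Set(double setpoint) = 0;
};

// Re-applies the configuration if the controller has rebooted, then sends
// the setpoint. Returns true when a re-initialization was performed.
bool SetWithRecovery(Controller& c, const ClosedLoopConfig& cfg, double setpoint) {
    const bool reinit = c.PowerCycled();
    if (reinit) {
        c.Configure(cfg);
        c.EnableControl();
    }
    c.Set(setpoint);
    return reinit;
}

// Bench stand-in that records what was called.
class FakeController : public Controller {
public:
    bool pendingReboot = false;
    int configureCalls = 0;
    bool enabled = false;
    double lastSetpoint = 0.0;

    bool PowerCycled() override {
        const bool was = pendingReboot;
        pendingReboot = false;  // flag is cleared once it has been read
        return was;
    }
    void Configure(const ClosedLoopConfig&) override { ++configureCalls; }
    void EnableControl() override { enabled = true; }
    void Set(double setpoint) override { lastSetpoint = setpoint; }
};
```

This only restores settings after the fact. It does not address the underlying brownout or tripped breaker, which still needs to be fixed in the wiring.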


All times are GMT -5.
