Chief Delphi

706 had the curse of all curses (http://www.chiefdelphi.com/forums/showthread.php?t=93536)

Bruceb 13-03-2011 14:27

706 had the curse of all curses
 
Hello to all the great teams at the Milwaukee Regional. I am sure most of you were aware that 706 was having unprecedented problems at Milwaukee this weekend. Not looking for sympathy here, just going to list our problems and maybe get feedback on what we may have done wrong.
Our problems started at the mini regional at Sussex 2 weeks prior, and because we could not find the issue there we still had them at Milwaukee.
Problem 1.
Frequently (every 15-30 seconds) we would lose comm. The Classmate would report no robot code/no robot communication. This would last for 1-5 seconds. We never did figure this problem out.
Problem 2.
Our solenoid breakout boxes would glitch (the lights would go out and the solenoids would lose power for 0.5-2 seconds). This started out as a minor annoyance happening every minute or so, but by Friday morning it had turned into a major problem happening every couple of seconds. This was finally corrected by replacing the Classmate.
Problem 3.
Our autonomous code worked perfectly in the pit and on the practice field, but on the field it would quit right after we lifted the ubertube to the top, and then when the field shifted to tele we would not get control for 30 seconds or so. We fixed this by giving up on the autonomous. We still don't understand it, as we have no way to test on the field. One of our mentors thinks he may have found a coding issue that explains the failure to move forward, but not the delay in tele.
We were trying different things during every break we had. We had all the experts from FIRST and NI in our pits constantly trying to help figure out these problems. We replaced the PDB. Nope. We replaced the cRIO. No fix.
We replaced the main breaker, as the techs were seeing a major voltage drop (down to 9 volts at times) even when we were not moving. Not that either. We replaced the 5 volt converter. Not that either. We powered the cRIO from an AC power supply. Still no help.
We had a reasonably good minibot and deploy so we were luckily picked as the second pick of the number 8 seed so we went to the prom.
Right before the match started, the thought was that the only thing we had not replaced was the solenoid breakout boxes. We had 2, both running on 12 volts. Time was tight, and while doing other things we enlisted the help of an NI person to replace those. We just got that done in time to get to the field, so we had no time to test. BAD. We got to the game and none of the controls for our arm were working right. Worst of all, the pneumatic ratchet for our winch would not release, so we could only go up.
Turns out that while the breakout boxes were being replaced, some of the wires to the solenoid valves had gotten unplugged from the adapter on the breakout box and had gotten plugged back in backwards.
Didn't get that fixed in time for the second match so we had to call a sub for us.
Last but not least, while we were still trying to troubleshoot the manipulator problem, our driver was looking at the minibot and switched it on, but nothing happened. I looked at it, as I had just replaced the battery. Turns out that one of the pins in the plug had pushed out so it was not making contact.
All I could do at that time was look at him and laugh.
WHAT ELSE !!!!!!!!!!!!!!!!!

Hope we have this jinx busted before next year.

Best of luck to you all.

Bruce

DonRotolo 13-03-2011 15:07

Re: 706 had the curse of all curses
 
My condolences.

We had a similar issue, seen in practice and competition. We were fortunate to have Greg from NI at the NJ regional take a close look at things, and his conclusion was that our software was overtaxing the cRio. That is, when we kept the cRio too busy, after a while it eventually had to stop and do some housekeeping, which is when the outputs dropped (exactly what you are seeing).

To help solve this, we removed all unnecessary code (like the vision processing and several debugging routines). We're now examining our PID loops to make sure they are running at reasonable rates (perhaps 10 Hz instead of the max the CPU can support). We're also checking all conditionals and wait states, since while we force the cRio to wait it can't do ANYthing...

Perhaps a careful review of your code may be of value.
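
As a rough illustration of running a loop at a deliberately modest rate, here is a minimal sketch in standard C++ (not WPILib or VxWorks code). RunAtFixedRate and its parameters are hypothetical names invented for this example; the point is only that the loop sleeps until its next deadline instead of running as fast as the CPU allows.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper, not part of WPILib: calls step() `iterations` times at
// roughly `hz` passes per second, sleeping between passes so other tasks can
// run. Returns the number of passes whose work overran the period.
template <typename Step>
int RunAtFixedRate(double hz, int iterations, Step step) {
    assert(hz > 0.0 && iterations >= 0);
    using Clock = std::chrono::steady_clock;
    const auto period = std::chrono::duration_cast<Clock::duration>(
        std::chrono::duration<double>(1.0 / hz));
    int late = 0;
    auto next = Clock::now() + period;
    for (int i = 0; i < iterations; ++i) {
        step();                               // e.g. one PID update
        if (Clock::now() > next) ++late;      // the work took longer than one period
        std::this_thread::sleep_until(next);  // block until the next deadline
        next += period;
    }
    return late;
}
```

A call such as RunAtFixedRate(10.0, 100, update) would run a hypothetical update() about ten times per second for ten seconds; a non-zero return value hints that the work does not fit in the chosen period.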

XaulZan11 13-03-2011 15:16

Re: 706 had the curse of all curses
 
Probably the biggest surprise of the event, I thought, was how much 706 struggled. After a very strong finalist performance last year, and what looked like another strong robot, I expected you guys to be one of the best robots at the event. Hopefully you get this all figured out next year and are ready to be a top team again.

Kevin Ray 13-03-2011 15:26

Re: 706 had the curse of all curses
 
Man, I feel your pain. We've had two regionals like that. They happened on two different years and they were enough to make me say, "That's it, I quit". I said it both times, in '03 and '04--needless to say I'm still here. The one thing I've learned from those times is that they were the most beneficial experiences for following years. I know that that is the last thing you want to hear--perfunctory motivational comments--but it's true.
We now ALWAYS threadlock screws, support BOTH sides of output shafts from motors and gearboxes, have two people complete a checklist prior to each match etc... These and a myriad of other measures have kept Murphy [law] at bay. These are the seasons that make the wins taste so much better--and believe me, they will be delicious!!!!

Ether 13-03-2011 15:40

Re: 706 had the curse of all curses
 
Quote:

Originally Posted by DonRotolo (Post 1038599)
while we force the cRio to wait it can't do ANYthing...

That's not strictly true is it? Even if you are spinning in a loop polling the clock or some hardware or variable, other threads at the same priority get time-sliced do they not? And any higher-priority tasks that become runnable (due to say some hardware event) will preempt. The OS can preempt if it has important work to do.

Better yet, don't spin in a polling loop. Use blocked waiting (or whatever it's called in your language). This releases the processor to go do other work until whatever resource or event you are waiting for becomes available.
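
To make the difference concrete, here is a minimal sketch of blocked waiting in standard C++11. It is not the cRIO or WPILib API, and EventFlag is a hypothetical class written only for this illustration. The waiting thread sleeps inside the OS until it is signaled or the timeout expires, so it uses no CPU while it waits.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical helper: a one-shot event that a thread can block on.
class EventFlag {
public:
    // Mark the event as having happened and wake any waiting threads.
    void Signal() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            set_ = true;
        }
        cv_.notify_all();
    }

    // Block (without spinning) until Signal() has been called or the timeout
    // expires. Returns true if the event was signaled.
    bool WaitFor(std::chrono::milliseconds timeout) {
        assert(timeout.count() >= 0);
        std::unique_lock<std::mutex> lock(mutex_);
        return cv_.wait_for(lock, timeout, [this] { return set_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool set_ = false;
};
```

Contrast this with a loop that keeps re-reading a flag or the clock: that loop stays runnable the whole time and competes with every other thread at its priority.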



DonRotolo 13-03-2011 15:56

Re: 706 had the curse of all curses
 
Depends on how you implement it. Good practice for a 500 mSec "wait" is to look at the clock every so often and when it's advanced that 1/2 second, you stop waiting. But it certainly is possible to have the entire system come to a grinding halt to literally wait.

Guess what we did?

boomergeek 13-03-2011 16:16

Re: 706 had the curse of all curses
 
"That's not strictly true is it?"

My understanding is there are several tasks that run "timeshared" with the user program. The default teleopcontinuous() method provides a 1 millisecond task delay every cycle.

If a team removes this task delay by defining a different teleopcontinuous() in an attempt to run their code faster, then they actually reduce the overall performance, because the operating system ends up with fewer chances to run both higher and lower (and the same) priority tasks.

The complexity and processing that some teams attempt to put in their code can overstress the CRio. The library of canned code available to monitor the performance of the system is sort of non-existent.

Ten years ago, I did some time as a performance SW developer responsible for assuring fiber optic telecommunications systems would be able to always switch within 25 ms of fiber cuts and other failures under VXWorks.

Anyone who has spent time in performance software can tell you that time-logging critical events, measuring the CPU resources being used up, and monitoring the queues of messages between tasks for all the processes in the system are critical to understanding how close to overload a system is operating.

I'm guessing that more experienced teams that don't actually log/monitor system performance have developed rules of thumb for themselves to know if they are in the safe zone of loading a system down without overloading it.

I'm toying with the idea of getting team 241 to create a library of performance and logging tools and then share it with the FIRST community.

Then during practice a team can determine if it is coming close to using up scarce resources.
And if a special problem occurs during a field play that never showed up during practice, then a file logging of all the key parameters can be used to diagnose "misbehavior".

It would be a lot of work, but looking at perplexed teams at Regionals and reading here on CD, I perceive a serious need.
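
As a small taste of what such a tool might look like, here is a minimal sketch in standard C++. LoopStats and everything in it are hypothetical names for this example, not an existing WPILib or team 241 library. It only tracks how long each pass of a loop took against a chosen time budget; where the timestamps come from is left to the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>

// Hypothetical helper: accumulates loop-period statistics so a team can see
// how close a loop runs to its time budget.
class LoopStats {
public:
    explicit LoopStats(double budget_seconds) : budget_(budget_seconds) {}

    // Record how long one pass of the loop took, in seconds.
    void Record(double period_seconds) {
        assert(period_seconds >= 0.0);
        ++count_;
        total_ += period_seconds;
        worst_ = std::max(worst_, period_seconds);
        if (period_seconds > budget_) ++overruns_;
    }

    int Count() const { return count_; }
    int Overruns() const { return overruns_; }
    double Worst() const { return worst_; }
    double Mean() const { return count_ > 0 ? total_ / count_ : 0.0; }

    // Print a one-line summary, for example once per second or at disable.
    void Print() const {
        std::printf("passes=%d mean=%.4fs worst=%.4fs overruns=%d\n",
                    count_, Mean(), worst_, overruns_);
    }

private:
    double budget_;
    double total_ = 0.0;
    double worst_ = 0.0;
    int count_ = 0;
    int overruns_ = 0;
};
```

In practice the worst case and the overrun count tend to be more revealing than the mean, since an occasional long pass is exactly the kind of event that is easy to miss by eye.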

Ether 13-03-2011 16:41

Re: 706 had the curse of all curses
 
Quote:

Originally Posted by DonRotolo (Post 1038628)
Good practice for a 500 mSec "wait" is to look at the clock every so often and when it's advanced that 1/2 second, you stop waiting.

It sounds like what you are describing is this: something happens in TeleOp (for example) which necessitates waiting for 500ms before taking some action.

Instead of sitting there waiting for 500ms to elapse (which would indeed cause the rest of your TeleOp code to freeze... but not other threads at the same or higher priority), you proceed with the rest of the TeleOp code. Each pass through TeleOp you check to see if the 500ms has elapsed and, if so, you execute the code in question. Is that what you meant?
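
A minimal sketch of that pattern in standard C++ follows. NonBlockingDelay is a hypothetical helper, not a WPILib class: the loop arms it when the triggering event happens, then asks on every pass whether the delay has elapsed, and never sleeps for the full 500ms.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: remembers a deadline so the control loop can keep
// running instead of sleeping for the whole delay.
class NonBlockingDelay {
public:
    using Clock = std::chrono::steady_clock;

    // Arm the timer: the action becomes due `delay` after `now`.
    void Start(Clock::time_point now, std::chrono::milliseconds delay) {
        assert(delay.count() >= 0);
        deadline_ = now + delay;
        armed_ = true;
    }

    // Call once per loop pass. Returns true exactly once, on the first pass
    // at or after the deadline, and then disarms itself.
    bool Expired(Clock::time_point now) {
        if (armed_ && now >= deadline_) {
            armed_ = false;
            return true;
        }
        return false;
    }

    bool Armed() const { return armed_; }

private:
    Clock::time_point deadline_{};
    bool armed_ = false;
};
```

Inside the loop this would read roughly as: call Start(Clock::now(), 500ms) when the event occurs, and on each later pass run the delayed action only if Expired(Clock::now()) returns true.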

Quote:

But it certainly is possible to have the entire system come to a grinding halt to literally wait.
I can see where the thread that is waiting would come to a grinding halt, and possibly make it look like the entire system has stopped. But other threads at the same or higher priority should continue to execute.




Ether 13-03-2011 16:48

Re: 706 had the curse of all curses
 
Quote:

Originally Posted by boomergeek (Post 1038642)
the operating system ends up with fewer chances to run both higher and lower (and the same) priority tasks.

My understanding of how Java, for example, is implemented on Windows platforms is that the highest priority runnable thread runs to completion and is not preempted by lower-priority threads. If there are multiple runnable threads at the same (highest) priority, they will be time-sliced.

Are you saying that Java does not behave this way on the cRIO?



boomergeek 13-03-2011 19:42

Re: 706 had the curse of all curses
 
My understanding is that in a realtime operating system (such as VxWorks), a particular task runs until the system tick is complete or a system call is made that allows preemption (to allow a higher priority task to run). Lower priority tasks can run if the task gives up the processor (e.g., with a taskDelay system call).
This is a function of the realtime OS and not of whether Java or C++ is the programming language.

Ether 13-03-2011 21:38

Re: 706 had the curse of all curses
 
Quote:

Originally Posted by boomergeek (Post 1038759)
My understanding is that in a realtime operating system (such as VxWorks), a particular task runs until the system tick is complete or a system call is made that allows preemption (to allow a higher priority task to run). Lower priority tasks can run if the task gives up the processor (e.g., with a taskDelay system call).
This is a function of the realtime OS and not of whether Java or C++ is the programming language.

Java (on Windows platforms) always runs the runnable thread with the highest priority. Lower priority threads run only if there are no runnable threads of higher priority. If there is more than one runnable thread at the same priority as the presently running thread, then Java (on Windows platforms) time-slices those threads to run them concurrently. I was wondering if anyone knows definitively if it works the same way on the cRIO with vxworks.



boomergeek 13-03-2011 23:11

Re: 706 had the curse of all curses
 
Quote:

Originally Posted by Ether (Post 1038875)
Java (on Windows platforms) always runs the runnable thread with the highest priority. Lower priority threads run only if there are no runnable threads of higher priority. If there is more than one runnable thread at the same priority as the presently running thread, then Java (on Windows platforms) time-slices those threads to run them concurrently. I was wondering if anyone knows definitively if it works the same way on the cRIO with vxworks.


I don't have access to anything definitive about the cRIO, but all implementations of VxWorks I am familiar with use "system ticks" and a preemptive kernel.

http://www.embeddedheaven.com/real-time-os-basics.htm

Preemptive kernels
In preemptive systems, the kernel scheduler is called with a defined period, each tick. Each time it is called it checks if there is a ready-to-run thread which has a higher priority than the executing thread. If that is the case, the scheduler performs a context switch. This means that a thread can be preempted – forced to go from executing to ready state – at any point in the code, something that puts special demands on communication between threads and handling common resources.

Using a preemptive kernel solves the problem where a high priority thread has to wait for a lower priority thread to yield the processor. Instead, when the high priority thread becomes ready to run, the lower priority thread will become preempted, and the high priority thread can start to execute.

Most commercial RTOSes support preemption.

From this, you can expect that a running lower priority task will continue to run for up to a system tick even if other higher priority tasks become ready to run. (That is one reason why you have to be careful not to waste processing time even in low priority tasks: it affects how often the higher priority tasks get a chance to run, especially if the higher priority tasks frequently make system calls that produce wait states.)

petet4 14-03-2011 00:44

Re: 706 had the curse of all curses
 
A little more detail on the cRio issue. I'm the software mentor for team 706 and am in the process of trying to debug the intermittent cRio mini resets, which I now understand to be a 2nd watchdog within the cRio. We created a new project with a teleop loop consisting of 2 lines, where shiftin is a solenoid output. When teleop mode was enabled, within a couple of minutes the solenoid LED would intermittently turn off for about 0.25 sec.

void OperatorControl(void)
{
    while (IsOperatorControl())
    {
        shiftin.Set(true);  // hold the solenoid output on
        Wait(0.01);         // wait 10 ms each pass
    }
}

There was no predictable rate at which it would go out. Sometimes it would run for minutes with no issues and sometimes it would go out several times within 10-20 seconds.

One thing that would make things better was to use a different Classmate, but the problem never went away. The first Classmate, which showed the problem most often, was running 23 processes with about 0.85% network usage and ~25% CPU usage. The 2nd Classmate, which worked better, had 47 processes running with about the same network and CPU usage as the first.

The software on the Classmate was uninstalled and re-installed, which did not make any difference. There were many other things tried. When we get the robot back I plan to analyze the network traffic between the cRio and the driver station to get to the root cause of this issue.

-pete

boomergeek 14-03-2011 09:24

Re: 706 had the curse of all curses
 
Quote:

Originally Posted by petet4 (Post 1039049)
A little more detail on the cRio issue. I'm the software mentor for team 706 and am in the process of trying to debug the intermittent cRio mini resets, which I now understand to be a 2nd watchdog within the cRio. We created a new project with a teleop loop consisting of 2 lines, where shiftin is a solenoid output. When teleop mode was enabled, within a couple of minutes the solenoid LED would intermittently turn off for about 0.25 sec.

void OperatorControl(void)
{
    while (IsOperatorControl())
    {
        shiftin.Set(true);  // hold the solenoid output on
        Wait(0.01);         // wait 10 ms each pass
    }
}

There was no predictable rate at which it would go out. Sometimes it would run for minutes with no issues and sometimes it would go out several times within 10-20 seconds.

One thing that would make things better was to use a different Classmate, but the problem never went away. The first Classmate, which showed the problem most often, was running 23 processes with about 0.85% network usage and ~25% CPU usage. The 2nd Classmate, which worked better, had 47 processes running with about the same network and CPU usage as the first.

The software on the Classmate was uninstalled and re-installed, which did not make any difference. There were many other things tried. When we get the robot back I plan to analyze the network traffic between the cRio and the driver station to get to the root cause of this issue.

-pete


Is it possible that you have a mix of 2010 or earlier code and some 2011 code?
The user WatchDog by default in 2011 code is NOT enabled. The MotorSafety class is used instead. Refer to the WPI Robotics Library User’s Guide.

If your initialization was from 2010 and enabled watchdogs, then the watchdog should be "fed" periodically in your OperatorControl loop.
Look at an example from 2010:
http://code.google.com/p/chopshop-16...6495b511d cce


"WPI Robotics Library User’s Guide":
Note: The user watchdog timer is being replaced with the MotorSafety interface. MotorSafety is the preferred way to provide safe operation of motors on your robot. By default,
the user watchdog is disabled (changed from 2010) and the MotorSafety timeouts are enabled by default on the RobotDrive object and not enabled by default on Victors and Jaguars.

petet4 14-03-2011 11:06

Re: 706 had the curse of all curses
 
Our code started with the LineFollower example supplied for 2011. In this example there is no teleop code. The 2 lines to turn on the solenoid and wait were added to the teleop loop. So I don't believe we used a mix of 2010 & 2011 code unless something is wrong in the Windriver environment. This was something we tried in an effort to troubleshoot this issue. All updates were performed.

I believe it has to be something in the network communications traffic between the Classmate and the cRio, as things would change when a different Classmate was used.

-pete

AllenGregoryIV 14-03-2011 11:35

Re: 706 had the curse of all curses
 
I am no expert in this field or anything, but one of the first things I would have tried would have been to change the radio. I didn't notice any mention of replacing or re-flashing the radio. We had weird issues at our practice field with the new radio; we had to switch to last year's setup for practice. The new radio worked fine for us at competition.

Bruceb 14-03-2011 11:47

Re: 706 had the curse of all curses
 
Ya I forgot to mention that. The radio was changed Saturday with a freshly flashed one. No difference in the lost comm problem.
Pete will chime in here. He knows more than I do but I think we tried EVERYTHING we could think of. I didn't even get to see any of the other pits.
Bruce

NullEntity 14-03-2011 12:11

Re: 706 had the curse of all curses
 
Did you try reimaging the cRIO and the Classmate?

petet4 14-03-2011 16:09

Re: 706 had the curse of all curses
 
The list of things tried / changed was numerous:
- radio was swapped plus bypassed
- cRio was re-imaged, and a different cRio was tried
- the Classmate software was un-installed then re-installed (we did not do a total re-image of the Classmate yet)
- all Ethernet cables were replaced
- camera was disconnected, both power and Ethernet
- all modules in the cRio were swapped
- a different PD was tried
- a different radio power module was swapped in
- all fuses to all other electronics were removed when the small test program was tried

You name it, we pretty much tried it in the amount of time we had.

As I mentioned, when the robot comes back home we will be digging into the root cause of this issue. I feel it has something to do with the TCP/IP packets between the driver station and the cRio.

-pete

KStout#3536 14-03-2011 16:23

Re: 706 had the curse of all curses
 
We had the same problem as your Problem 1. We found out that our Classmate didn't have the newest update on it; once we updated it, everything worked peachy, haha.

Bruceb 14-03-2011 17:10

Re: 706 had the curse of all curses
 
I think we needed to try an exorcism. Seriously.
Grr wish we could have figured it out. I think we had a contending robot with a good mini and better than average deploy. Even beat wildstang to the top of the pole once. Now that's an accomplishment.

I'll be baak.

Bruce

billbo911 14-03-2011 17:29

Re: 706 had the curse of all curses
 
Before I start, I want to say, "I know your pain". As has been stated already, we learn the most in these situations.

Just a bit of philosophy here: It is not your circumstances that define you, it is how you respond to them.

It feels good to say that, but SUCKS when you are the one hearing it.:o

One more thought on the issues you were having. I didn't notice anywhere that you checked or changed out any motors. By chance, are you running any 775's?

Eagleeyedan 14-03-2011 17:32

Re: 706 had the curse of all curses
 
You guys still have an eager safety captain who wants to learn more! I was teaching him some stuff so that's one good thing that happened!

Hjelstrom 14-03-2011 17:38

Re: 706 had the curse of all curses
 
We had similar "random" unresponsive controls and strange bugs in autonomous which we couldn't explain. Sometimes only one side of the drivetrain would work, or the arm might extend but not lift, the robot might sit still for a random amount of time and then come to life, etc. It turned out to be our arm motor (banebots I think) which had one pole of the motor internally shorted to the motor's case.

It took us 4 or 5 days of hard troubleshooting to find. We tried everything similar to what you describe and more, including swapping out the cRio with one from a neighbor school that no longer competes, swapping the power distribution board, rewiring most of the robot, poring over every line of code many times over, etc.

Bruceb 14-03-2011 19:21

Re: 706 had the curse of all curses
 
Other than CIMS the only motor we used was the FP in the FP gearbox.
Bruce

bscharles 14-03-2011 22:32

Re: 706 had the curse of all curses
 
Team 930 had similar problems. We had weird errors and were not able to drive the robot for half of our matches. One thing that happened was that during one match, the robot (controls) worked completely fine. No software was changed. All we did was bring the robot back to the pit, finish some mechanical details, and send it out for the next match. During the next match, however, we had watchdog errors and could not drive the robot. We pulled off the controls (cRIO, radio, and Jaguars) before shipping, and will be sure to test all of these to see if they were causing the problems. (One major factor we found was the Classmate. Running tethered in the pits, most of the errors went away after switching to a different laptop.)

boomergeek 14-03-2011 22:52

Re: 706 had the curse of all curses
 
One interesting observation that we made this week on our practice robot is that while attempting to download new code on a tether, we kept getting errors.
We would reboot the CRio and it would seem alright by its LED sequence but we finally noticed that the Ethernet port on the CRio would not light either of its LEDs.
We replaced with a fresh battery and everything was fine.
I only point this out because it seems that in our case, the Ethernet port on the cRIO needs more voltage or current than the other functions of the cRIO.

If this is actually consistently the case, then it is very possible that low voltage can first lead to communication errors without indications of other failures within the motors, controllers, cRIO, etc.

Has anyone done any intentional brownout testing to determine the order in which control fails as voltage is slowly lowered?

Combine this with an intermittently shorting motor and you have a difficult quandary to diagnose.

MemberTeam2029 14-03-2011 23:48

Re: 706 had the curse of all curses
 
Did Team 706 use Labview, C++, or Java to control their robot?

Bruceb 15-03-2011 09:26

Re: 706 had the curse of all curses
 
C++

Al Skierkiewicz 15-03-2011 14:29

Re: 706 had the curse of all curses
 
Pete,
This has been bugging me for days. I have been talking this over with our software team and they are as stumped as I am. I know this doesn't fix the weekend, but at this point the only way to attack this is to disable chunks of the code to try to track down where, and if, it is in the software. Can you try just using a simple program first to make sure? Say, just run a compressor routine and see what happens. We changed all the hardware I could think of, so that kind of points to code as the fault. I suppose it is possible that you have a freak cRIO chassis.
For those watching, (you too Don) this robot would just blink out randomly. Sometimes at 7-10 seconds and sometimes longer and the interruption would last a second or two. Not the normal watchdog issue we see/saw frequently as described earlier. Just seems that everything shuts down. They even tried running the Crio from a bench supply.

boomergeek 15-03-2011 16:24

Re: 706 had the curse of all curses
 
Have you tried loading the BuiltinDefaultCode robot example shipped with the CRio?
If that shows the problem, then it's obviously not the software.

petet4 16-03-2011 09:45

Re: 706 had the curse of all curses
 
Hi Al,
See post #13 of this thread. A new project was started using the default LineFollower code. In that code there is no teleop code. In this section the 2 lines were added: turn on 1 solenoid and wait 0.01. Along with these couple of lines, the init section was changed to add the solenoid to the initialization.

Running this code produced the same results, with the solenoid LED randomly turning off for brief periods when enabled.

-pete

boomergeek 16-03-2011 12:19

Re: 706 had the curse of all curses
 
Have you tried strategic placement of printf calls to determine whether the code is leaving the OperatorControl() method, possibly because IsOperatorControl() occasionally becomes false?
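
One way to do that without flooding the console is to print only when the value changes. The sketch below is standard C++; FlagWatcher is a hypothetical helper rather than anything in WPILib, and the timestamp is simply whatever clock value the caller passes in. Because the while loop in OperatorControl() exits as soon as the flag goes false, the watcher would have to be updated from code that keeps running across that transition.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: watches a boolean (for example the value returned by
// IsOperatorControl()) and prints a line only when it changes.
class FlagWatcher {
public:
    FlagWatcher(const char* name, bool initial) : name_(name), last_(initial) {
        assert(name != nullptr);
    }

    // Call once per pass with the current value and a timestamp in seconds.
    // Returns true if the value changed since the previous call.
    bool Update(bool current, double now_seconds) {
        if (current == last_) return false;
        std::printf("[%8.3f] %s: %d -> %d\n", now_seconds, name_,
                    static_cast<int>(last_), static_cast<int>(current));
        last_ = current;
        ++changes_;
        return true;
    }

    int Changes() const { return changes_; }

private:
    const char* name_;
    bool last_;
    int changes_ = 0;
};
```

If the solenoid dropouts line up with logged transitions, the mode flag is the place to look; if the flag never changes during a dropout, the cause is more likely below the user code.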

Did the cRIO flashing option include a reformat (to get rid of any extraneous .out files)?

And because we have burned ourselves so many times by not keeping Windows->Preferences->FIRST Downloader Preferences properly up to date, we add:
printf("\n\n##########################################################");
printf("\n 2011 Astro Robot code version 2011_v1.0");
printf("\n Base Drivetrain #0");
printf("\n Build Date: %s at %s ", __DATE__, __TIME__);
printf("\n Clock Rate: %f /second", (double) sysClkRateGet());
printf("\n##########################################################\n\n");

to the beginning of RobotInit(), just to be sure that what we are running matches our intentions.

Al Skierkiewicz 16-03-2011 12:37

Re: 706 had the curse of all curses
 
Pete,
Have you just tried simple code? Like just download the compressor code, nothing else, and see if it runs normally when enabled.

olapmonkey 24-03-2016 19:36

Re: 706 had the curse of all curses
 
At 706, we have decided to go with redundant power to our radio. It seems the barrel jacks we had for our radio (one from the kit of parts and one from AndyMark) possibly were both mis-shipped and may have been 2.5mm instead of the specified 2.1mm. So we have replaced our barrel jack, but so as not to take any chances we are also redundantly supplying the radio with power over passive PoE. We can remove either power source individually and our radio will not lose power.

Hadi379 24-03-2016 20:00

Re: 706 had the curse of all curses
 
Could you post a photo of the PoE setup?

Trevor1523 24-03-2016 22:22

Re: 706 had the curse of all curses
 
Not to be rude, but I actually think 1523 had the curse of all curses at Orlando a few weeks ago ;).

Your weekend sounds like a dream!

Best of luck fixing your issues.

olapmonkey 24-03-2016 23:46

Re: 706 had the curse of all curses
 
Quote:

Originally Posted by Hadi379 (Post 1562771)
Could you post a photo of the PoE setup?

I'm not sure a photo will help too much as it's all buried in the guts of the robot now.

We split the Ethernet cable apart and the brown pair gets wired into the negative 12V terminal on the VRM and the blue pair gets wired into the positive terminal on the VRM. The VRM has 2 sets of 12V terminals, so we use one for the barrel jack and one for the PoE. The PoE cable is then patched from the RoboRio to the radio 12V Ethernet port...where the radio end gets the standard 4 pairs all coming in and the RoboRio end just has 2 pairs to carry just the data (on pins 1,2,3,6). So pins 4,5,7,8 at the radio are carrying the power.

Hope that helps.


All times are GMT -5.

Copyright © Chief Delphi