#1
706 had the curse of all curses
Hello to all the great teams at the Milwaukee Regional. I am sure most of you were aware that 706 was having unprecedented problems at Milwaukee this weekend. Not looking for sympathy here, just going to list our problems and maybe get feedback on what we may have done wrong.
Our problems started at the mini regional at Sussex two weeks prior, and because we could not find the issue there we still had them at Milwaukee.

Problem 1: Frequently (every 15-30 seconds) we would lose comm. The Classmate would report no robot code / no robot communication. This would last for 1-5 seconds. We never did figure this problem out.

Problem 2: Our solenoid breakout boxes would glitch (the lights would go out and the solenoids would lose power for 0.5-2 seconds). This started out as a minor annoyance happening every minute or so, but by Friday morning it had turned into a major problem happening every couple of seconds. This was finally corrected by replacing the Classmate.

Problem 3: Our autonomous code worked perfectly in the pit and on the practice field, but on the field it would quit right after we lifted the ubertube to the top, and then when the field shifted to tele we would not get control for 30 seconds or so. We fixed this by giving up on the autonomous. We still don't understand it, as we have no way to test on the field. One of our mentors thinks he may have found a coding issue for the failure to move forward, but not for the delay in tele.

We were trying different things during every break we had. We had all the experts from FIRST and NI in our pits constantly trying to help figure out these problems. We replaced the PDB. Nope. We replaced the cRIO. No fix. We replaced the main breaker, as the techs were seeing a major voltage drop (down to 9 volts at times) even when we were not moving. Not that either. We replaced the 5 volt converter. Not that either. We powered the cRIO from an AC power supply. Still no help.

We had a reasonably good minibot and deploy, so we were luckily picked as the second pick of the number 8 seed, so we went to the prom. Right before the match started, the thought was that the only thing we had not replaced was the solenoid breakout boxes. We had two, both running on 12 volts. Time was tight, and while doing other things we enlisted the help of an NI person to replace those. We just got that done in time to get to the field, so we had no time to test. BAD. We got to the game and none of the controls for our arm were working right. Worst of all, the pneumatic ratchet for our winch would not release, so we could only go up. It turns out that while the breakout boxes were being replaced, some of the wires to the solenoid valves had gotten unplugged from the adapter on the breakout box and had gotten plugged back in backwards. We didn't get that fixed in time for the second match, so we had to call a sub for us.

Last but not least, while still trying to troubleshoot the manipulator problem, our driver was looking at the minibot and switched it on, but nothing happened. I looked at it, as I had just replaced the battery. It turns out that one of the pins in the plug had pushed out, so it was not making contact. All I could do at that time was look at him and laugh. WHAT ELSE!!!!!!!!!!!!!!!!

Hope we have this jinx busted before next year. Best of luck to you all.

Bruce
#2
Re: 706 had the curse of all curses
My condolences.
We had a similar issue, seen in practice and competition. We were fortunate to have Greg from NI at the NJ regional take a close look at things, and his conclusion was that our software was overtaxing the cRio. That is, when we kept the cRio too busy, after a while it eventually had to stop and do some housekeeping, which is when the outputs dropped (exactly what you are seeing). To help solve this, we removed all unnecessary code (like the vision processing and several debugging routines). We're now examining our PID loops to make sure they are running at reasonable rates (perhaps 10 Hz instead of the max the CPU can support). We're also checking all conditionals and wait states,since while we force the cRio to wait it can't do ANYthing... Perhaps a careful review of your code may be of value. Last edited by DonRotolo : 13-03-2011 at 15:17. |
#3
Re: 706 had the curse of all curses
Probably the biggest surprise of the event, I thought, was how much 706 struggled. After a very strong finalist performance last year, and what looked like another strong robot, I expected you guys to be one of the best robots at the event. Hopefully you get this all figured out and are ready to be a top team again next year.
#4
Re: 706 had the curse of all curses
That's not strictly true, is it? Even if you are spinning in a loop polling the clock or some hardware or variable, other threads at the same priority get time-sliced, do they not? And any higher-priority tasks that become runnable (due to, say, some hardware event) will preempt. The OS can preempt if it has important work to do.
Better yet, don't spin in a polling loop. Use blocked waiting (or whatever it's called in your language). This releases the processor to go do other work until whatever resource or event you are waiting for becomes available.
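To make the distinction concrete, here is a rough VxWorks-flavored sketch of blocked waiting (not anyone's actual robot code; the dataReady semaphore and consumerTask() are invented for illustration). The task sleeps inside semTake() until some other task or interrupt handler gives the semaphore, instead of burning CPU polling a flag.

#include <vxWorks.h>
#include <semLib.h>

SEM_ID dataReady;   /* given (semGive) by whoever produces the event */

void initSync(void)
{
    /* binary semaphore, initially empty, FIFO queuing of waiting tasks */
    dataReady = semBCreate(SEM_Q_FIFO, SEM_EMPTY);
}

void consumerTask(void)
{
    for (;;)
    {
        semTake(dataReady, WAIT_FOREVER);   /* blocks here; CPU is free for other tasks */
        /* ... handle the event ... */
    }
}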
#5
Re: 706 had the curse of all curses
Depends on how you implement it. Good practice for a 500 ms "wait" is to look at the clock every so often, and when it has advanced that half second, you stop waiting. But it certainly is possible to have the entire system come to a grinding halt and literally wait.
Guess what we did?
#6
Re: 706 had the curse of all curses
Instead of sitting there waiting for 500 ms to elapse (which would indeed cause the rest of your TeleOp code to freeze... but not other threads at the same or higher priority), you proceed with the rest of the TeleOp code. Each pass through TeleOp you check to see if the 500 ms has elapsed and, if so, you execute the code in question. Is that what you meant?
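Something like this, perhaps: a hypothetical WPILib C++ sketch of that non-blocking pattern. The Timer usage is just one way to do it, and fireTheThing() is a made-up stand-in for whatever the delayed action is.

Timer delayTimer;

void OperatorControl(void)
{
    delayTimer.Start();
    while (IsOperatorControl())
    {
        // ... the rest of the normal teleop code runs every pass ...

        if (delayTimer.Get() >= 0.5)   // has the 500 ms elapsed?
        {
            fireTheThing();            // hypothetical: the action we were delaying
            delayTimer.Reset();        // start timing the next 500 ms if needed
        }

        Wait(0.005);                   // small yield so the loop doesn't hog the CPU
    }
}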
#7
Re: 706 had the curse of all curses
"That's not strictly true is it?"
My understanding is there are several tasks that run "timeshared" with the user program. The default teleopcontinuous() method provides a 1 millisecond task delay every cycle. If a team removes this task delay by defining a different teleopcontinuous() in an attempt to run their code faster, then they actually reduce overall performance, because the operating system ends up with fewer chances to run higher, lower, and same-priority tasks. The complexity and processing that some teams attempt to put in their code can overstress the cRIO. The library of canned code available to monitor the performance of the system is sort of non-existent.

Ten years ago, I did some time as a performance SW developer responsible for assuring that fiber optic telecommunications systems running VxWorks would always be able to switch within 25 ms of fiber cuts and other failures. Anyone who has spent time in performance software can tell you that time-logging critical events, measuring the CPU resource being used up, and monitoring the queues of messages between tasks across all the processes in the system is critical to understanding how close a system is operating to overload. I'm guessing that more experienced teams that don't actually log/monitor system performance have developed rules of thumb for themselves to know if they are in the safe zone of loading a system down without overloading it.

I'm toying with the idea of getting team 241 to create a library of performance and logging tools and then share it with the FIRST community. Then during practice a team could determine if it is coming close to using up scarce resources. And if a special problem occurs during a field play that never showed up during practice, then a file log of all the key parameters could be used to diagnose the "misbehavior". It would be a lot of work, but looking at perplexed teams at Regionals and reading here on CD, I perceive a serious need.
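As a sketch of the kind of instrumentation such a library might provide (this is hypothetical, not an existing tool): log a message whenever one pass of the teleop loop blows past a time budget you pick, say 20 ms.

Timer loopTimer;

void OperatorControl(void)
{
    loopTimer.Start();
    while (IsOperatorControl())
    {
        loopTimer.Reset();

        // ... all the normal teleop work ...

        double elapsed = loopTimer.Get();
        if (elapsed > 0.020)   // over the 20 ms budget: record the overrun
        {
            printf("teleop overrun: %.1f ms\n", elapsed * 1000.0);   // shows up on the console
        }
        Wait(0.005);
    }
}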
#8
Re: 706 had the curse of all curses
Are you saying that Java does not behave this way on the cRIO?
#9
Re: 706 had the curse of all curses
My understanding is that in a realtime operating system (such as VxWorks), a particular task runs until the system tick is complete or a system call is made that allows preemption (to allow a higher-priority task to run). Lower-priority tasks can run if the task gives up the processor (e.g., via the taskDelay() system call).
This is a function of the realtime OS and not of whether Java or C++ is the programming language.
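For example, here is a VxWorks-style sketch (not actual robot code; backgroundTask() is invented) of a task that explicitly gives up the processor once per pass so that other tasks, including lower-priority ones, get a chance to run:

#include <vxWorks.h>
#include <taskLib.h>

void backgroundTask(void)
{
    for (;;)
    {
        /* ... one bounded chunk of background work ... */

        taskDelay(1);   /* sleep for one system tick, yielding the CPU */
    }
}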
#10
Re: 706 had the curse of all curses
Quote:
#11
Re: 706 had the curse of all curses
Quote (from http://www.embeddedheaven.com/real-time-os-basics.htm):

Preemptive kernels: In preemptive systems, the kernel scheduler is called with a defined period, each tick. Each time it is called it checks if there is a ready-to-run thread which has a higher priority than the executing thread. If that is the case, the scheduler performs a context switch. This means that a thread can be preempted – forced to go from executing to ready state – at any point in the code, something that puts special demands on communication between threads and handling common resources.

From this, you can expect that a running lower-priority task will continue to run for up to a system tick even if other higher-priority tasks become ready to run. (That is one reason why you have to be careful about not wasting processing time even in low-priority tasks: it affects how often the higher-priority tasks get a chance to run, especially if those higher-priority tasks very often make system calls that produce wait states.)
#12
Re: 706 had the curse of all curses
A little more detail on the cRIO issue. I'm the software mentor for team 706, and I'm in the process of trying to debug the intermittent cRIO mini-resets, which I now understand to be a second watchdog within the cRIO. We created a new project with a teleop loop consisting of two lines, where shiftin is a solenoid output. When tele mode was enabled, within a couple of minutes the solenoid LED would intermittently turn off for about 0.25 sec.

void OperatorControl(void)
{
    while (IsOperatorControl())
    {
        shiftin.Set(true);
        Wait(0.01);
    }
}

There was no predictable rate at which it would go out. Sometimes it would run for minutes with no issues, and sometimes it would go out several times within 10-20 seconds. One thing that made things better was using a different Classmate, but the problem never went away. The first Classmate, which showed the problem most often, was running 23 processes with about 0.85% network usage and ~25% CPU usage. The second Classmate, which worked better, had 47 processes running with about the same network and CPU usage as the first. The software on the Classmate was uninstalled and re-installed, which did not make any difference. Many other things were tried. When we get the robot back I plan to analyze the network traffic between the cRIO and the driver station to get to the root cause of this issue.

-pete
#13
Re: 706 had the curse of all curses
Is it possible that you have a mix of 2010 or earlier code and some 2011 code? The user Watchdog is NOT enabled by default in 2011 code; the MotorSafety class is used instead. Refer to the WPI Robotics Library User’s Guide. If your initialization was from 2010 and enabled watchdogs, then the watchdog needs to be "fed" periodically in your OperatorControl loop. Look at an example from 2010: http://code.google.com/p/chopshop-16...6495b511d cce

From the "WPI Robotics Library User’s Guide": Note: The user watchdog timer is being replaced with the MotorSafety interface. MotorSafety is the preferred way to provide safe operation of motors on your robot. By default, the user watchdog is disabled (changed from 2010) and the MotorSafety timeouts are enabled by default on the RobotDrive object and not enabled by default on Victors and Jaguars.
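For reference, a rough sketch of the 2010-style pattern I'm describing (not your code, just an illustration, assuming a SimpleRobot-style C++ project where GetWatchdog() is available). If the user watchdog is enabled, it has to be fed every pass or the outputs get disabled:

void OperatorControl(void)
{
    GetWatchdog().SetEnabled(true);   // only if you actually want the user watchdog
    while (IsOperatorControl())
    {
        GetWatchdog().Feed();         // feed it every loop so outputs stay enabled
        // ... drive and arm code ...
        Wait(0.005);
    }
}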
#14
Re: 706 had the curse of all curses
Our code started with the LineFollower example supplied for 2011. In this example there is no teleop code; the two lines to turn on the solenoid and wait were added to the teleop loop. So I don't believe we used a mix of 2010 and 2011 code, unless something is wrong in the WindRiver environment. This was something we tried in an effort to troubleshoot this issue. All updates were performed.
I believe it has to be something in the network communications traffic between the Classmate and the cRIO, as things would change when a different Classmate was used.

-pete
#15
Re: 706 had the curse of all curses
Man, I feel your pain. We've had two regionals like that. They happened on two different years and they were enough to make me say, "That's it, I quit". I said it both times, in '03 and '04--needless to say I'm still here. The one thing I've learned from those times is that they were the most beneficial experiences for following years. I know that that is the last thing you want to hear--perfunctory motivational comments--but it's true.
We now ALWAYS threadlock screws, support BOTH sides of output shafts from motors and gearboxes, have two people complete a checklist prior to each match, etc. These and a myriad of other measures have kept Murphy [and his law] at bay. These are the seasons that make the wins taste so much better--and believe me, they will be delicious!!!!