Log in

View Full Version : Robot Restart and Reset Issues


Jared
01-03-2015, 20:28
We're having a very strange problem with our autonomous code and we really need some help fixing it. Our code is here:
https://github.com/dicarlo236/2015-Robot

Our autonomous code generates several arrays of data during the robotInit method. This procedure takes around 400-500 ms, and uses 10 MB of RAM. These arrays are then read by two threads to control the wheels of our robot as we enter autonomous mode. This process is very sensitive to timing, as we are actively controlling position, velocity, and acceleration of the two wheels independently.

If we power on the robot and run our autonomous mode, we drive too quickly and crash into things. The drive movement is jerky and uncontrolled in autonomous mode.

If we power on the robot, wait for the code to start, and push the reset button on the roboRIO, the autonomous code works very well and we pick up three totes.

If we power on the robot, wait for the code to start, and click the reboot roboRIO or restart robot code button on the driver station, autonomous mode works perfectly.

If we power on the robot, compile and download new software, it works perfectly too.

I have replicated this issue many times, and I am sure that the cold boot of the roboRIO is causing our problem.

This weekend, we missed doing the reset trick once during one of our elimination matches and autonomous mode went wrong.

Next time we get to robot, I'll try running our auto setup code in a static initialization block (we're using Java) in the main robot class, or having it initialize auto mode 5 seconds after the user code starts running, but I do not know if this will fix the problem.

We also experience high CPU usage and a bit of lag with the controls before we've reset the robot code. The FTA's at the event could not explain the problem, and kindly allowed us to restart the code and waited for us, but we aren't happy with this solution.

I'm fairly convinced it's a bug with the framework as there should be no difference between a cold boot and a reset button boot.

Any suggestions?

GeeTwo
01-03-2015, 21:35
The most likely thing to cause this sort of behavior is if some piece of hardware (probably a sensor) other than the 'RIO needs a few hundred milliseconds to initialize. Once this initialization is done, the 'RIO initialization works.

Have you tried inspecting your 10MB data block to see what is different between a successful and a jerky startup?

Also you indicated that
Our autonomous code generates several arrays of data during the robotInit method.
If this data is only used for autonomous, and you can spare the half second, try moving this array generation to the beginning of autonomousInit() rather than robotInit().

virtuald
01-03-2015, 22:48
If I were to guess at this problem without looking at your code, I would say it has something to do with how you're dealing with time. In particular, this sounds like a system that doesn't allow 'missed frames', and tries to catch up to where it wanted to be, so it keeps executing them until it's caught up.

After looking at it, I noticed that your Updater thing uses a Timer.scheduleAtFixedRate, which looks like it might have a behavior similar to what I'm describing (http://docs.oracle.com/javase/7/docs/api/java/util/Timer.html#scheduleAtFixedRate(java.util.TimerTask ,%20java.util.Date,%20long)).

"If an execution is delayed for any reason (such as garbage collection or other background activity), two or more executions will occur in rapid succession to "catch up." In the long run, the frequency of execution will be exactly the reciprocal of the specified period (assuming the system clock underlying Object.wait(long) is accurate). As a consequence of the above, if the scheduled first time is in the past, then any "missed" executions will be scheduled for immediate "catch up" execution."

That sounds exactly like what's happening.

virtuald
01-03-2015, 23:07
After thinking some more, one thing that ties into this, is that the RoboRIO doesn't have a battery holding the system's time. So when your RoboRIO starts up it thinks the time is... some time in the past. Maybe an hour, maybe a day, who knows.

However, when the DS connects to your roboRIO, the time is synchronized to the DS's time. So time all of a sudden goes from 0 to 1 billion something.

If your code started before the DS synchronized time, then the fixed time delay mechanism would kick in and try to 'catch up'.

There's a few ways to fix this. One way that pops into my head is only start your updater thing upon the robot being enabled for something. Other solutions involve basing the time delay on something that isn't tied to the robot's datetime... I'm sure there's a source somewhere.

MrRoboSteve
01-03-2015, 23:10
virtuald makes good points. You could log the actual execution time and easily see whether that is happening.

Has anyone documented the GC settings on the roboRIO? I've wondered, but we're a C++ team and a bit busy right now to go figure it out. -XX:MaxGCPauseMillis would be really useful.

I'd move the autonomous selection logic from disabledPeriodic to autonomousInit().

Can you do your profile generation earlier (e.g., via an instance initializer (http://www.artima.com/designtechniques/initialization5.html)) and see if that helps?

It looks like you know the size of the list of Elements wrapped by Profile in advance. Consider modifying Profile's constructor to receive an initial capacity, and use that to set the size of the ArrayList<Element>. This will help your perf.

Jared
02-03-2015, 07:34
The most likely thing to cause this sort of behavior is if some piece of hardware (probably a sensor) other than the 'RIO needs a few hundred milliseconds to initialize. Once this initialization is done, the 'RIO initialization works.

Have you tried inspecting your 10MB data block to see what is different between a successful and a jerky startup?

Also you indicated that

If this data is only used for autonomous, and you can spare the half second, try moving this array generation to the beginning of autonomousInit() rather than robotInit().

We can't spare the time, as the generation has occasionally taken 1000+ ms to generate, and we need that time to back into the auto zone and drop the stack. However, I think I'll try having the generation happen a few seconds after the controller boots up instead.

If I were to guess at this problem without looking at your code, I would say it has something to do with how you're dealing with time. In particular, this sounds like a system that doesn't allow 'missed frames', and tries to catch up to where it wanted to be, so it keeps executing them until it's caught up.

After looking at it, I noticed that your Updater thing uses a Timer.scheduleAtFixedRate, which looks like it might have a behavior similar to what I'm describing (http://docs.oracle.com/javase/7/docs/api/java/util/Timer.html#scheduleAtFixedRate(java.util.TimerTask ,%20java.util.Date,%20long)).

"If an execution is delayed for any reason (such as garbage collection or other background activity), two or more executions will occur in rapid succession to "catch up." In the long run, the frequency of execution will be exactly the reciprocal of the specified period (assuming the system clock underlying Object.wait(long) is accurate). As a consequence of the above, if the scheduled first time is in the past, then any "missed" executions will be scheduled for immediate "catch up" execution."

That sounds exactly like what's happening.

I agree that we're "missing/skipping frames". However, this happens even 20 minutes after robot startup. What is different about background tasks 20 minutes after a cold boot and 20 minutes after pressing the restart code or reset buttons?

After thinking some more, one thing that ties into this, is that the RoboRIO doesn't have a battery holding the system's time. So when your RoboRIO starts up it thinks the time is... some time in the past. Maybe an hour, maybe a day, who knows.

However, when the DS connects to your roboRIO, the time is synchronized to the DS's time. So time all of a sudden goes from 0 to 1 billion something.

If your code started before the DS synchronized time, then the fixed time delay mechanism would kick in and try to 'catch up'.

There's a few ways to fix this. One way that pops into my head is only start your updater thing upon the robot being enabled for something. Other solutions involve basing the time delay on something that isn't tied to the robot's datetime... I'm sure there's a source somewhere.

I think you're right. The other code that I used timer.scheduleAtFixedRate() for won't have any issues with running too fast, but the autonomous driving stuff relies on that timing being spot on. The way updater works now is it starts the timer tasks on robot startup, which is unnecessary. We could likely have the updater not call timer.scheduleAtFixedRate() until the first updatable is passed to it, which would be as autonomous mode starts.

virtuald makes good points. You could log the actual execution time and easily see whether that is happening.

Has anyone documented the GC settings on the roboRIO? I've wondered, but we're a C++ team and a bit busy right now to go figure it out. -XX:MaxGCPauseMillis would be really useful.

I'd move the autonomous selection logic from disabledPeriodic to autonomousInit().

Can you do your profile generation earlier (e.g., via an instance initializer (http://www.artima.com/designtechniques/initialization5.html)) and see if that helps?

It looks like you know the size of the list of Elements wrapped by Profile in advance. Consider modifying Profile's constructor to receive an initial capacity, and use that to set the size of the ArrayList<Element>. This will help your perf.

We could figure out the size of the arraylist right now, but if we needed to change the auto mode to avoid colliding with other robots, the length of the list would change. We don't have a spare roboRIO, and our current one is in the bag, so I won't have be able to test any of my changes for a few weeks.

Thanks to everybody for the great advice- I'm really hoping we can fix this.

Mark McLeod
02-03-2015, 09:33
Initial date/time when roboRIO is booted seems to be 2/17/1970, 12:40am.

When the Driver Station connects, the time on the roboRIO will get sync'ed to the DS.
So timestamps may vary accordingly if time is taken before/after the DS connection is established or files are written/created before/after the DS connects after a roboRIO boot

MrRoboSteve
02-03-2015, 11:07
We could figure out the size of the arraylist right now, but if we needed to change the auto mode to avoid colliding with other robots, the length of the list would change. We don't have a spare roboRIO, and our current one is in the bag, so I won't have be able to test any of my changes for a few weeks.


You don't need the exact size of the arraylist when you allocate, as it will be expanded even if you set an initial size. Your goal here is to reduce the number of times the underlying array is grown.

I think this (http://docjar.com/html/api/java/util/ArrayList.java.html) is the current source for ArrayList. Note that the size starts out at 10 and grows by size >> 1. Each time it grows, there's a fully copy of the array made. (see method grow()). In your scenario, I'd set it to an estimate of your maximum size plus a buffer (say 10%).

virtuald
02-03-2015, 12:02
I agree that we're "missing/skipping frames". However, this happens even 20 minutes after robot startup. What is different about background tasks 20 minutes after a cold boot and 20 minutes after pressing the restart code or reset buttons?

On a warm reboot the time won't get reset back to 1970. You indicated above that deploying new code also fixed the problem -- which also won't reset the time back to 1970.

As a workaround, if you hit the 'reset robot code' button on the driver station, that would be equivalent to a redeploy, but much faster than a warm reboot.

Better to find a solution that doesn't use the Timer object, or look at its source code and see if you can override it and provide a different non-datetime dependent time source to it.

Coach#3536
02-03-2015, 14:58
If you are using the smart dashboard and data logging it messes up the field control. We experienced it until the FTA and our team found the backlog dumping and timing out. We suggest that you turn off the logging when playing on the field. It doesn't show up in practice mode and wont show up with a quick connect after start up on field. but if match start is delayed logging compiles and hold until communication start and then floods the system until it overwhelms field control and drops connection. We Followed WPI documents to the letter and it is just the way it is.

Greg McKaskle
02-03-2015, 15:44
Jared, can you find two log files of the two states? One that is the correct procedure with the warm boot and another with the code start.

Post all three files, .dslog, .dsevent, and .pdplog. If it isn't easy to post to CD, PM me and I'll give you a place. I think they may help indicate what ran differently.

Greg McKaskle

Jared
04-03-2015, 15:53
Jared, can you find two log files of the two states? One that is the correct procedure with the warm boot and another with the code start.

Post all three files, .dslog, .dsevent, and .pdplog. If it isn't easy to post to CD, PM me and I'll give you a place. I think they may help indicate what ran differently.

Greg McKaskle

Currently, I've only got the log files I've saved to my home computer.

From those, I've found one sequence of events that may help.

The first log file (2015_03_01 13_06_11 Sun) shows the robot connecting while tethered (cold boot), and enabled in teleop for a few seconds to adjust the elevator. There was a significant amount of control lag here, and if we had run autonomous, we would have had a problem. The dsevents file shows that I requested robot code restart at the end.

The next log file is when the code restarts and connects while I was tethered and in line, but is saved at the school computer.

Finally, I have the connection to the field (warm boot), where we play our match and things run well. This is the 2015_03_01 13_23_59 Sun file.

It makes sense that the CPU usage is high, as the controller is attempting to run the PID and drive controllers as fast as possible to catch up.

https://drive.google.com/folderview?id=0BzfxWVZUp3DNfnhUMUphTzM0bldPLTg0ejV DdThPaGhzMWk0aWRVYzBxRElmNlJPcmRwM1U&usp=sharing

ozrien
04-03-2015, 23:43
So when you power boot (like in the *_06_11_Sun capture), does it hold CPU +90 % indefinitely? Or does it eventually settle? It sounds like you watched it for 20min. Does it do this everytime? (I can't seem to reproduce it).

Robot setup looks like a bunch of PWMs, a few encoders, PDP on CAN bus (no other CAN devices), and four joysticks (Logitech Attack 3,Logitech Attack 3,Gamepad F310 (Controller),Logitech Dual Action) on the DS right?

ozrien
04-03-2015, 23:51
Does this happen when tethered as well? Usb or Eth?

ozrien
05-03-2015, 00:05
I tried to reproduce what's in your capture, deplyed, power boot and cycled teleOpEn and teleOpDis. Screenshot attached. The +90CPU seems to drop off quickly, maybe I'm missing something.

Jared
05-03-2015, 09:00
So when you power boot (like in the *_06_11_Sun capture), does it hold CPU +90 % indefinitely? Or does it eventually settle? It sounds like you watched it for 20min. Does it do this everytime? (I can't seem to reproduce it).

Robot setup looks like a bunch of PWMs, a few encoders, PDP on CAN bus (no other CAN devices), and four joysticks (Logitech Attack 3,Logitech Attack 3,Gamepad F310 (Controller),Logitech Dual Action) on the DS right?

The high CPU usage happens only sometimes. When it does happen, I have never seen it go down, even after waiting 20 minutes. After reimaging, the high CPU usage went away, but it has since come back. The screenshot you've posted looks like us right after we've reimaged.

Even without the high CPU usage, the robot still has jerky movement.

You're correct about the encoders, PWMs, and the PDP, but we've only got three joysticks.

Does this happen when tethered as well? Usb or Eth?

This happens when connected to the field, wirelessly connected at home, or ethernet tethered at home. I've never tried it with USB.

I tried to reproduce what's in your capture, deplyed, power boot and cycled teleOpEn and teleOpDis. Screenshot attached. The +90CPU seems to drop off quickly, maybe I'm missing something.

This is what will likely happen to us if we reimage and try again.

Kevin Sevcik
05-03-2015, 09:45
Have we confirmed that the Timer.scheduleAtFixedRate() is the problem and it's running multiple events to catch up from 1970 to 2015?

If not, the easiest thing to do is add a loop counter and watch it in SmartDashboard. Just create a static unsigned int or long, initialize it to 0, and increment it every time your Timer.scheduledAtFixedRate event is executed. That counter and (maybe) a stopwatch should make it obvious if the issue is the scheduleAtFixedRate thing or not.

If we HAVE confirmed that, then we should be talking about fixes and mitigation options, yes?