[DFTF] We eat what we CAN, and what we can't, we CAN...


Joe Johnson
22-11-2011, 08:56
This is part of a series of posts called Drinking From The Firehose (http://www.chiefdelphi.com/forums/showthread.php?threadid=97430) on getting Dr Joe back up to speed on All Things FIRST.

In particular, I just posted a question on Jags v. Vics (http://www.chiefdelphi.com/forums/showthread.php?p=1086140#post1086140)

Today's topic:
We eat what we CAN, and what we can't, we CAN...

Yeah, I know the subject makes no sense, but there is that old joke I love about the farmer's answer to his city relative's question, "What do you do with all those tomatoes?"

The REAL topic is Jags and CAN.

Here is what I have heard from sources usually deemed reliable:

First, one danger of using CAN: there are some cases where a CAN message blocks (i.e., halts) code execution on the cRIO. Threads can help contain the impact, but they don't fully solve the problem. There are a bunch of smart folks working to fix this, but the end result is still uncertain.

Can anyone comment on this?

Second, if you get more than 3 Jaguars on the CAN bus, you reach the point where you can update the speed controllers faster via PWM. From what I hear, you can send out all PWM signals at a rate of 200 Hz (5 msec between updates). Note this is faster than your joystick data arrives, which I understand is at 50 Hz (20 msec).

On this topic, I have mixed feelings. Yes, there are times (in some special cases during autonomous mode, implementing a fast PID loop where sample times are critical, etc.) where 5 msec updates are an important feature, but I can't see this being very often.

AND, if I am going to implement CAN, I am going to use PID on all but the simplest motor applications, in which case there is certainly no reason to update the Jaguars 200 times per second.

Does anyone have data on what the PID update rate is?

On another topic, I have a notion that I want my drive motors in PID mode, closing the loop not on position alone (via a pot input or via the encoders) but the way motor control gurus have done it for years: a fast loop based on velocity (in the Jaguar, hopefully) and a slow loop based on position.

There are several problems with this idea, the first of which is that the Jaguars do not support this mode.

I think it is worthy of a new thread...

All for now.

Joe J.

Al Skierkiewicz
22-11-2011, 10:11
Joe,
I can't speak to some of your post but the confusion may come from the need to control motor function via field controls. For this reason, CAN implementation requires some interaction so that E-stop, match start and stop, and automode commands take precedence over anything that might be commanded via the CAN bus. For safety, all robots must be able to be stopped when the field controller or the personnel running the event (FTA) deem it necessary.

Jared Russell
22-11-2011, 10:56
To echo Joe's sentiments, I have been wondering for a while exactly what the PID implementation within the Jag looks like. Not just update rate, but also the form of their gains (there are dozens of perfectly valid ways of writing the PID equation, but they all do different things to the gains) as well as what anti-windup techniques and derivative term filtering techniques (if any) are in place. Does anyone have a document that provides this information?
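
To illustrate why the form of the equation matters for the gains, here is a minimal sketch converting between two common textbook forms: the "parallel" form (Kp, Ki, Kd) and the "standard" form (Kc, Ti, Td). Which form the Jaguar firmware actually uses is not stated in this thread, so the function names and the choice of forms are assumptions for illustration only.

```python
# Two textbook PID forms (which one the Jaguar uses is NOT established here):
#   parallel:  u = Kp*e + Ki*integral(e) + Kd*de/dt
#   standard:  u = Kc*(e + (1/Ti)*integral(e) + Td*de/dt)

def standard_to_parallel(kc, ti, td):
    """Convert standard-form gains (Kc, Ti, Td) to parallel-form (Kp, Ki, Kd)."""
    return kc, kc / ti, kc * td


def parallel_to_standard(kp, ki, kd):
    """Convert parallel-form gains (Kp, Ki, Kd) to standard-form (Kc, Ti, Td)."""
    return kp, kp / ki, kd / kp


if __name__ == "__main__":
    print(standard_to_parallel(2.0, 4.0, 0.5))
```

The same three numbers typed into a controller that expects the other form give a different controller, which is why the documentation question matters.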

Ether
22-11-2011, 11:09
Does anyone have data on what the PID update rate is?

I believe the Jag's PID update rate is 1000Hz.

I'll dig up a source.

Alan Anderson
22-11-2011, 11:38
To echo Joe's sentiments, I have been wondering for a while exactly what the PID implementation within the Jag looks like...Does anyone have a document that provides this information?

The Jaguar source code (minus FRC-specific safety features) should be available for download (http://www.ti.com/tool/rdk-bdc24) from TI.

Hugh Meyer
22-11-2011, 12:17
Joe,

Your summary of the “Jag” situation is dead on. This is based on my experience from using CAN since FIRST allowed us to use it.

One aspect that you didn’t mention is the interface to the CRIO. There are two primary interfaces, the serial port on the CRIO which uses a serial converter within the jaguar to convert to CAN and the 2CAN device which is a bridge between Ethernet and CAN.

When we started out using CAN I said we would try the serial interface, mainly to save money. We like to log lots of data, so we were attempting to transfer lots of data both directions using 8 Jaguars. It wasn’t very long until we discovered that the system couldn’t keep up. Switching to the 2CAN increased throughput by at least a factor of 10. Even with the 2CAN we could still take longer to transfer what we wanted than our periodic loop in the main code would support.

As a workaround we changed our periodic loop to 100 msec and multiplexed the logging data. As I recall we divided the requests for data from the Jaguars into two groups. On one pass through our main loop we request data from half of the Jaguars, and on the next pass we get the other half. This seems to work well enough. But we must keep a close eye on the delays to be sure the code run time does not exceed the loop period.
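
The multiplexing described above can be sketched roughly as follows. This is Python for illustration, not cRIO code; the function name, the ID list, and the interleaved split are assumptions (the original split may have been first half / second half).

```python
def jaguars_to_poll(pass_index, jaguar_ids, groups=2):
    """Return the Jaguar IDs whose status is requested on this loop pass.

    Each pass asks only 1/groups of the controllers for data, so every
    controller is polled once every `groups` passes.
    """
    return jaguar_ids[pass_index % groups::groups]


if __name__ == "__main__":
    ids = [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical CAN IDs for 8 Jaguars
    for loop_pass in range(4):
        print(loop_pass, jaguars_to_poll(loop_pass, ids))
```

With a 100 msec loop and two groups, each Jaguar's logged data would refresh every 200 msec.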

We monitor the code run time by setting a digital output pin high at the start of our loop, setting it low at the end of the loop and a second pin is a toggle at the start of the loop. We monitor these signals with an oscilloscope to be sure the timing is what we expect.
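
A software analogue of the scope measurement, for anyone without an oscilloscope: record a timestamp at the start and end of each loop pass, then compute the same two quantities the pins show. This is a minimal sketch with hypothetical names and made-up timestamps, not the code used on the robot.

```python
def analyze_loop(starts, ends, period):
    """Return (run_times, measured_periods, overrun_passes).

    starts/ends are per-pass timestamps in the same units as `period`.
    run_times corresponds to the high time of the first pin, and
    measured_periods to the spacing of the toggle edges on the second pin.
    """
    run_times = [e - s for s, e in zip(starts, ends)]
    measured_periods = [b - a for a, b in zip(starts, starts[1:])]
    overrun_passes = [i for i, rt in enumerate(run_times) if rt > period]
    return run_times, measured_periods, overrun_passes


if __name__ == "__main__":
    # Made-up timestamps in msec for a nominal 100 msec loop.
    starts = [0, 100, 200, 330]
    ends = [40, 190, 330, 380]
    print(analyze_loop(starts, ends, 100))
```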

The code does indeed block and waits for the communication to complete each cycle. This wastes most of the time just waiting. Based on our testing it always does this.

I think a simpler system where commands are sent without waiting for an ACK would be fine. This is done all the time with DMX control systems in the theatrical lighting industry and it works fine. That way we could send out data in sequence to all the Jaguars, then by the time we did that, the first one would be ready to respond again. I am hoping we see a change in this direction, but that is beyond our control.

This year our team is discussing using the Jaguar PID to close loop on speed, then use a slow loop around that for position with the CRIO. I think that would work well. Another approach we may try this year is to go back to PWM and make a separate thread for each loop in the CRIO. I really like the CAN and want to stick with it. The data we get back from the Jaguars is priceless to knowing what is going on.
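
The fast speed loop inside a slow position loop can be sketched as a small simulation. Everything numeric here is an assumption chosen for illustration (a first-order motor model, a 1 msec inner loop, a 20 msec outer loop, the gains, the speed limit); none of it comes from the Jaguar firmware or from measurements on a real robot.

```python
def simulate_cascade(target_pos, sim_time_s=3.0):
    """Simulate a slow position loop wrapped around a fast velocity loop."""
    dt = 0.001              # inner (velocity) loop period, assumed 1 msec
    outer_every = 20        # outer (position) loop runs every 20 inner ticks
    tau = 0.05              # assumed motor time constant, seconds
    kp_pos = 5.0            # outer loop: proportional only
    kp_vel, ki_vel = 2.0, 40.0   # inner loop: PI
    v_max = 2.0             # clamp on the commanded speed

    pos = vel = integral = v_set = 0.0
    for tick in range(round(sim_time_s / dt)):
        if tick % outer_every == 0:
            # Slow loop: position error becomes a clamped speed setpoint.
            v_set = max(-v_max, min(v_max, kp_pos * (target_pos - pos)))
        # Fast loop: PI on speed error produces the motor command.
        err_v = v_set - vel
        integral += ki_vel * err_v * dt
        u = kp_vel * err_v + integral
        # First-order motor model, integrated with Euler steps.
        vel += dt * (u - vel) / tau
        pos += vel * dt
    return pos


if __name__ == "__main__":
    print(simulate_cascade(1.0))
```

In the arrangement discussed in this post, the inner loop would live in the Jaguar and only the outer loop would run on the cRIO, so the slow CAN update rate would limit the position loop but not the speed loop.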

As Alan mentioned, the source code for the Jaguars is the way to find out what the PID implementation is doing.

I hope this has been clear. If not I will gladly expand if you have specific questions.

-Hugh

Ether
22-11-2011, 14:25
I hope this has been clear. If not I will gladly expand if you have specific questions.

I have a couple questions Hugh:

We monitor the code run time by setting a digital output pin high at the start of our loop, setting it low at the end of the loop and a second pin is a toggle at the start of the loop. We monitor these signals with an oscilloscope to be sure the timing is what we expect.

The set hi/low is a great way to observe the timing. What's the toggle for? To test for re-entrancy?


The code does indeed block and waits for the communication to complete each cycle. This wastes most of the time just waiting. Based on our testing it always does this.

You said block waiting. Did you mean busy waiting?

Hugh Meyer
22-11-2011, 15:35
Ether,

The toggle is to measure the periodic rate and trigger the scope. The Hi/Low measures our total code run time.

We call a function to send data. It does not return until the data is sent and the ACK comes back from the CAN bus. I call that blocking because the thread is stopped from running while waiting for the function to return.

-Hugh

Ether
22-11-2011, 16:30
The toggle is to measure the periodic rate and trigger the scope. The Hi/Low measures our total code run time.

Bear with me? I'm still not clear on this. You said "We monitor the code run time by setting a digital output pin high at the start of our loop, setting it low at the end of the loop and a second pin is a toggle at the start of the loop." (emphasis mine).

It sounds like you are setting one pin high "at the start of our loop" and you are toggling a second pin at the start of the same loop. Am I reading that correctly? If so, couldn't you trigger the scope on the rising edge of pin1, measure the periodic rate as the distance between pin1 rising edges, and measure the loop run time as the distance between pin1 rising and falling edges?


We call a function to send data. It does not return until the data is sent and the ACK comes back from the CAN bus. I call that blocking because the thread is stopped from running while waiting for the function to return.

In the blocked waiting state, the thread releases the CPU so other threads can run. So if it's blocked waiting, would it be possible to put the CAN data gathering in a separate 200ms thread so you wouldn't have to multiplex it?
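
The separate-thread idea could look something like the sketch below. It is written in Python purely to show the structure; `read_blocking` stands in for whatever call blocks on the CAN bus, and the class name and 200 msec default are hypothetical. The point is that only the worker thread waits on the bus, while the main loop reads the most recent snapshot without blocking on CAN.

```python
import threading
import time


class BackgroundPoller:
    """Run a blocking read in its own thread and keep the latest result."""

    def __init__(self, read_blocking, period_s=0.2):
        self._read = read_blocking
        self._period = period_s
        self._lock = threading.Lock()
        self._latest = None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def latest(self):
        """Return the most recent reading (None until the first one arrives)."""
        with self._lock:
            return self._latest

    def _run(self):
        while not self._stop.is_set():
            value = self._read()           # may block; only this thread waits
            with self._lock:
                self._latest = value
            self._stop.wait(self._period)  # sleep, but wake early on stop()


if __name__ == "__main__":
    def fake_can_read():
        time.sleep(0.01)                   # stand-in for the bus round trip
        return {"current_amps": 1.5}

    poller = BackgroundPoller(fake_can_read, period_s=0.05)
    poller.start()
    time.sleep(0.2)
    print(poller.latest())
    poller.stop()
```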

Hugh Meyer
23-11-2011, 10:29
It sounds like you are setting one pin high "at the start of our loop" and you are toggling a second pin at the start of the same loop. Am I reading that correctly? If so, couldn't you trigger the scope on the rising edge of pin1, measure the periodic rate as the distance between pin1 rising edges, and measure the loop run time as the distance between pin1 rising and falling edges?


Yes, you are correct. That is what we are doing. Yes, the periodic rate could be measured that way when all is well and working correctly. We did this 3 years ago when this was all very new. Our periodic rate was set much faster and we were attempting to transfer much more data, so our code run time was longer than the loop period. We were in troubleshooting mode. When we finally realized the code was taking longer than the period, things started to make sense. I just like the redundancy of measuring them independently, so we still do it that way. Seeing both signals on the scope is a good visual confidence monitor.


In the blocked waiting state, the thread releases the CPU so other threads can run. So if it's blocked waiting, would it be possible to put the CAN data gathering in a separate 200ms thread so you wouldn't have to multiplex it?


We are planning to try a scheme like that this year.

-Hugh

Joe Johnson
23-11-2011, 15:56
<snip> 2CAN device which is a bridge between Ethernet and CAN.

<snip> try the serial interface, mainly to save money. <snip> we discovered that the system couldn’t keep up. Switching to the 2CAN increased throughput by at least a factor of 10. Even with the 2CAN we could still take longer to transfer what we wanted than our periodic loop in the main code would support.

As a work around we changed our periodic loop to 100 msec and multiplexed the logging data. As I recall we divided up the requests for data from the Jaguars into 2. One pass through our main loop we request data from half of the Jaguars, and the next time through we get the other half.

<snip>
-Hugh

This is from a few days ago, but now I finally have enough clarity of thought to ask questions about it.

2CAN:
$200 from AndyMark. I am not saying it is overpriced, but it is a lot of money. Is this the device of choice? Are there other possible choices?

100msec loop time:
Is that your MAIN loop time? 10 Hz seems pretty slow to me. At 10 fps the robot covers 1 ft per loop. That seems a bit too slow for my blood. I suppose that if you had your Jaguars all in feedback mode, the robot behaved fine, but what did your drivers think? I think it may be okay because my main motivation in this is not to make high speed better behaved but to make fine motions at low speeds better behaved.
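
The 1 ft per loop figure is just speed times loop period. A tiny sketch (hypothetical function name) makes it easy to compare against the other rates mentioned in this thread:

```python
def travel_per_loop_ft(speed_fps, loop_period_ms):
    """Distance covered between control updates, in feet."""
    return speed_fps * loop_period_ms / 1000


if __name__ == "__main__":
    for period_ms in (100, 20, 5):   # 10 Hz loop, 50 Hz joystick, 200 Hz PWM
        print(period_ms, "msec:", travel_per_loop_ft(10, period_ms), "ft")
```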

I may even consider modifying the gains of my PID loop based on the robot configuration -- the robot in a compact position may need different gains from the robot with its arm reaching over some field obstacle -- it would be great if the driver didn't see a difference in behavior.

Data logging:
What sort of data were you logging and where were you logging it?

This part of the 2CAN documentation seems interesting:
The 2CAN also provides a web host dashboard that reports
information on all of the Jaguars on the CAN bus. Information
such as voltage, current, temperature, faults, etc… are displayed
in an easy to read webpage accessible to any laptop connected to
the Ethernet bus. This allows teams to perform quick diagnostics
on their motor controllers and also debug hardware issues

If this is true, then I think that the logging could all be done back on the Win7 PC via some Java code that scrapes the data off the 2CAN's web host dashboard.

Did you try this?

I don't know if it is true, but I have heard that there is some magic (magic as in fast and easy to implement) method to get data between the cRIO and the Win7 PC. If so, then there is yet another twist in the plot to consider.

More questions but my message is too long already.

Joe J.

Hugh Meyer
25-11-2011, 12:11
Joe,

2CAN:
Last year the 2CAN and serial port were the two options. NI must support them. This year may be different.

Loop Time:
100 msec is fine. We would like faster, but it is a trade off between CAN communication time and periodic rate. Our drivers couldn't perceive much difference between 100 msec and 50 msec. Fastest Windoze can do is 35 msec anyway. Anything faster is just an illusion.

Data Logging:
We log just about everything we can think of. Attached is a typical log file from a match during autonomous mode. We send data back in the user packet. It is 934 bytes (or something like this), which allows for lots of data. We have yet to find that to be a limit. This data shows up in the dashboard. That lets us view it on the dashboard in real time and write it to disk for later review.

2CAN Web Page:
The 2CAN will show much data if you hit it with a web browser. We could not do that and control the Jaguars from the CRIO at the same time. It is one or the other. Also, the configuration needed to connect the web interface to the 2CAN is not a legal mode, since you must have a path directly from the browser to the 2CAN. Normally the 2CAN must be connected to the CRIO for normal robot operation. This year may be different, since the 2CAN will be connected to the switch.

PID with two CIMs:
Last year I made a significant effort to make this work by connecting one encoder back to two Jaguars. We solved the ground loop issues with optical isolators. Every once in a while, for some reason, we would see the Jags driving in opposite directions. As best I could tell it was from the backlash in the gears. I used lower line count encoder wheels to reduce the ability of the system to see the backlash, which helped, but it didn't solve the problem. This caused us problems at Purdue, so we abandoned the effort and changed to voltage mode. The drivers said it was much more responsive when it worked. There are several threads where I talk about what we did last year with much input from others. It seems the only way to do this is to make the loop in the CRIO and just drive the two Jags with PWM.

Fast Reversals on Jags:
This has been mentioned several times. During testing last year I discovered that the jags would reset when a fast reversal was attempted. Connecting a larger capacitor on the power supply seems to help this problem. We did this by installing it on the encoder pins in the jag. If you review the schematic you will see there is not much filtering on the 3.3 volts. Again, there are threads from last year where this is discussed.

-Hugh

Ether
25-11-2011, 12:26
Fastest Windoze can do is 35 msec anyway. Anything faster is just an illusion.

I'm skeptical about that Hugh. Where did you get that info?

Greg McKaskle
25-11-2011, 14:38
The link below shows actual DS timings as measured by Marshal Horn. I believe the statement about Windows timing refers to the typical 55 ms resolution, which is a holdover from DOS and IBM-compatible computers. This is improved by using the multimedia timers and setting them to a better resolution. LV has been doing this since '95 or so.

Greg Mckaskle

Link (http://www.chiefdelphi.com/forums/showpost.php?p=988209&postcount=14)

Hugh Meyer
25-11-2011, 15:32
Ether and Greg,

I thought this statement might be questioned… :-)

Thanks Greg for the link. I would be interested to know more about how Marshal measured these. I am surprised to see so many points down in the 20 msec range, but notice there are several points in the 60 msec range, which is more where I would have expected. I would argue that the computer does in fact have a significant impact on performance.

Greg, if we are running C++ is it still using the multi media timers? Which ones are being used? Can you tell us the exact Windows system calls being used?

Below are some additional thoughts on this issue.

Windows is not a real time OS, so nothing is guaranteed in the time domain. Context switching is the heart of the issue. Windows has all kinds of processes running that do all sorts of functions. In the case of the FRC robot code getting the joystick data and dumping it out the network port is just another process that runs when it can.

Ultimately the fastest it can go depends on the number of processes running, the CPU speed, and the number of CPUs present. It is also related to “quantums” as defined in the windows world and what that is set to. It varies for different versions of windows.

Generally speaking the kb 259025 article refers to a quantum as 10 or 15 msec. Threads get 3 or 9 quantums. So that puts the thread run time up in the 30 - 135 msec range.
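
The 30 - 135 msec range above is simply every combination of the two quantum lengths and the two quantum counts quoted from the article. As a quick check (the constant and function names are hypothetical):

```python
QUANTUM_MS = (10, 15)          # quantum lengths quoted above
QUANTUMS_PER_THREAD = (3, 9)   # quantums a thread may get, as quoted above


def thread_run_time_range_ms():
    """Return (shortest, longest) thread run time implied by those figures."""
    run_times = [q * n for q in QUANTUM_MS for n in QUANTUMS_PER_THREAD]
    return min(run_times), max(run_times)


if __name__ == "__main__":
    print(thread_run_time_range_ms())
```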

A glance at task manager will show many processes. While not all of them are ready to run, it doesn’t take many to get the time between “runs” to be longer than 50 msec.

So while a thread is running the service time could be shorter, when a context switch occurs the process is starved while every other process gets some time. During this pause the joystick data is just on hold, not really moving out to the network.

Another way to think about this is to imagine a snippet of software code that generates a periodic wave out a port. All will be fine, when the process is running, but as soon as a context switch occurs, the signal rate will stall until the process is switched back to run.

Here are some links with more information:

Context Switches
http://msdn.microsoft.com/en-us/library/windows/desktop/ms682105(v=vs.85).aspx
http://technet.microsoft.com/en-us/library/cc938613.aspx

http://support.microsoft.com/kb/259025
Quantums 10 or 15 msec divided by all the processes running would not support very high resolution.

http://stackoverflow.com/questions/2875178/how-long-is-the-time-frame-between-context-switches-on-windows
How long is the time frame between context switches on Windows?

-Hugh

Greg McKaskle
25-11-2011, 18:10
This link gives some info on the timers. There are other options for reading higher precision clocks, but for message timers, this is a pretty good way to go for Windows. Note that the API doesn't guarantee that it will provide one ms resolution, but I've never seen a computer that had a value other than one.

http://msdn.microsoft.com/en-us/library/windows/desktop/dd743609(v=vs.85).aspx


Marshal was measuring the time at the robot. His first post has attached code, along with a much more interesting dataset from an earlier version of the DS when more UI stuff was causing overhead and lots of jitter.

Will C++ programs have the same resolution? It depends on how they are written. The LV editor and runtime are C/C++, so they call the multimedia API and increase the timer resolution. If they didn't do that, I think the modern Win32 event stuff would have a resolution of more like 16ms. I can't remember for sure what that number is, but the easiest way to measure it is to write a loop that measures the current time, sleeps for a small amount, and checks the time on wakeup. You can then compare the requested sleep amount to the actual time the thread slept.
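
The measurement loop described above (read the time, sleep a small amount, read the time on wakeup) can be sketched in a few lines. This is a Python illustration rather than LabVIEW or C++, the function name is hypothetical, and the numbers it prints depend entirely on the OS, the timer resolution in effect, and the system load.

```python
import time


def measure_sleep(requested_s, samples=20):
    """Return the actual elapsed time, in seconds, for each requested sleep."""
    actual = []
    for _ in range(samples):
        t0 = time.perf_counter()
        time.sleep(requested_s)
        actual.append(time.perf_counter() - t0)
    return actual


if __name__ == "__main__":
    requested = 0.001   # ask for 1 msec
    results = measure_sleep(requested)
    print("requested %.1f ms" % (requested * 1000))
    print("min %.2f ms, max %.2f ms" % (min(results) * 1000, max(results) * 1000))
```

Comparing the requested sleep with the minimum and maximum actually observed gives a rough feel for the timer resolution and the jitter on a given machine.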

I've attached images of some similar tests, each run on my VM'd windows 7 that is actually running on my mac. The top chart is a regular while loop with a sleep ms multiple, which attempts to preserve phase, but doesn't try to catch up. The bottom chart shows the same code, but with a timed loop, also set to discard misses, but retain phase. The timed loop in LV implements its own scheduler based on a high priority thread. It improves things a bit on windows, but really shows its stuff on RT OSes. Note that the first point is typically low because the loop is aligning to the clock phase. The tests were also run with standard priority.

As for the other articles you attached, I think that info is most appropriate for threads that do not yield due to I/O, that is, when you have several computational threads maxing out the CPU(s) while the timing threads are also running.

Due to technical difficulties with the attachments button, I'll attach loaded images in a followup post.

Greg McKaskle

Greg McKaskle
25-11-2011, 18:31
This is continued because I couldn't attach these images to the previous post.

These plots show what happens to the timing on Windows when two CPU maxed threads are running in parallel with the time/sleep loop. The top chart is once again timed using the ms multiple icon. The lower chart is the timed loop.

The second image is the same test, but with the time/sleep loops set to time critical priority while the CPU intensive tasks are normal priority.

The moral of the story?
Windows is not a realtime OS, but if you use the APIs well, you can get some pretty good timing from it. LV, in my opinion, makes it quite easy to play with priorities and timings and get a sense of how your system is running. You can certainly do the same with C++ if you are good with thread APIs.

Please note that all of this data was measured on Windows, and VxWorks should typically have even less jitter, but your mileage may vary depending on how you use it.

To give details about the DS and the joysticks: the DS has lots of stuff going on internally, and the only high priority element is the control loop that sends the joystick and control data to the robot. Overall, the CPU usage for the app is quite low, and if you don't have lots of other stuff running, you should have pretty good response. The DS has graphs offscreen that I can use to measure its control loop rate under different conditions. Note that FIRST recommends, and I recommend, that you go to the trouble of making a kiosk account for the Driver. This is really nothing more than changing the shell of that account to be the driver station. Since Windows Explorer is not even launched, quite a few processes and hooks are inactive. This is commonly done when LabVIEW and Windows are used for industrial monitoring apps or manufacturing test apps, and while it also will not make Windows an RT OS, it does help. It also helps to lock down test machines so that people don't play Solitaire or browse the web when they are supposed to be testing your electronics product. And when this is not good enough, or the PC is used for control, that is when it is important to move to an RT OS such as the one running on the robot.

Greg McKaskle

Mike Copioli
02-12-2011, 10:37
Joe,

2CAN:
Last year the 2CAN and serial port were the two options. NI must support them. This year may be different.

The 2CAN plugin is supported by Cross the Road Electronics, not NI. It is legal for any team to create their own Ethernet-to-CAN gateway. The API is the same for both serial and Ethernet; the only difference is the plugin.



2CAN Web Page:
The 2CAN will show much data if you hit it with a web browser. We could not do that and control the Jaguars from the CRIO at the same time. It is one or the other.

This should not be the case; we update 10 Jags at a 5 ms rate and view the web dash without issue. The web dash transactions should be the only thing being delayed, if at all, since they have a lower priority than the Set transactions.


Also, the configuration to get to connect the web interface to the 2CAN is not a legal mode, since you must have a path directly from the browser to the 2CAN. Normally the 2CAN must be connected to the CRIO for normal robot operation. This year may be different, since the 2CAN will be connected to the switch.

I see no rule that states this configuration is illegal as long as the "commands" are not altered.

<R59> If CAN-bus communications are used, the CAN-bus must be connected to the cRIO-FRC
through either the Ethernet network connected to Port 1, Port 2, or the DB-9 RS-232 port
connection.
A. Ethernet-to-CAN bridges or RS-232-to-CAN bridges (including the Jaguars, MDLBDC24)
may be used to connect the CAN-bus to the cRIO-FRC.
B. Additional switches, sensor modules, custom circuits, third-party modules, etc. may also be
placed on the CAN-bus.
C. No device that interferes with, alters, or blocks communications between the cRIO-FRC and
the Jaguars will be permitted (tunneling packets for the purposes of passing them through
an Ethernet-to-CAN bridge is acceptable as the commands are not altered).

Ether
02-12-2011, 20:16
Another way to think about this is to imagine a snippet of software code that generates a periodic wave out a port. All will be fine, when the process is running, but as soon as a context switch occurs, the signal rate will stall until the process is switched back to run.

Hugh,

I wrote a small 32-bit console app to do exactly that.

http://www.chiefdelphi.com/media/papers/2595

It toggles the RTS pin of the COM1 (RS232) serial port at 50Hz 75% duty cycle.

With a digital storage oscilloscope, you can see how Windows affects the signal.
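
For reference when reading the scope, the ideal timing for a 50Hz, 75% duty cycle signal works out as below. The function name is hypothetical, and which RTS level counts as the 75% portion is an assumption not stated in the post.

```python
def toggle_times_ms(freq_hz, duty):
    """Return (period, asserted time, released time) in msec."""
    period_ms = 1000.0 / freq_hz
    asserted_ms = period_ms * duty
    return period_ms, asserted_ms, period_ms - asserted_ms


if __name__ == "__main__":
    print(toggle_times_ms(50, 0.75))
```

Any stretching of those intervals on the scope trace would be the effect of Windows scheduling on the app.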

For those without an oscilloscope who just want to play with it for fun, there's also a 1Hz version; with a cheap analog voltmeter you can see the RTS changing.

Have fun.

Greg McKaskle
02-12-2011, 22:46
Since I don't happen to have a scope, or a PC with a serial port for that matter, does anyone have a screen to share, or a description of the effect?

Greg McKaskle

Ether
04-12-2011, 01:03
Since I don't happen to have a scope, or a PC with a serial port for that matter, does anyone have a screen to share, or a description of the effect?

Started new thread here:

http://www.chiefdelphi.com/forums/showthread.php?t=98607