FIRST: take advantage of your mentors' expertise

We all know we had a huge problem at Einstein on Sunday. But when the going gets tough, the tough get going, and it's time to get going. I want to point out that there is a huge pool of mentor expertise that happens to have no robots to build for a while. FIRST is highly resource-bound, and it should take advantage of that pool to build a much richer set of across-the-board field and robot diagnostic tools over the next nine months.

My experience is that a rugged design can never have too much of the following:

  • redundancy
  • significant spare performance headroom (bandwidth, network, CPU cycles, etc.)
  • custom, highly focused tool sets
  • maximum real-time fault detection with clear messaging and logging
  • super-smart fault recovery code

I think the mentors could put together a nice, inexpensive real-time datalogger that records any abnormalities/exceptions during, say, the last 15 seconds to a flash card, and that has a really smart diagnostic messaging interface to monitor key system events: high- and low-power exceptions to the cRIO and bridge, inbound and outbound network traffic rates, and other key parameters and events. The cRIO could communicate with it via the network or an I/O card, capturing key exception events, especially those that lead to the cRIO rebooting. A $20 Arduino plus a little extra hardware should be able to do it.

I think D-Link could give FIRST access (via special firmware if necessary) to key radio performance parameters: signal quality, error/retry rates, throughput rates, capacity loading, etc.

I would also like to throw the ball into National Instruments' court to:

  1. Speed up the rebooting of the cRIO. It takes forever in a 150-second match. A high-value real-time system I worked on could reboot and have the control program back in control in a fraction of a second, while still saving the complete pre-reboot memory image for later crash-dump analysis. Imagine if a robot could reboot in a few seconds.

  2. Optimize the “critical error forcing a reboot” code to store a simple, human-readable reason in a location that can be easily accessed after the reboot (or perhaps force it out the serial port or an I/O card to the datalogger) before rebooting. No more guessing whether it rebooted, or why.
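The "store a reason before rebooting" pattern is simple to sketch. This is not the cRIO's actual API (NI's error-handling internals aren't public here); ordinary file I/O and an invented path stand in for whatever nonvolatile store the controller has:

```cpp
#include <fstream>
#include <string>

// Hypothetical "reboot breadcrumb": before a forced reboot, write a short
// human-readable reason to persistent storage; on the next boot, read and
// clear it. File I/O stands in for real nonvolatile storage.
const char* kReasonPath = "/tmp/last_reboot_reason.txt";  // invented path

void record_reboot_reason(const std::string& reason) {
    std::ofstream f(kReasonPath, std::ios::trunc);
    f << reason << '\n';   // e.g. "watchdog timeout in comms task"
}

std::string read_and_clear_reboot_reason() {
    std::ifstream f(kReasonPath);
    std::string reason;
    if (f) std::getline(f, reason);
    std::ofstream(kReasonPath, std::ios::trunc);  // clear the breadcrumb
    return reason;  // empty string means a clean boot
}
```

The same string could just as well be pushed out the serial port so the external datalogger records it even if the controller's storage is corrupted.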

My goal here is not to do a design in a forum but to get FIRST to take advantage of its user expertise over the next nine months to build the most rugged, survivable, diagnostic-enriched robot and field system design FIRST has ever had. FIRST, are you listening?

There was an interesting post in this thread that talks about what could be causing such pervasive connectivity issues.

MAldridge raises an interesting point about the firmware that runs the FMS.

You will quickly find that it is only barely stable, and that it suffers from huge amounts of odd quirks inherent to VB.NET apps. It is clear to me that the FMS program itself is the root of this evil. Since it handles the WiFi system, it is also to blame for the oddness of connections that work perfectly on one field but not on another, or robots that work on the practice field, but not on the real field. The FMS program is in dire need of a re-write, preferably in LabVIEW, which most FTA’s can debug errors from very quickly.

I can’t make a recommendation on what language it should be rewritten in (I’m a drafter, Jim, not a programmer!) but according to his research that seems to be the cause of many (if not all) of the connectivity issues. How is this on topic with the subject at hand, you might be asking? Well, in regards to the OP about utilizing the resources that FIRST has: if FIRST were allowed to install custom firmware onto those FMS devices (the answer to which I’m not sure about, to be honest), they could have the hundreds if not thousands of programming mentors out there (and even some of the student prodigies) contribute to an open-source firmware initiative for the FMS and (possibly) solve these connectivity issues. I could imagine FIRST starting an account on GitHub for everyone to contribute code and get some properly working and stable firmware. FIRST is all about bringing people together in a mutual love of science and technology, and what could be more indicative of that than a crowd-sourced FOSS initiative for fixing any software bugs/glitches in the FMS firmware?

Just my 2 cents. I’m sure I made some incorrect assumptions as well as used some terminology incorrectly in a few places, but I hope I got the point across clearly enough.

I’m sure that can be done. Just one point: flash cards are relatively slow. That’s why the FPGA oscilloscope I made for the robot uses computer memory (still not as fast as the static RAM you find inside the ATmega on the Arduino).

How much storage speed affects this design will depend on the maximum samples per second you want. If you just want, say, 5,000 samples per second, there are existing SD-memory-based data loggers that can handle that. That’s also getting toward the maximum sample rate of most true-RMS DMMs.

For example:
“…it has a 1ms (1000 samples per second) Min Max mode, the newer models; Fluke 87 Series III, and Fluke 87 Series V, have a 250µs (4000 samples per second) mode…”

Something to think about with this:

See FAT16Lib’s post:
"This is a very good card for data logging. Not because of the average write rate of 198 KB/sec but the max latency of 81396 usec. Some cards have a max write latency of 200000 usec.

The 198 KB/sec is way faster than needed to log 1000 records per second with 14 byte records.

Once again the design problem is to overcome the occasional long write latency that is inherent in SD cards.

Even class 10 cards have this problem. The assumption is that devices like video cameras have lots of buffer so they achieve high average write rates for very large files. This allows occasional pauses by the card to erase large areas of flash. You can only write to erased flash."
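FAT16Lib's numbers translate directly into a minimum RAM buffer size: while the card stalls for its worst-case write latency, new records must accumulate somewhere, so the buffer has to hold at least rate × latency records. A back-of-the-envelope sketch of that arithmetic:

```cpp
#include <cstddef>
#include <cmath>

// Minimum buffer needed to ride out the SD card's worst-case write stall:
// during a stall of max_latency_sec, records keep arriving at
// records_per_sec, so the buffer must hold rate x latency records
// (rounded up), each record_bytes long.
size_t min_buffer_bytes(size_t record_bytes,
                        double records_per_sec,
                        double max_latency_sec) {
    double records = std::ceil(records_per_sec * max_latency_sec);
    return record_bytes * static_cast<size_t>(records);
}
```

With the quoted figures (14-byte records at 1000/s, 81,396 µs max latency) that comes to 82 records, about 1.1 KB of buffer, which is tight but feasible even in an ATmega's 2 KB of SRAM; a card with a 200,000 µs stall would need roughly 2.8 KB and would not fit.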

Thanks for some great links and info.

Re logging: I really only hoped to log exceptions (out-of-normal parameters for an operating robot), so hopefully we are not talking thousands of writes per second. If we are then, as you pointed out, the data could be written into the Arduino’s memory and dumped offline to a USB stick or compact flash card.

Part of the goal is to give a team a good problem-solving tool for development and testing, and part is to provide supporting evidence of field issues versus robot issues.

Please. Please make this happen. It’d be a dream come true. :smiley:

As I have offered with the electronic motor controls and with LinuxBoy’s CAN terminators: anyone who needs a little prototype help with this, please let me know.

The issue of sample speed comes down to how fast a ‘glitch’ or transient you can catch. For example, let’s say you hit a bump and your D-Link AP radio connector opens up for 1 µs. If you happened to take your samples just before and just after, you would not see it happen. It’s an effect of taking discrete samples, and it’s part of quantization.

The idea of holding off on capturing huge amounts of data, and limiting the duration of what you do capture, is a very good one. It means you can sample more frequently during the window you do sample, because you suspect a problem you definitely want to catch. It also limits the amount of pointless data you need to wade through later.

Also, if you have a sample buffer fast enough to collect data very quickly, there’s nothing stopping you from using it slowly as well. That way, if you wanted, you could still capture every voltage you monitor throughout the duration of a match without additional hardware. You just need to be able to adjust the trigger and the sample speed.

Yes these terms are all familiar to people that use oscilloscopes.

Re sampling: I’d thought capturing power-line transients might best be done with a custom analog overvoltage/undervoltage threshold circuit that triggers a counter measuring the length of the glitch and stops when the power returns to within normal bounds. The logger gets interrupted and reads the count. It might need a little distributed PIC to do it. The undervoltage and overvoltage thresholds would be set at levels that could cause corruption, hangs, or reboots in the cRIO and other critical devices. The logger would record a “device Bridge PS undervoltage for X ms” event.
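The threshold-plus-counter behavior can be modeled in a few lines of software. In hardware the comparator and counter would run continuously; this sketch just scans a recorded stream of voltage samples and reports the longest out-of-band excursion (the band limits and millivolt units are illustrative assumptions):

```cpp
#include <vector>

// Software model of the proposed threshold-plus-counter: scan voltage
// samples (mV) and report how many distinct glitches left the allowed
// band [lo, hi], and the longest one, measured in sample counts.
struct GlitchReport {
    int longest = 0;   // longest out-of-bounds run, in samples
    int count = 0;     // number of distinct glitches
};

GlitchReport scan(const std::vector<int>& mv, int lo, int hi) {
    GlitchReport r;
    int run = 0;
    for (int v : mv) {
        if (v < lo || v > hi) {
            if (run == 0) ++r.count;   // a new glitch starts
            ++run;
            if (run > r.longest) r.longest = run;
        } else {
            run = 0;                   // back in bounds: counter stops
        }
    }
    return r;
}
```

Multiplying `longest` by the sample period gives the “undervoltage for X ms” figure the logger would record.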

Overvoltage detection would immediately identify situations where the bridge or camera were accidentally powered with 12 V, both of which happened to us this year. Also, our cRIO rebooted a number of times; the cause was never absolutely confirmed.

I completely agree.

That’s why I already made a set of modules with op-amps that have a voltage reference, a voltage divider, and a latch at the output (a latching voltage comparator).

You set them for the voltage you consider too high or too low, and when it hits that point the output sets and stays that way until you clear it.

You can feed that output into the cRIO, drive an LED with it, or connect it to something like a relay as a breaker.

The fancy version uses digital pots (R2R ladders) and is actually calibrated and self-testing from a small handheld unit I made.

I originally created them because I kept getting told that Team 11’s robot-mounted Jaguar electronic motor controls had power-quality issues last year. So I finally got fed up and made something fast enough (and practical to use on a moving robot) to demonstrate the truth or falsehood of the claim. It wasn’t true.

Obviously there is a time for the output to transition, so there is a response time, but it’s much smaller than a data logger’s and approaches the sort of response time you get from an oscilloscope. I could actually do this with discrete transistors or on a wafer… but this is a kid’s toy.

The reason I didn’t set the circuit up to time the length of the glitch is that it’s once again a quantization issue. Even if a small, cheap PIC watches the output of the voltage comparator, a fast enough glitch just disappears. With the latch you can force a really short glitch to show up even if your time counter’s resolution is too low. So what if your counter misrepresents a 0.1 µs glitch as something longer? It’s still a glitch, and it did not slip by.
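The latch's advantage over a polled counter can be shown with a toy model: a single out-of-bounds instant, however brief relative to the polling rate, sets the latch, and it stays set until explicitly cleared (this models the behavior described above, not the actual op-amp hardware):

```cpp
// Toy model of the latching comparator: any out-of-bounds instant,
// however brief, sets the latch, and it stays set until explicitly
// cleared, so even a glitch shorter than a counter's resolution
// still shows up when the PIC finally gets around to polling.
class Latch {
public:
    void sample(bool out_of_bounds) { if (out_of_bounds) set_ = true; }
    bool tripped() const { return set_; }
    void clear() { set_ = false; }
private:
    bool set_ = false;
};
```

A polled counter that only checks `tripped()` every millisecond still sees a sub-microsecond glitch, because the latch held it; without the latch, the same poll would read a clean line.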