Lessons to be learned from Einstein reports
Focusing on some of the software-specific lessons we might be able to take away from the technical report on the failures encountered on Einstein: from personal experience, and from reading over the shoulders of many a student and mentor on other teams, I know that the following issues are something we should all be careful to take note of in the future. Feel free to add any other topics we should all remember.
Excessive Ethernet communications:
Transmitting images that are too large, at too high a framerate, and in general sending data far faster than it could ever actually be needed.
Excessive 'useless' CPU utilization:
Some tasks have to run fast, but not all tasks are equal, and they do not all have to run fast. In particular, many teams put time-consuming I/O instructions into code segments that are expected to run very often, resulting in full CPU utilization just from the processor trying to keep up.
Ignorance of the potential for 'partial' failures:
Things happen: wires get cut, parts of your CAN bus fail, that gorgeous I2C chip you have gets smashed by a basketball. Software should be architected so that the failure of one subsystem does not mean the failure of the whole system. This is particularly dangerous in the case where a failure of an I/O device causes excessive timeout delays, which lead to the 'useless' CPU utilization problem mentioned above.
Segregation of 'I/O' jobs vs. 'Processing' jobs:
This one isn't strictly necessary, per se, but it's something that in my opinion is good practice. Too many teams run their PID loops in threads driven by instruction-packet receipt, alongside I/O tasks that can throttle the control loop into running at a totally unexpected timestep. The more your system can separate I/O tasks from processing tasks, and from each other, the better off you are likely to be. (Certainly not in all cases, and only to a certain degree, but in general, in my opinion.)
Some things you can do to work toward these goals:
Have your I/O jobs work independently from your processing jobs, storing aside and fetching requisite data.
Ensure that I/O jobs can detect when there is a problem with their communications, and have them free resources for other subsystems when that happens, rather than consume them.
Throttle processing and I/O jobs to, say, double the rate they actually need to run at to get the performance you require, rather than 1000 times that rate. (Faster is not always better.) (I know some control loops really do have to run quite fast, but they're a special exception and should be handled separately.) A rough sketch of this kind of split follows below.
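To make the first and third points above a bit more concrete, here's a minimal sketch in plain Java; readDistanceSensor() and setMotorOutput() are placeholders for whatever hardware library your team actually uses, and the 50 Hz / 100 Hz rates are only examples.

import java.util.concurrent.atomic.AtomicReference;

public class SplitLoops {
    // Latest sensor snapshot shared between the I/O thread and the control thread.
    // AtomicReference keeps the hand-off simple and lock-free.
    private final AtomicReference<Double> latestDistance = new AtomicReference<>(0.0);

    // Hypothetical hardware calls -- stand-ins for whatever bus/library a team uses.
    private double readDistanceSensor() { return 0.0; }
    private void setMotorOutput(double value) { }

    public void start() {
        // I/O thread: polls the (possibly slow) sensor at ~50 Hz and just stores the result.
        Thread ioThread = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                latestDistance.set(readDistanceSensor());
                sleepQuietly(20);   // ~50 Hz is plenty; no need for a busy loop
            }
        }, "sensor-io");

        // Processing thread: runs the control math at its own fixed ~100 Hz timestep,
        // using whatever snapshot the I/O thread stored most recently.
        Thread controlThread = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                double distance = latestDistance.get();
                setMotorOutput(computeOutput(distance));
                sleepQuietly(10);
            }
        }, "control");

        ioThread.start();
        controlThread.start();
    }

    private double computeOutput(double distance) { return distance * 0.01; } // placeholder gain

    private static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}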
I know there's always a lot of talk on here about switching languages or platforms to alleviate performance problems, but any language can be used to write software that leaves itself open to performance concerns. All of the languages currently (popularly) used by FIRST teams can perform the task put before us, but not if we get in their way.
Excessive Ethernet communications:
Transmitting images that are too large, at too high a framerate, and in general sending data far faster than it could ever actually be needed (did you know the default LabVIEW code sends driver station updates as fast as it receives instruction packets?)
The LabVIEW default code actually sends driver station updates 1/25th as often as it receives instruction packets, AFAIK. We've modified ours to make this factor 1/5th instead, but haven't actually tested the change on a real field yet. Either way, networking technology is pretty good. We should be able to stream 1080p video over the air easily in this day and age. This kind of data should be inconsequential.
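Whatever the exact factor is, the mechanism is just a simple decimation counter. A rough Java sketch of the idea (the packet handling and dashboard calls are placeholders, not any real FRC API):

public class DashboardThrottle {
    private static final int SEND_EVERY_N_PACKETS = 25; // the default factor discussed above
    private int packetCount = 0;

    // Called once per received instruction packet.
    public void onInstructionPacket(Object packet) {
        packetCount++;
        if (packetCount % SEND_EVERY_N_PACKETS == 0) {
            sendDashboardUpdate();   // only one update for every 25 packets received
        }
        // ...handle the drive commands in the packet as usual...
    }

    private void sendDashboardUpdate() { /* placeholder */ }
}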
Excessive 'useless' CPU utilization:
Some tasks have to run fast, but not all tasks are equal, and they do not all have to run fast. In particular, many teams put time-consuming I/O instructions into code segments that are expected to run very often, resulting in full CPU utilization just from the processor trying to keep up.
I noticed that, more than once, the problem was using "yield" instead of "sleep" (For the uninformed, the "yield" function is basically like a sleep for 0 milliseconds. It doesn't actually wait any minimum amount of time, but if other things are scheduled, it will go process them.) This should be just as good IMO -- In this case, if you're hitting 100% CPU but still executing all of your code regularly, all that means is that you're being 100% efficient with the hardware.
If you weren't yielding OR sleeping in your secondary threads, or just not doing either often enough, that would be a different story. The only thing executing would be your heavy block of code.
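For anyone who hasn't run into the difference, a minimal Java illustration (the work inside the loop is just a placeholder):

public class YieldVsSleep {
    private volatile boolean running = true;

    private void pollSomething() { }   // placeholder for the thread's real work

    // yield(): returns immediately if nothing else is runnable,
    // so this loop can still pin the CPU at 100%.
    public void yieldLoop() {
        while (running) {
            pollSomething();
            Thread.yield();
        }
    }

    // sleep(): guarantees a minimum pause, capping this loop at roughly 50 Hz
    // and leaving the CPU idle between iterations.
    public void sleepLoop() throws InterruptedException {
        while (running) {
            pollSomething();
            Thread.sleep(20);
        }
    }
}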
Ignorance of the potential for 'partial' failures:
Things happen: wires get cut, parts of your CAN bus fail, that gorgeous I2C chip you have gets smashed by a basketball. Software should be architected so that the failure of one subsystem does not mean the failure of the whole system. This is particularly dangerous in the case where a failure of an I/O device causes excessive timeout delays, which lead to the 'useless' CPU utilization problem mentioned above.
I love this advice. On our team this year we made it a strict policy that the bare functionality of the robot had to work without sensors, and that the driver should never be overridden by a faulty sensor. It wasn't perfect, since we did have sensor failures that cost us a match at GTR West, but perhaps next year we can implement this idea more rigorously.
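As a rough illustration of that policy, here's a Java sketch; all of the names and gains are hypothetical, and the point is only that a latched sensor fault drops you back to plain driver control instead of fighting a bad reading:

public class ArmControl {
    private static final double K_P = 0.02;        // placeholder gain
    private boolean encoderHealthy = true;

    // driverCommand: raw joystick value in [-1, 1]; encoderError: setpoint - measured.
    public double computeOutput(double driverCommand, double encoderError, boolean encoderFaulted) {
        if (encoderFaulted) {
            encoderHealthy = false;   // latch the fault instead of retrying (and timing out) forever
        }
        if (!encoderHealthy) {
            return driverCommand;     // sensor is suspect: plain open-loop driver control
        }
        // Healthy sensor: simple proportional assist on top of the driver command.
        return driverCommand + K_P * encoderError;
    }
}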
Segregation of 'I/O' jobs vs. 'Processing' jobs:
This one isn't strictly necessary, per se, but it's something that in my opinion is good practice. Too many teams run their PID loops in threads driven by instruction-packet receipt, alongside I/O tasks that can throttle the control loop into running at a totally unexpected timestep. The more your system can separate I/O tasks from processing tasks, and from each other, the better off you are likely to be. (Certainly not in all cases, and only to a certain degree, but in general, in my opinion.)
We still don't do this, but we really should. I don't think we've ever caused ourselves to miss a teleop packet because we took so long to process the last one, but there are many other good reasons to be doing this anyway.
In fact I think this is one area where the default template could be improved -- it could be friendlier to the idea of offloading whatever you can into other threads. Maybe this will be coming next year, along with the "more thorough documentation regarding threading".
The LabVIEW default code actually sends driver station updates 1/25th as often as it receives instruction packets, AFAIK. We've modified ours to make this factor 1/5th instead, but haven't actually tested the change on a real field yet.
Looking at it again, you're right, my mistake.
Either way, networking technology is pretty good. We should be able to stream 1080p video over the air easily in this day and age. This kind of data should be inconsequential.
Keep in mind that 1080p video from most network services is heavily compressed (YouTube, for instance, streams 1080p using only ~4-6 Mbps depending on the codec). If some team were to rip open the data structure of the images being processed on the cRIO and transfer them raw and uncompressed at 640x480 and 30 fps, that single video stream alone would run nearly 28 megabytes per second, well over 200 Mbps, which would (hopefully) get their attention from an FTA pretty quickly.
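Just to show where that number comes from (this assumes 24-bit color; a grayscale or compressed stream would obviously be much smaller):

public class VideoBandwidth {
    public static void main(String[] args) {
        int width = 640, height = 480, bytesPerPixel = 3, fps = 30;
        double bytesPerSecond = (double) width * height * bytesPerPixel * fps;
        double megabytesPerSecond = bytesPerSecond / 1_000_000;          // ~27.6 MB/s
        double megabitsPerSecond  = bytesPerSecond * 8 / 1_000_000;      // ~221 Mbps
        System.out.printf("Raw 640x480 @ 30 fps: %.1f MB/s = %.0f Mbps%n",
                megabytesPerSecond, megabitsPerSecond);
    }
}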
To my understanding (taken from a post of Greg McKaskle's), the field network utilization during Einstein matches was recorded at roughly 25 Mbps, or about 3 megabytes per second, which is already in the same ballpark as a well-compressed 1080p video stream. Not to say that that's too much for the Cisco AP to handle, because it isn't, but with the networking equivalent of Moore's law pushing demand up every year, I think it's something we need to make sure to keep in mind.
FIRST's report recommends adding QoS and bandwidth limiting to FMS router configurations, which should alleviate most issues that this could ever potentially cause.
It doesn't actually wait any minimum amount of time, but if other things are scheduled, it will go process them.) This should be just as good IMO -- In this case, if you're hitting 100% CPU but still executing all of your code regularly, all that means is that you're being 100% efficient with the hardware.
This is very true on many architectures, including, I believe, ours. However, I'd argue that losing the ability to quickly tell whether a process is overrunning (by checking CPU utilization) is a heavy price to pay just to avoid fixing the minimum timestep of a secondary thread that very likely isn't doing anything that demands all that CPU attention anyway.
In fact I think this is one area where the default template could be improved -- it could be friendlier to the idea of offloading whatever you can into other threads. Maybe this will be coming next year, along with the "more thorough documentation regarding threading".
Lately I've been encouraging placing control and I/O logic in separate tasks (structured like 'Timed Tasks' in our LabVIEW implementation) and having the instruction-packet-triggered teleop code do nothing but process the command packets it's receiving. I think we could all benefit from more thorough documentation, though.
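In non-LabVIEW terms, a minimal Java sketch of that structure might look something like the following; the names are placeholders and the ScheduledExecutorService is just standing in for a Timed Task:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class TimedControl {
    // Most recent driver command, shared between the packet handler and the timed task.
    private final AtomicReference<double[]> latestCommand = new AtomicReference<>(new double[] {0, 0});
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // Called whenever an instruction packet arrives: just record the command and return.
    public void onTeleopPacket(double throttle, double steer) {
        latestCommand.set(new double[] {throttle, steer});
    }

    // Control and output run on their own fixed 20 ms timestep, independent of packet timing.
    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            double[] cmd = latestCommand.get();
            driveMotors(cmd[0], cmd[1]);   // placeholder output call
        }, 0, 20, TimeUnit.MILLISECONDS);
    }

    private void driveMotors(double throttle, double steer) { }
}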
Al Skierkiewicz
15-07-2012, 11:11
Guys,
One of the biggest failings when speaking about cameras is discussing baseband video when it is, in fact, MPEG. Teams should read a few papers in the off-season on how MPEG actually encodes video and how it transmits data. Merely reducing frame rate and resolution does not reduce bandwidth when fast-moving objects and moving backgrounds are in the field of view. The codecs used to compress MPEG video for HD TV are complex and highly adaptive/predictive. You don't notice many of the artifacts, but professional video people do. Reducing video noise is one of the steps employed prior to encoding, since the codecs treat noise as live video. Aggressive noise reduction can actually make rain disappear in video.
Guys,
One of the biggest failings when speaking about cameras is discussing baseband video when it is, in fact, MPEG. Teams should read a few papers in the off-season on how MPEG actually encodes video and how it transmits data. Merely reducing frame rate and resolution does not reduce bandwidth when fast-moving objects and moving backgrounds are in the field of view. The codecs used to compress MPEG video for HD TV are complex and highly adaptive/predictive. You don't notice many of the artifacts, but professional video people do. Reducing video noise is one of the steps employed prior to encoding, since the codecs treat noise as live video. Aggressive noise reduction can actually make rain disappear in video.
Also note that most video compression algorithms throw away more than 99% of the data of baseband video. Uncompressed 1080p is around 3000 Mbps or 3 Gbps. Even high-quality Blu-Rays are encoded at around 35 Mbps. Baseband 720p/1080i is compressed down from around 1.5 Gbps to under 18 Mbps and sometimes lower than 5 Mbps for broadcast.