My name is Nicholas Hubbard. I am lead vision programmer for Team 1701, The Robocubs, out of Detroit, MI. Over the last 6 months, I have been working extensively on a deep-learning-based object detector, which runs at over 60 FPS on a relatively old graphics card (an Nvidia GeForce 960M) with near-100% accuracy.
Let’s start with what inspired me. The idea came out of a small conversation on Slack with the team’s lead programmer, @noahhusby:
We used Limelight as our vision solution this year, and we were discussing what to use for distance tracking. I suggested deep-learning-based object detection as a stretch goal, and set to work on making it happen. As with anything hip and new, it was a difficult adjustment and required a ton of preparation at all of the events we attended; it wasn’t ready until the second day of the Alpena 2 competition.
Initially, I aimed much lower hardware-wise. I used my Motorola Moto Z3 Play as a testing ground for TensorFlow Lite object detection, using a 1st-generation MobileNet architecture with my own custom dataset. I found a training script (which I unfortunately cannot find now) and used it to create a model, which ran with some… flaws on my mobile device:
After this, I changed my model to use the TensorFlow Object Detection API on a non-mobile device. I used a 2nd-generation MobileNet architecture and 1,000 training images of each object to detect, generated from 20-second, 60 FPS video clips of the objects moving around.
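If you're wondering how those clips became training images, the frame-extraction step looks roughly like this sketch (the file names and sampling interval are placeholders, not our exact pipeline):

```python
import os
import cv2  # OpenCV, used here only for video decoding and JPEG writing


def extract_frames(video_path, output_dir, every_nth=1):
    """Dump frames from a video clip to JPEGs for labeling and training."""
    os.makedirs(output_dir, exist_ok=True)
    capture = cv2.VideoCapture(video_path)
    index = 0
    saved = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break  # end of clip
        if index % every_nth == 0:
            cv2.imwrite(f"{output_dir}/frame_{saved:05d}.jpg", frame)
            saved += 1
        index += 1
    capture.release()
    return saved


# A 20-second clip at 60 FPS is roughly 1,200 frames, which is where the
# ~1,000 images per object come from after throwing out bad frames.
extract_frames("ball_clip.mp4", "dataset/ball", every_nth=1)
```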
As with conventional object detection, I drew bounding boxes around the objects in each image, wrote them into a file format the training script could understand, and started training on Google Cloud. It took a really long time (or what I thought was a really long time) to reach 250,000 training iterations.
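For the TensorFlow Object Detection API, that file format is TFRecord. Below is a minimal sketch of packing one labeled image into a record; the field names follow the API's conventions, but the file names, box coordinates, and helper function are made up for illustration:

```python
import tensorflow as tf


def build_example(jpeg_bytes, width, height, boxes, class_ids, class_names):
    """One labeled image -> one tf.train.Example in the Object Detection API layout.

    `boxes` are (xmin, ymin, xmax, ymax) in pixels; `class_names` are bytes.
    (A full converter also records filename/source_id and a few other fields.)
    """
    xmins = [b[0] / width for b in boxes]   # normalized to [0, 1]
    ymins = [b[1] / height for b in boxes]
    xmaxs = [b[2] / width for b in boxes]
    ymaxs = [b[3] / height for b in boxes]
    feature = {
        "image/encoded": tf.train.Feature(bytes_list=tf.train.BytesList(value=[jpeg_bytes])),
        "image/format": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"jpeg"])),
        "image/width": tf.train.Feature(int64_list=tf.train.Int64List(value=[width])),
        "image/height": tf.train.Feature(int64_list=tf.train.Int64List(value=[height])),
        "image/object/bbox/xmin": tf.train.Feature(float_list=tf.train.FloatList(value=xmins)),
        "image/object/bbox/ymin": tf.train.Feature(float_list=tf.train.FloatList(value=ymins)),
        "image/object/bbox/xmax": tf.train.Feature(float_list=tf.train.FloatList(value=xmaxs)),
        "image/object/bbox/ymax": tf.train.Feature(float_list=tf.train.FloatList(value=ymaxs)),
        "image/object/class/label": tf.train.Feature(int64_list=tf.train.Int64List(value=class_ids)),
        "image/object/class/text": tf.train.Feature(bytes_list=tf.train.BytesList(value=class_names)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


# Write every labeled frame into one .record file for the training job.
# (tf.python_io.TFRecordWriter on older TF 1.x; tf.io.TFRecordWriter on newer builds.)
with tf.io.TFRecordWriter("train.record") as writer:
    with open("dataset/ball/frame_00000.jpg", "rb") as f:
        example = build_example(f.read(), 640, 480,
                                boxes=[(100, 120, 300, 320)],
                                class_ids=[1], class_names=[b"ball"])
    writer.write(example.SerializeToString())
```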
The results were excellent: I can run the model on a GeForce 960M, with a heavily optimized build of TensorFlow and Nvidia TensorRT optimizations applied, at 99%+ accuracy on both balls and hatches across different lighting conditions, angles, and distances from the camera, all while staying within ±5 FPS of a constant 60 FPS. The real distinguishing factor is that lighting changes didn’t require any recalibration for event lighting. (I don’t have any final demonstration images right now, since finals are upon us at our school, but I may be able to add some in the future.)
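For anyone curious, applying TensorRT optimizations through TF-TRT to an exported detection graph generally looks something like the sketch below. This uses the TF 1.x contrib API; the graph file name is a placeholder, and the tensor names are the Object Detection API's export defaults, so treat it as a starting point rather than our exact inference code:

```python
import numpy as np
import tensorflow as tf
from tensorflow.contrib import tensorrt as trt  # TF-TRT lives in contrib on TF 1.x

# Load the frozen detection graph exported by the Object Detection API.
with tf.gfile.GFile("frozen_inference_graph.pb", "rb") as f:
    frozen_graph = tf.GraphDef()
    frozen_graph.ParseFromString(f.read())

# Ask TensorRT to rewrite the heavy subgraphs into optimized engines.
trt_graph = trt.create_inference_graph(
    input_graph_def=frozen_graph,
    outputs=["detection_boxes", "detection_scores",
             "detection_classes", "num_detections"],
    max_batch_size=1,
    max_workspace_size_bytes=1 << 25,
    precision_mode="FP16")  # or "FP32", depending on what the GPU supports

# The rewritten graph is imported and run like any other frozen graph.
camera_frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a real camera frame
with tf.Graph().as_default() as graph:
    tf.import_graph_def(trt_graph, name="")
    with tf.Session(graph=graph) as sess:
        boxes, scores = sess.run(
            ["detection_boxes:0", "detection_scores:0"],
            feed_dict={"image_tensor:0": camera_frame[None, ...]})
```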
Now, for the meaty details. I wrote a semi-scientific paper, which I presented to the judges at the Alpena 1 event, and I have attached it here (90.5 KB). I also wanted to make it easier to work with these models, since doing so previously required a significant amount of boilerplate code. The library I created is called VTK, and it is available at https://github.com/Robocubs/VTK, with documentation at https://vtk.readthedocs.io/en/latest.
Thank you for reading, and hopefully this information helps you with your vision aspirations in the future.