With the rise of various machine learning algorithms, I have enjoyed trying to find ways to implement them in FRC. One simple use case was vision: no more dealing with re-tuning at every event. I trained a simple SSD for 20 minutes on my way home from school and got some acceptable results.
For this test, I only used 20 frames because I needed something quick and was too lazy to annotate more than that at the time. For future tests, however, I am using 500. I plan on implementing this method to make life easier: http://www.tmrfindia.org/ijcsa/v11i23.pdf.
DetectNet’s backbone is GoogLeNet, which was designed for ILSVRC, an annual computer vision challenge with 1,000 classes and over a million images. I would be very skeptical about using it, for two reasons. First, it is extremely likely that you would overfit your model, which causes unexpected behavior on images outside the training set. Second, inference time: GoogLeNet is very large (though it uses 12× fewer parameters than AlexNet), and inference times matter when you have a clock running.
Our goal is to get output like what’s in the Drive folder @AirplaneWins linked at the top of this thread. Our team tried methods other than machine learning, but they didn’t work; ML was our last resort.
There comes a moment in every FRC student’s life where they are torn between exploring their technical creativity and strictly pursuing blue banners. The two are more often than not mutually exclusive. (That isn’t meant to deter you in the slightest, mind you.)
While object detection is not a new concept, it has been revitalized in the past few years thanks to YOLO and R-CNN. Time permitting, it would make the most sense to start from the foundations and build up (not unlike how math is taught), but with bag and tag encroaching on us, that might not be what you want to do.
In the spirit of modern machine learning (“there are no free lunches”): here is a collection of papers that would be worth reading. Do not let the fact that these are professional publications deter you.
AirplaneWins said he uses “a simple SSD”. The original paper is number 7 on the list I linked. With a rudimentary Google search, you can find many open-source implementations of your run-of-the-mill single-shot detector (SSD) across various deep learning frameworks.
Just as Loveless said, there are tons of models you can try. After some more experimenting, I moved to an implementation of YOLO9000, as I’m trying to get as close to real time as possible. I’ve had great success with it so far and really like the model. I made some custom changes to its architecture, though, to make it easier to run on the RIO/phone. I’ve attached the paper I based it on.
A very big advantage that YOLO9000 has is its hierarchical classification, thanks to some older concepts from natural language processing finally moving over to computer vision, as the paper you linked describes. It is an excellent paper that is really pushing the field forward.
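For readers unfamiliar with the idea: in YOLO9000-style hierarchical classification (the paper's "WordTree"), each node predicts a probability conditioned on its parent, and the absolute probability of a label is the product of conditionals along the path to the root. A toy sketch, with made-up node names and probabilities:

```python
# Toy WordTree-style hierarchy. Keys map a label to (parent, P(label | parent));
# the root "object" is implicit with probability 1. All values are illustrative.
conditionals = {
    "ball": ("object", 0.9),
    "cube": ("object", 0.1),
    "yellow_ball": ("ball", 0.7),
    "red_ball": ("ball", 0.3),
}

def absolute_prob(label):
    """Multiply conditional probabilities up the tree to the root."""
    p = 1.0
    while label in conditionals:
        parent, cond = conditionals[label]
        p *= cond
        label = parent
    return p

print(round(absolute_prob("yellow_ball"), 2))  # 0.9 * 0.7 -> 0.63
```

This is what lets the model fall back gracefully: even if it can’t distinguish `yellow_ball` from `red_ball`, it can still report high confidence for `ball`.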
How many images are you training on? I would think about ten thousand or so would suffice, depending on how you modified the model (and, of course, at least an equal number of “other” images).
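One common way teams stretch a small annotated set toward numbers like that is data augmentation: each labeled frame yields several training samples. A minimal sketch, assuming images as HxWx3 NumPy arrays and boxes as `(xmin, ymin, xmax, ymax)` tuples (both conventions are my assumption, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, boxes):
    """Return extra (image, boxes) pairs derived from one annotated frame."""
    h, w = image.shape[:2]
    out = []
    # Horizontal flip: mirror the pixels AND the x-coordinates of each box.
    flipped = image[:, ::-1]
    flipped_boxes = [(w - x2, y1, w - x1, y2) for x1, y1, x2, y2 in boxes]
    out.append((flipped, flipped_boxes))
    # Brightness jitter: scale pixel values; boxes are unchanged.
    scale = rng.uniform(0.7, 1.3)
    jittered = np.clip(image.astype(float) * scale, 0, 255).astype(np.uint8)
    out.append((jittered, boxes))
    return out
```

Flips, crops, and lighting jitter are cheap insurance against a detector that only works under your practice field’s lighting.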
What I’m really curious about is how you (AirplaneWins) got a dataset of 20 images to work. We’d really like to know, since annotating 10,000 images is not really feasible for us. Did you use YOLO9000 for those 20 images, or something else? What was the very first thing you did to get the output you linked on GDrive? Thanks!
Ah the glory days of being the lowest rank and having to manually annotate images.
I wouldn’t trust a DL model fine-tuned on only 20 images. There is no way it is not overfit, and there is no way to tell how it will behave on images that aren’t very close to the training data. That is asking for trouble.
We know that 20 images is not optimal for machine learning. We know that AirplaneWins now has a much larger data set.
My point is that, at the very beginning of this thread, AirplaneWins had 20 images and still got decent results:
Are we mistaken? If you did indeed use 20 images and managed to get adequate results, we would really like to know how you did that specifically, not how you would do machine learning in ideal conditions with an abundance of annotated images.