VTK: Next-Generation AI Vision Solutions for Future Games

Hello all!
My name is Nicholas Hubbard. I am lead vision programmer for Team 1701, The Robocubs, out of Detroit, MI. Over the last 6 months, I have been working extensively on a deep learning based object detector, which runs at over 60 FPS on a relatively old (Nvidia GeForce 960M) graphics card with near 100% accuracy.

Let’s start with what inspired me. The idea grew out of a small conversation on Slack with the team’s lead programmer, @noahhusby:

We used Limelight as our vision solution this year, and we were discussing what to use for distance tracking. I suggested deep learning based object detection as a stretch goal, and set to work on making it happen. As with anything hip and new, it was a difficult adjustment and required a ton of preparation at all of the events we participated in. It was only ready on the second day of the Alpena 2 competition we attended.

Initially, I aimed much lower hardware-wise. I used my Motorola Moto Z3 Play as a testing ground for TensorFlow Lite Object Detection using a 1st generation MobileNet architecture with my own custom dataset. I found a training script (which I unfortunately cannot find now) and used it to create a model, which ran with some… flaws on my mobile device:

After this, I changed my model to use the TensorFlow Object Detection framework on a non-mobile device. I used a 2nd generation MobileNet architecture and 1,000 training images of each object to detect, generated from 20 second 60 FPS video clips of the objects moving around.
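A 20 second clip at 60 FPS contains 1,200 frames, so reaching 1,000 images per object means keeping most of them but not all. Below is a minimal sketch of how such frames could be sampled. The function names, the output file naming, and the even-spacing strategy are assumptions for illustration, not the script actually used here.

```python
from pathlib import Path
from typing import List


def evenly_spaced_indices(total_frames: int, wanted: int) -> List[int]:
    """Pick `wanted` frame indices spread evenly across a clip."""
    if wanted <= 0 or total_frames <= 0:
        return []
    if wanted >= total_frames:
        return list(range(total_frames))
    step = total_frames / wanted  # > 1, so the floored indices never repeat
    return [int(i * step) for i in range(wanted)]


def extract_frames(video_path: str, out_dir: str, wanted: int = 1000) -> int:
    """Save evenly spaced frames from a clip as JPEG files (hypothetical helper)."""
    import cv2  # imported here so the index helper works without OpenCV

    capture = cv2.VideoCapture(video_path)
    total = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    keep = set(evenly_spaced_indices(total, wanted))
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    saved = 0
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index in keep:
            cv2.imwrite(str(Path(out_dir) / f"frame_{index:05d}.jpg"), frame)
            saved += 1
        index += 1
    capture.release()
    return saved
```

Sampling evenly rather than taking the first 1,000 frames keeps the whole range of motion in the clip represented.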

As with conventional object detection, I labeled bounding boxes for the objects in each image, wrote them into a file format the training script could understand, and started training on Google Cloud. It took a really long time (or what I thought was a really long time) to reach 250,000 training iterations.
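The post does not say which file format the training script expected. Assuming it was the TFRecord layout used by the TensorFlow Object Detection API, box coordinates are stored as fractions of the image width and height rather than as pixels. The helper below is a hypothetical sketch of that one conversion step, not the author's labeling code.

```python
from typing import Dict, Tuple


def normalize_bbox(
    bbox: Tuple[float, float, float, float], width: int, height: int
) -> Dict[str, float]:
    """Convert a pixel box (xmin, ymin, xmax, ymax) to 0..1 fractions."""
    xmin, ymin, xmax, ymax = bbox
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    if not (0 <= xmin < xmax <= width and 0 <= ymin < ymax <= height):
        raise ValueError(f"box {bbox} does not fit a {width}x{height} image")
    return {
        "xmin": xmin / width,
        "ymin": ymin / height,
        "xmax": xmax / width,
        "ymax": ymax / height,
    }
```

Rejecting boxes that fall outside the image catches labeling mistakes before they reach a long cloud training run.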

With that, the results are excellent: I can run the model on a GeForce 960M with a heavily optimized copy of TensorFlow, along with Nvidia TensorRT optimizations applied, with 99%+ accuracy on both balls and hatches, in different light situations, angles, and distances from the camera. This all occurs while remaining within ±5 FPS of a constant 60 FPS. The real distinguishing factor is that any lighting change didn’t require any form of recalibration for event lighting. (I don’t have any final demonstration images right now, since finals are upon us at our school, but it may be possible in the future.)
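For anyone who wants to check a similar "within ±5 FPS of 60" claim on their own hardware, here is a rough sketch. It times only the processing call, not camera capture, and the function names and the injectable clock are assumptions made for this example; it is not the benchmark used for the numbers above.

```python
import time
from typing import Callable, Iterable, List


def frame_rates(
    process: Callable[[object], object],
    frames: Iterable[object],
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Return the instantaneous FPS measured for each processed frame."""
    rates = []
    for frame in frames:
        start = clock()
        process(frame)
        elapsed = clock() - start
        if elapsed > 0:
            rates.append(1.0 / elapsed)
    return rates


def within_band(rates: List[float], target: float = 60.0, tolerance: float = 5.0) -> bool:
    """True when every measured rate is within target +/- tolerance."""
    return bool(rates) and all(abs(r - target) <= tolerance for r in rates)
```

In practice `process` would wrap a single inference call, and `frames` would be images read from a camera or a video file.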

Now, to address the real meaty details. I wrote a semi-scientific paper which I presented to the judges at the Alpena 1 event, and I have attached it here (90.5 KB). I also wished to make it easier to work with models, since it required a significant amount of boilerplate code beforehand. The library I created is called VTK, and it is available here: https://github.com/Robocubs/VTK, along with documentation at https://vtk.readthedocs.io/en/latest.

Thank you for reading, and hopefully this information helps you with your vision aspirations in the future.


This is really awesome. Lots of sources claim that the future of machine vision is NNs and not traditional CV.
Will that be true in FRC? (Discuss.)
Now that the Jetson Nano is shipping, do you expect to adapt it to run on that?
Anyway, thanks for publishing!
I’m going to recommend that my team (4795) look.

Until I find a model that can run faster than traditional CV, I won’t be switching.


I believe it does run on the Jetson Nano. I haven’t received my development kit yet, but it should work just fine. I’m told that the TX1 (the same platform as the Nano) is a little slower than the 960M hardware-wise.

ninja edit: If you have the budget, a TX2 might be a better idea.

Great write-up. Implementation and optimization of a NN for real-time use is a very tricky problem. But I can see big advantages in reliability and being able to use cheap cameras. Here is some feedback that may help you:

  • Try to include specifics on your training data (e.g. number of types and how many images of each type). You also did not mention separate test/validation sets; that’s critical to proving that your model generalizes to different circumstances. BTW, only 1k images per category may be a little small.
  • Did you consider YOLO (https://pjreddie.com/media/files/papers/YOLOv3.pdf)? YOLO tends to be less accurate than SSD (MobileNet is probably the best SSD), but it is also much faster. I’m not sure the objects are complex enough that a more accurate model is really needed. There is also a new model called “Objects as Points” (https://arxiv.org/pdf/1904.07850.pdf) that is even more accurate. Don’t worry if you didn’t see it earlier; it only came out in April.
  • I’m not an expert on video- but video has problems like object blur (especially in dynamic environments) or object continuity (objects can’t pop in and out of existence from frame to frame.) that present unique problems. LSTM based models have been tried to deal with that- you may want to look at https://github.com/tensorflow/models/tree/master/research/lstm_object_detection.
  • You may want to look at Google Colab (https://colab.research.google.com/); they provide limited use of a GPU/TPU for free. You may want to switch your training to a 1080 or a TPU, which should give you faster results.

Name-wise, I will say my first thought seeing “VTK” was the toolkit at https://vtk.org/.


Thanks for the good feedback! I’ll respond to all of your points for completeness’ sake.

  • The seed data I used included a training set of 1,000 images for each object, which according to Google was enough for a small number of classes, along with a validation set of 1,000 images marked as -2 (false negative), -1 (false positive), 0 (true negative) and 1 (true positive) for each object. I believe that the COCO dataset has 3,000 images per object with 5,000 validation images for a total of 80 classes.
  • I did not consider YOLOv3 because of its release date. It also wasn’t supported in the TensorFlow Object Detection library for training or testing at the time.
  • LSTM based models did not have nearly the same size of community at the time. I don’t know if that has changed since then. Most of the problems I had occurred when the lighting suddenly changed and the autofocus on the camera was running (it took about 20 frames in my experience) or when the object was only partially present on screen (detection would only happen when the object was more than 50% present on screen).
  • I initially did try to use Google Colab. I found it to be extremely difficult to use because you needed to upload all of your files to Google Drive ahead of time as a ZIP file, write additional Python code to extract any necessary files, and the kicker was that your work would disappear 24 hours after you started using it.
  • Regarding the TPU: I did do a shorter (100k instead of 250k) training run on a TPU, but it 1) was extremely expensive (the job cost nearly $190 on Google Cloud) and 2) ran at a constant 100 training iterations per batch (which meant that a short TPU training session took more time than a long training session on a group of GPUs).
  • Regarding training hardware: I used a group of 8 GPU workers and 1 master controller for training on Google Cloud Neural Network Training.
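The four outcome codes used for the validation set (-2, -1, 0, 1) are enough to compute the usual summary metrics. The sketch below assumes each validation image ends up tagged with exactly one of those codes; the function name and the choice of metrics are illustrative, not taken from the training scripts.

```python
from collections import Counter
from typing import Dict, Iterable

# Outcome codes as described in the post.
FALSE_NEGATIVE, FALSE_POSITIVE, TRUE_NEGATIVE, TRUE_POSITIVE = -2, -1, 0, 1


def summarize(outcomes: Iterable[int]) -> Dict[str, float]:
    """Compute accuracy, precision, and recall from per-image outcome codes."""
    counts = Counter(outcomes)
    tp = counts[TRUE_POSITIVE]
    tn = counts[TRUE_NEGATIVE]
    fp = counts[FALSE_POSITIVE]
    fn = counts[FALSE_NEGATIVE]
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no recognized outcome codes found")
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

Reporting precision and recall next to accuracy shows whether the misses are mostly false alarms or mostly overlooked objects, which a single accuracy figure hides.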

I happened to have some first-person video from my team’s robot at champs, and I ran the model on it. It didn’t seem to produce very accurate results. Here is the video: https://photos.app.goo.gl/NUagogSFNjq4tJee7 (the edges of the bounding rectangles are a bit thin). I’m wondering if this is the result of a relatively small training set? There was a post a little while ago where 5026 used YOLOv3 to label some of their match video. They had done the labeling offline, but were wondering about using a Jetson to do it onboard the robot. Here is the link to that post: Vision rings or yolo for auto alignment?.

The issue there is YOLO. I’ve seen with my preliminary testing that SSD with MobileNet 2 is much more accurate and similarly performant to YOLO. How are you running the model, using VTK or a program of your own? I can provide the scripts I used to benchmark the model with live video.

It also might be the Limelight LEDs messing with some of the characteristics the model is searching for. My training dataset only consisted of indoor lighting systems, like fluorescent and white LED lights, that you find at competitions.

I labeled my video using VTK. Here is the script I wrote to use it: label.py. I would be interested in seeing your scripts, including those that you used to train the model. I have a bunch of first-person videos from my team’s robot and I am currently working on labeling them using supervise.ly. If you don’t mind, can you also share your dataset? In the YOLO post, 5026 shared the dataset that they used to train their model, so maybe you can add their training set to yours.

I think I found the issue with your example. You shouldn’t need to use the bbox_round function that you have, because the DrawingPostprocessor is designed to take the input of the TensorFlowInferrer class. Did you have an issue with the input when using the DrawingPostprocessor class? If so, that’s a bug that I can easily fix.

edit: Also the dark teal color on the boxes seems to be my fault. I can fix that with any bugfix we do for this issue.

def run(self, image: np.ndarray, inference: dict) -> np.ndarray:
    """
    Draw the resulting boxes on an inferred image.

    :param image: Image to draw boxes on.
    :param inference: Output from an inference class.
    :return: Image with boxes drawn.
    """
    for i in inference["detections"]:
        cv2.rectangle(image, (i["bbox"][0], i["bbox"][1]), (i["bbox"][2], i["bbox"][3]), (125, 125, 0), thickness=2)
    return image

There is a bug: if the bounding box coordinates are not integers, it raises an error. I pasted the error below. You can reproduce it by simply commenting out the line that calls round_bbox().

(virtualenv) ➜  vtkdemo git:(master) ✗ python3 label.py 04.MP4 04_labeled2.MP4 frozen_inference_graph.pb
starting frame 0.0 of 4526.0
starting frame 1.0 of 4526.0
starting frame 2.0 of 4526.0
starting frame 3.0 of 4526.0
starting frame 4.0 of 4526.0
starting frame 5.0 of 4526.0
starting frame 6.0 of 4526.0
starting frame 7.0 of 4526.0
starting frame 8.0 of 4526.0
starting frame 9.0 of 4526.0
starting frame 10.0 of 4526.0
starting frame 11.0 of 4526.0
starting frame 12.0 of 4526.0
starting frame 13.0 of 4526.0
Traceback (most recent call last):
  File "label.py", line 34, in <module>
  File "label.py", line 25, in main
    output = postprocessor.run(None, frame, results)
  File "/Users/acate/Documents/Dev/vtkdemo/virtualenv/lib/python3.7/site-packages/vtk-1.0.0-py3.7.egg/vtk/postprocessors/draw.py", line 24, in run
TypeError: an integer is required (got type tuple)

Also, do you have a repo with your benchmarking code/training code?

Alright, I narrowed the issue down to mainly lighting in the case of your video not having great results.

I also now realize what the issue is with the “integer is required” error: you can’t draw a box with partial pixels (should have seen that coming from a mile away), so OpenCV was rejecting the pixels without rounding. (ninja edit: The fix is now live on GitHub.) Thanks for the feedback!
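For readers hitting the same error in their own drawing code, one way to guard against it is to round every coordinate to a whole pixel before handing it to OpenCV. This is only a sketch of the idea; `to_pixel_bbox` is a hypothetical helper, and the actual change made in VTK may look different.

```python
from typing import Sequence, Tuple


def to_pixel_bbox(bbox: Sequence[float]) -> Tuple[int, int, int, int]:
    """Round (xmin, ymin, xmax, ymax) to whole pixels for drawing."""
    if len(bbox) != 4:
        raise ValueError("expected four coordinates")
    xmin, ymin, xmax, ymax = (int(round(v)) for v in bbox)
    return xmin, ymin, xmax, ymax


# Usage with OpenCV (not executed here):
#   box = to_pixel_bbox(detection["bbox"])
#   cv2.rectangle(image, box[:2], box[2:], (125, 125, 0), thickness=2)
```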

Also I will create a Gist with the benchmarking code I used. Note that the benchmarks were done before I created VTK, so they are using raw TensorFlow code. Here you go: https://gist.github.com/nhubbard/32b873a43451f390b921d32de6421896

Thanks for making those updates! I would be surprised if lighting is causing the issue, since the Limelight is only on during the first 15 seconds of the match. Also, I was messing around with your benchmarking code, and I noticed something interesting. By default, OpenCV uses BGR for images, which is why you convert to RGB before doing any inference. However, when I did the inference without converting, I got much better results. Here I have attached two images, one where the color is RGB and the other where it is BGR. The difference is quite clear.
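To make the channel-order point concrete: switching between BGR and RGB is just a reversal of the last axis of the image array, which is what `cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)` does. The sketch below uses NumPy alone so it runs without OpenCV; the function name is made up for this example.

```python
import numpy as np


def swap_red_blue(image: np.ndarray) -> np.ndarray:
    """Reverse the channel axis: BGR -> RGB, or RGB -> BGR."""
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError("expected an H x W x 3 image")
    # Copy into contiguous memory, which some OpenCV calls expect.
    return np.ascontiguousarray(image[:, :, ::-1])
```

Running the same frame through a detector once as-is and once through `swap_red_blue` is a quick way to see which channel order a model responds to better.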

After this discovery, I decided to test my video out again. First, I ran it with normal RGB. Then, I ran it with BGR. The difference was extremely dramatic. I will post the links to the respective videos when they finish uploading. Here is a quick visualization though. Each dot is a frame without any detections, and a D is a frame with a detection. The first image is the RGB run, and the second is the BGR run. It is clear that there were way more detections in the BGR run.
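The dot-and-D visualization described above is easy to reproduce as plain text. The sketch assumes you have a per-frame count of detections, for example the length of each result's "detections" list; the function name and the line width are arbitrary choices for illustration.

```python
from typing import Iterable


def detection_timeline(counts: Iterable[int], width: int = 60) -> str:
    """Render one character per frame: 'D' if anything was detected, '.' if not."""
    if width <= 0:
        raise ValueError("width must be positive")
    marks = "".join("D" if count > 0 else "." for count in counts)
    # Wrap into rows so long videos stay readable.
    return "\n".join(marks[i:i + width] for i in range(0, len(marks), width))
```

With `width=60` and 60 FPS footage, each row corresponds to one second of video.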

Did you train your model on BGR images, instead of RGB images?

Update: Here are the two videos. They are the same original video, but the first was labeled using RGB, and the second was labeled using BGR. Not only are there more detections in the second, but the detections are more precise as well.

RGB: https://photos.app.goo.gl/e8iJhVMNaxW2qDhN9
BGR: https://photos.app.goo.gl/RXJjjhjempmXjCx38

The other thing is that the detector doesn’t seem to be able to recognize the other smaller balls or hatches in the scene. For the group of balls by the side of the HAB, this might be because they are all close together and consequently hard to distinguish.


Any reason why you didn’t use a U-Net here? They’re pretty much the go-to neural net architecture for segmentation tasks – and you could easily reframe this problem in terms of segmentation instead of object detection.

Surprisingly, no I didn’t - but it’s good that it worked better on BGR! I might add a test case to check for that in the VTK test suite. Regarding the range and smaller items in the frame, the detector has a theoretical range of 4 to 5 feet on a properly positioned camera. It might have changed the color space during the training process, or it might be detecting it based on the other defining qualities of the item, not just the main major color of the item. For example, the detector was able to find hatches based on the outlines alone, and if pretty close, it was able to identify the cargo based on the patterns on the rubber.

I might look into that. I hadn’t heard of U-Nets until your post mentioned them. Thanks!

I tested the detector on a solid color (picked from a ball in a frame of the video), and it detected a ball with the blue color, but not with the orange color. Here are the two detections:

Can you post your training set by any chance? I’m interested in comparing it to my videos.

I will have to locate the training dataset first… I don’t remember exactly which device I placed it on. It’s also about 1.5 GB, so I’ll need a lot of time to compress the dataset down to a reasonable size.

I also need to find a file host that can inexpensively manage such a large file.