Any success with Axon?

Has anyone had any luck with the new Axon tool from WPILib?

I was able to get the tool up and running, but I haven’t had any success creating trained models that can identify anything. I created some datasets from Open Images, but my model was not able to identify the objects reliably in a test video. I also tried creating a couple of datasets with Supervisely, but I was not able to train models with them; they never generate any checkpoints in Axon.

I also have not been able to successfully generate any checkpoints. I have created three datasets (one with 15 images, one with 61 images, and one with 140 images) using Supervisely. If I attempt to run Axon to do the training, either A) it finishes after 2 minutes and says it is done, but there are no checkpoints, or B) it runs forever and continues to say 0% complete. The longest I let it run was 5 days - it still said 0% complete even though it was using a lot of CPU - and this was on a pretty powerful laptop, although it does not have a GPU. Team 7028 says they had similar results. Team 2974 says they got a dataset with 7 images to produce a checkpoint after 2-3 days of running, but when they let me borrow their dataset I could not get it to produce a checkpoint, even after letting it run for 5 days. I have tried running on a Mac and also on a Windows machine. I created a TeamForge ticket on the FRC Beta board for this on 12/26/21, but never got any response.

Thanks @tomjwinter. I was able to get Axon to train a model with Open Images, but the model was not able to identify the objects when it was done.
I am from team 7028, and I am the one who replied to your post on TeamForge :slight_smile:

I’ve had a lot of success with Axon. One of the things to note is that it seriously takes a long time and a huge amount of compute to actually train these models. If you want to see progress I would suggest setting the checkpoint generation to a really low value, but in reality it just takes forever regardless. Also, the default batch size is very small, and I found that upping it has a really large impact on how fast your model trains (although you do need a lot of system memory). Right now, running a Ryzen 5950X with 16 cores and 64 GB of RAM gets me through around 200 epochs in ~2 hours with the batch size set to 200, evaluation frequency set to 10, and percent evaluation set to 70% (my dataset is around 2000 images). The model itself, once trained, is actually pretty damn good! I’m looking forward to trying it out this year! Also, P.S.: you seriously do need the TPU; without it the fps is horrible. I get about 16 fps on a Pi 4 with the Coral TPU.

Also I did manage to train this on my laptop as well, but it did take around 6 hours (8 core 5900hs and 40gb of ram) - epoch generation at 5, and batch size set to 120.
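
For a rough sense of what those numbers imply, here is a back-of-the-envelope sketch. It assumes the textbook definition of an epoch (one full pass over the dataset), which may not match how Axon counts epochs internally, and the function names are made up for illustration.

```python
import math


def steps_per_epoch(num_images: int, batch_size: int) -> int:
    """Batches needed to see every image once (assumes textbook epochs)."""
    return math.ceil(num_images / batch_size)


def seconds_per_epoch(total_hours: float, epochs: int) -> float:
    """Average wall-clock time per epoch for a finished training run."""
    return total_hours * 3600 / epochs


# Figures quoted above: ~2000 images, batch size 200, ~200 epochs in ~2 hours.
print(steps_per_epoch(2000, 200))   # 10
print(seconds_per_epoch(2, 200))    # 36.0
```

Plugging in your own dataset size and batch size gives a quick sanity check on whether a run that has shown no progress after many hours is plausibly just slow or more likely stuck.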

Note you don’t need a supercomputer either. I’m running an overclocked Ryzen 5 3600 with 16 GB of memory, and I made a very basic program that could track a stationary cargo at almost 90% (assuming constant distance in a rotation circle) with just 40 images from Supervisely. I used MUCH smaller batch sizes, 10-20, and could run 200 epochs in an hour. Do note that large batch sizes are what really kill Axon. It always maxed out my memory, but at too high a batch size it would fail to load and be stuck at zero, similar to what Tom had mentioned.

Axon (on a Windows VM) seems to have no access to the GPU, which really sucks (if someone knows a fix for this lemme know), so yes, it’ll pin your CPU. However, it should go in cycles. If you pull up Task Manager or equivalent and view all the threads, each core/thread should spike and then decrease as it processes an epoch. If every core stays pinned at 100%, Axon got stuck somewhere; wait a few minutes to see whether it recovers, then restart if needed. I think this is due to memory issues, as it happens more often with larger batch sizes, but this is a baseless observation.
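
The "cycles vs. pinned" check described above can be sketched as a small helper. This is only an illustration: the function name, the 95% threshold, and the number of samples are arbitrary assumptions, and the per-core readings have to come from somewhere else (for example `psutil.cpu_percent(percpu=True)` polled every 30 seconds or so, if you have psutil installed).

```python
from typing import Sequence


def looks_stuck(
    samples: Sequence[Sequence[float]],
    threshold: float = 95.0,   # assumed cutoff for "pinned"
    min_samples: int = 6,      # assumed number of consecutive readings
) -> bool:
    """Return True if every core stayed at or above `threshold`
    in each of the last `min_samples` readings.

    `samples` is a list of readings; each reading is a list of
    per-core CPU percentages.
    """
    if len(samples) < min_samples:
        return False  # not enough history to judge
    recent = samples[-min_samples:]
    return all(
        len(cores) > 0 and min(cores) >= threshold
        for cores in recent
    )


# Healthy training: cores spike and drop between readings.
healthy = [[100, 100, 40, 100], [20, 30, 100, 10]] * 3
# Suspicious: every core pinned in every reading.
pinned = [[100, 99, 100, 98]] * 6

print(looks_stuck(healthy))  # False
print(looks_stuck(pinned))   # True
```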

There are some bugs. Sometimes the progress bar just fails to keep going, but it will still produce evaluations. Like I said, it’ll randomly and permanently pin the CPU at 100%. But it seems to work. Tinker with it a bit.

OH. Note: if you’re on Windows and running it on the Docker VM, you need to enable virtualization in your BIOS. If you’re unsure how to do this, Google your motherboard and see where the virtualization setting for your CPU is. Or go into the BIOS and search around. Some proprietary boards may just have it off permanently.

I’m testing a Pi4b with 8gbs ram tonight to see how it can handle without a TPU since you can’t buy them ATM lol.

Thanks for all of the suggestions! I have finally gotten my first checkpoints after making a bunch of changes to the parameters. I am going to experiment a bit tomorrow to figure out which parameters allowed me to go from “stuck forever” to “training successfully”. I will report back when I have more info.

So I just ran some tests with the pi4b, and well, I got on average ~3.5fps at 480p without a tpu lol. I wouldn’t recommend following this route unless you can get your hands on a Coral TPU.

Looks like the Coral TPU is going for around $175 on eBay… steep markup vs the $60 retail!

How much is it needed? Is running a model slow on a raspi?

Edit: Just saw Falln’s comment above lol

Is there an alternative to the Coral TPU that can be used?

Don’t need to use a Coral at all. Just a raspi would work fine. If you can get a coral, that would help. It uses special drivers, so there is no direct replacement.

Excellent post! I guess there are a lot more errors relating to OOM I need to catch. Python is not easy on me.

How high is your resolution? I was averaging about 30fps, I think with a Pi 3? *with a Coral

If the cargo isn’t moving, even with the lower frame rates you can run your control loops using other sensors and just use the camera for setting targets. Use a gyro (navx) for steering and encoders for distance. They update very fast. Then take new updates from the model to set the heading and distance.
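
As a toy illustration of that split, here is a self-contained simulation, not robot code. Everything in it is an assumption made for the sketch: the function name, the gain, the 14-tick vision period (roughly a 50 Hz loop with a ~3.5 fps camera), and the one-line "drivetrain" model. On a real robot the heading would come from the gyro and the offset from the model's detections.

```python
def simulate_turn(
    target_heading_deg: float,
    vision_period_ticks: int = 14,  # assumed: slow camera, fast loop
    ticks: int = 200,
    kp: float = 0.1,                # assumed proportional gain
) -> float:
    """Turn toward a stationary target using a slow vision setpoint
    and a fast gyro-based proportional loop. Returns final heading."""
    heading = 0.0    # stand-in for the gyro reading, updated every tick
    setpoint = 0.0   # desired heading, refreshed only when vision reports

    for tick in range(ticks):
        if tick % vision_period_ticks == 0:
            # Slow path: the model reports the angle to the target
            # relative to the robot; convert it to an absolute heading.
            offset = target_heading_deg - heading
            setpoint = heading + offset

        # Fast path: steer on the gyro every tick.
        error = setpoint - heading
        heading += kp * error  # toy drivetrain response, not real physics

    return heading


final = simulate_turn(30.0)
print(abs(final - 30.0) < 0.01)  # True
```

The point of the structure is that the fast loop never waits on the camera: it keeps steering toward the last setpoint, and each new detection simply replaces that setpoint.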

I wish I could use Axon with the Jetson TX2 we bought and never used!

You definitely can, you just need to write some code yourself :).

The key components are installing NetworkTables, cscore, and the tflite runtime. I have never done it before on a Jetson.

Could you maybe give a little more info on how to do this or add a link for some useful sites? The coral edge tpu and a Jetson nano are around the same price right now and the Jetson is much more powerful, so I’d rather buy a Jetson if the team allows it. (I’ve also already looked into other programs other than axon for ml, but at that point, I might just use a pixycam and use hsv filtering lol)

I believe I was running it at 640 x 480

I had some interesting results when doing testing today. Rather than trying to interpret them for you, I will just give you the raw data.

I have 4 different Axon Datasets, and 4 Axon projects (1 for each dataset). The tests in this chart were run in the order listed. You will notice that for project 3, at first it failed, but after finding parameters that worked, even the original parameters now worked too.

Projects 1 and 2: I could get these to work if I found the right parameters. They failed again if I set the parameters back to the original parameters.
Project 3: I could get this to work if I changed the parameters. It continued to work even after I set the parameters back to the original parameters.
Project 4: I could not get this to work, even with changing the parameters

Also, the % complete or the “Epoch X / Y” does not seem to be very accurate. A better gauge of progress seems to be to just see how many epochs are listed in the graph.

I ran these tests on a 2019 MacBook Pro with 16GB of memory, a 2.3 GHz 8-Core Intel Core i9 processor, and an AMD Radeon Pro 5500M 4 GB GPU.

The raw data is provided as an image.

My labels got cut off a bit. The times in that chart are in minutes. So all of my successful trainings took somewhere from 8 to 20 minutes.