Unable to train on AWS SageMaker

I am following the WPILib tutorial for machine learning. I’ve gotten to step 6 of the training instructions, but I keep getting the following error:

An error occurred (ValidationException) when calling the CreateTrainingJob operation: Access denied for repository: wpi-cpu in registry ID: 249838237784. Please check if your ECR repository and image exist and role arn:aws:iam::416515258441:role/service-role/AmazonSageMaker-ExecutionRole-20200126T110350 has proper pull permissions for SageMaker: ecr:BatchCheckLayerAvailability, ecr:BatchGetImage, ecr:GetDownloadUrlForLayer

I am using SageMaker in the us-west-1 region.
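For reference, the three permissions named in the error can be written out as an IAM policy document. The following is only a sketch of what the role side of that permission would look like, not a confirmed fix. The registry ID and repository name are taken from the error message; the repository's region is an assumption (us-west-1, matching the SageMaker region mentioned above), and `ecr_pull_policy` is a hypothetical helper name. Because the repository belongs to a different account (249838237784) than the execution role (416515258441), the repository owner would also have to allow the pull on their side, so attaching this to the role alone may not resolve the error.

```python
import json

# The three ECR actions listed in the error message.
PULL_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
]


def ecr_pull_policy(region, registry_id, repository):
    """Build an IAM policy document allowing image pulls from one ECR repository.

    Hypothetical helper for illustration; it only builds the JSON document
    and does not call AWS.
    """
    resource = f"arn:aws:ecr:{region}:{registry_id}:repository/{repository}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": PULL_ACTIONS,
                "Resource": resource,
            }
        ],
    }


# Registry ID and repository name come from the error message;
# the region is an assumption.
policy = ecr_pull_policy("us-west-1", "249838237784", "wpi-cpu")
print(json.dumps(policy, indent=2))
```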

We are aware that you will get an exception while trying to train your model with the Jupyter notebook. The WPILib team and Amazon are working hard on a solution and hope to have something in the next few days. Until we do, you will not be able to train your models. We will post the status here as it changes, so please check back often. We are very sorry that this issue occurred and hope to get it resolved as quickly as possible.

That is per the frc-docs page.


Is there any known ETA for when this will be fixed?

There is no set date yet. We will keep the community updated. Sorry for the inconvenience.

I managed to get past the original error by building the image myself. However, when I begin to train, I get the following in the logs:

Downloading model.
Successfully created the TFRecords: /opt/ml/input/data/training/train.record.
Successfully created the TFRecords: /opt/ml/input/data/training/eval.record.
Records generated.
Hyperparameters parsed.
Traceback (most recent call last):
  File "accuracy.py", line 59, in <module>
    main()
  File "accuracy.py", line 30, in main
    checkpoint_max = max(checkpoint_nbs)
ValueError: max() arg is an empty sequence
Beginning training on Docker image.
cp: cannot stat ‘./learn/models/output_tflite_graph.tflite’: No such file or directory
Converting checkpoint to tflite.
rm: cannot remove ‘/opt/ml/model/output_tflite_graph_edgetpu.log’: No such file or directory
mv: cannot stat ‘/opt/ml/model/output_tflite_graph_edgetpu.tflite’: No such file or directory
Compiling model for Edge TPU

And then the error I get from Python is:
UnexpectedStatusException: Error for Training job wpi-cpu-2020-01-27-00-30-28-712: Failed. Reason: AlgorithmError: Exit Code: 1

I’m not sure if this was the error that was expected or if it helps at all.
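For anyone reading the traceback: the `ValueError` is what Python raises when `max()` is given an empty sequence, which suggests that `checkpoint_nbs` was empty, i.e. no checkpoints were found when `accuracy.py` ran. Why none were found is not clear from the logs. The sketch below only reproduces that behavior and shows one way to guard against it. I have not seen the contents of `accuracy.py`, so `latest_checkpoint` is a hypothetical helper, not the script's actual code.

```python
def latest_checkpoint(checkpoint_nbs):
    """Return the highest checkpoint number, or None if the list is empty.

    Hypothetical helper: max() with default= avoids the ValueError
    that a bare max() raises on an empty sequence.
    """
    return max(checkpoint_nbs, default=None)


# Reproduce the failure from the log: a bare max() on an empty list.
raised = False
try:
    max([])
except ValueError:
    raised = True

print("bare max([]) raised ValueError:", raised)
print("guarded, empty list:", latest_checkpoint([]))
print("guarded, some checkpoints:", latest_checkpoint([100, 500, 250]))
```

A guard like this would not make training succeed; it would only let the script report "no checkpoints" instead of crashing, which makes the underlying problem easier to see.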

As others alluded to, WPI, FIRST, and Amazon are changing the way our accounts are set up. Until that is done, we cannot run our notebooks. Though there is no ETA, there was a seminar in NH on Friday, and they were hopeful it would be soon.

Our team has been monitoring the frc-docs page, and there doesn’t seem to be anything new there.

Has anyone else heard anything that might suggest whether there’s been any progress on this?

I have not heard of any official progress. However, I am currently working on a Docker image to train a model. It’s based on the example provided by Google Coral, but modified to use our training data.

I am planning to release our training data, model, and training image as soon as I get it working.


That would be awesome!
