Parallel Model Architecture Search With Training Jobs

This tutorial should cost less than 0.8 credits ($0.80) if you use the GTX 1060 instance type and the same training settings as the guide.


Some parts of this tutorial use advanced features that require the connection capability. Ensure that your environment meets the requirements and that you have installed all the prerequisites.

This tutorial uses the PyTorch example models to perform a parallel model architecture search on ImageNet.

Using the same procedure as the getting started guide, you can use the PyTorch Examples repository to test which computer vision model architecture is most promising.

Creating the Training Job

Log in to the trainML platform and click the Create a Training Job link on the Home screen or the Create button on the Training Jobs page to open a new job form. Enter a memorable name in the Job Name field, like PyTorch ImageNet Architecture Search. Select the cheapest available GPU Type and leave the GPUs Per Worker as 1.

In the Data section, select Public Dataset from the Dataset Type field, then select ImageNet from the Dataset field. This will automatically load the ImageNet dataset into the /opt/trainml/input directory of each job worker. Since this is a trainML supplied dataset, you will incur no additional storage cost for using this data. For Output Type, select Local and specify an existing directory on your local computer, like ~/pytorch-example/output, in the Artifact output storage path field.

In the Model section, specify the URL of the PyTorch Examples repository,

The ImageNet example allows you to specify which model architecture to train with. You can see the available list here. This tutorial will compare resnet50, vgg16, wide_resnet50_2, and alexnet. To train all four models in parallel, select 4 from the Number of Workers field in the Workers section and ensure that Different Command for Each Worker is selected. Enter the following commands into the Command for Worker fields:

  • Worker 1 (resnet50): cd $TRAINML_OUTPUT_PATH && python $TRAINML_MODEL_PATH/imagenet/ --batch-size 32 --epochs 2 --arch resnet50 $TRAINML_DATA_PATH 2>&1 | tee $TRAINML_OUTPUT_PATH/train.log
  • Worker 2 (vgg16): cd $TRAINML_OUTPUT_PATH && python $TRAINML_MODEL_PATH/imagenet/ --batch-size 32 --epochs 2 --arch vgg16 --lr 0.01 $TRAINML_DATA_PATH 2>&1 | tee $TRAINML_OUTPUT_PATH/train.log
  • Worker 3 (wide_resnet50_2): cd $TRAINML_OUTPUT_PATH && python $TRAINML_MODEL_PATH/imagenet/ --batch-size 32 --epochs 2 --arch wide_resnet50_2 $TRAINML_DATA_PATH 2>&1 | tee $TRAINML_OUTPUT_PATH/train.log
  • Worker 4 (alexnet): cd $TRAINML_OUTPUT_PATH && python $TRAINML_MODEL_PATH/imagenet/ --batch-size 32 --epochs 2 --arch alexnet --lr 0.01 $TRAINML_DATA_PATH 2>&1 | tee $TRAINML_OUTPUT_PATH/train.log

This makes use of the trainML preset environment variables to make specifying the data, model, and output directories easier. First, since the PyTorch Example script saves its checkpoints to the current working directory, you need to set the working directory of the python process to TRAINML_OUTPUT_PATH so that these checkpoints are uploaded to your artifact storage location. To do this, the python command is prefixed with cd $TRAINML_OUTPUT_PATH && and the model file is called using its absolute path, $TRAINML_MODEL_PATH/imagenet/. Second, since the PyTorch Example does not create log output for TensorBoard or a similar framework (if you want to enable this, follow the guide here), you can save the script output to a file that will be uploaded to the artifact storage location by appending 2>&1 | tee $TRAINML_OUTPUT_PATH/train.log to the command. The tee command allows you to both save the output to a file and continue to send the output to standard out. If you redirect standard out without using the tee command, the file will be populated, but you will not see any output in the log stream viewer.
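The structure of the four worker commands described above can be sketched as a small Python helper. This is only an illustration for generating the text to paste into the Command for Worker fields; the names ARCH_FLAGS, SCRIPT, and build_worker_command are hypothetical and not part of trainML or the PyTorch examples. The script location is copied exactly as it appears in the commands above.

```python
# Extra flags per architecture, copied from the worker commands in this tutorial.
ARCH_FLAGS = {
    "resnet50": "",
    "vgg16": "--lr 0.01",
    "wide_resnet50_2": "",
    "alexnet": "--lr 0.01",
}

# Script location exactly as written in the tutorial's commands (assumption:
# adjust it if your copy of the repository needs a different path).
SCRIPT = "$TRAINML_MODEL_PATH/imagenet/"


def build_worker_command(arch, extra_flags="", batch_size=32, epochs=2):
    """Build one worker command string (hypothetical helper)."""
    parts = [
        "python", SCRIPT,
        "--batch-size", str(batch_size),
        "--epochs", str(epochs),
        "--arch", arch,
    ]
    if extra_flags:
        parts.append(extra_flags)
    parts.append("$TRAINML_DATA_PATH")
    train = " ".join(parts)
    # cd first so checkpoints land in the output path, then tee the log so it
    # is both saved to train.log and still visible in the log stream.
    return (
        f"cd $TRAINML_OUTPUT_PATH && {train} "
        "2>&1 | tee $TRAINML_OUTPUT_PATH/train.log"
    )


commands = {
    arch: build_worker_command(arch, flags) for arch, flags in ARCH_FLAGS.items()
}

if __name__ == "__main__":
    for number, (arch, command) in enumerate(commands.items(), start=1):
        print(f"Worker {number} ({arch}): {command}")
```

Keeping the per-architecture flags in one dictionary makes it easy to add a fifth architecture without retyping the shared prefix and suffix.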

The main part of the command uses the recommendations from the repository. The batch size is set to avoid running out of GPU memory, and the number of epochs is set to show the accuracy difference between the models without using too many credits. The --arch option is what enables each worker to train a different model architecture.

Click Next to review your settings. Click Create to start the job.

Monitoring the Job Worker

To monitor the progress of the job, connect to the job by following the instructions in the getting started guide. Once the workers have initialized, you will begin to see output like the following:

Worker 1: Epoch: [0][200/791]	Time  0.572 ( 0.222)	Data  0.548 ( 0.176)	Loss 3.0392e+00 (3.0217e+00)	Acc@1   0.00 (  5.16)	Acc@5  28.12 ( 25.79)
Worker 1: Epoch: [0][210/791] Time 0.710 ( 0.227) Data 0.677 ( 0.181) Loss 3.0638e+00 (3.0218e+00) Acc@1 0.00 ( 5.15) Acc@5 12.50 ( 25.83)
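If you want to track these progress lines programmatically, for example from a saved train.log, a regular expression is usually enough. The sketch below assumes the line layout shown in the sample output above; the names PROGRESS_RE and parse_progress_line are hypothetical and not part of trainML or the PyTorch example.

```python
import re

# Matches the progress-line layout shown above; tolerant of tabs or spaces.
# Each metric is printed as "current (running average)".
PROGRESS_RE = re.compile(
    r"Epoch:\s*\[(?P<epoch>\d+)\]\[(?P<batch>\d+)/(?P<total>\d+)\]"
    r".*?Loss\s+(?P<loss>\S+)\s+\(\s*(?P<loss_avg>[^)\s]+)\)"
    r".*?Acc@1\s+(?P<acc1>\S+)\s+\(\s*(?P<acc1_avg>[^)\s]+)\)"
    r".*?Acc@5\s+(?P<acc5>\S+)\s+\(\s*(?P<acc5_avg>[^)\s]+)\)"
)

INT_FIELDS = ("epoch", "batch", "total")


def parse_progress_line(line):
    """Return a dict of metrics for one progress line, or None if no match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        name: int(value) if name in INT_FIELDS else float(value)
        for name, value in fields.items()
    }


if __name__ == "__main__":
    sample = (
        "Worker 1: Epoch: [0][200/791]\tTime  0.572 ( 0.222)\t"
        "Data  0.548 ( 0.176)\tLoss 3.0392e+00 (3.0217e+00)\t"
        "Acc@1   0.00 (  5.16)\tAcc@5  28.12 ( 25.79)"
    )
    print(parse_progress_line(sample))
```

The running averages (the values in parentheses) are generally more informative than the per-batch values when comparing workers mid-run.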

Connect to the job using the connection utility so that the job workers can upload their results to your local computer.

When training completes, the workers output the final accuracy and loss metrics. If you return to the trainML dashboard, you will see the job has stopped automatically.

Analyzing the Output

Depending on the size of the model artifact directory, it could take several minutes for the training output file to appear in the artifact storage location. The vgg16 model checkpoints, in particular, are large (almost 1 GB each) and the upload could take up to an hour.

Analyzing Results

Once the job terminates, check the output path you specified when creating the job. There should be four files following the naming convention <job_name>_<worker_number>.zip. Download these files and extract their contents.
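Extracting the four archives by hand works fine, but the step can also be scripted. The following is a minimal sketch that assumes the archives follow the <job_name>_<worker_number>.zip convention described above and sit directly in the output directory; the function name extract_worker_archives is hypothetical.

```python
import zipfile
from pathlib import Path


def extract_worker_archives(output_dir):
    """Extract each <job_name>_<worker_number>.zip into its own folder.

    Returns a dict mapping worker number to the extraction directory.
    """
    output_dir = Path(output_dir).expanduser()
    extracted = {}
    for archive in sorted(output_dir.glob("*_*.zip")):
        # Treat the text after the last underscore as the worker number.
        worker = archive.stem.rsplit("_", 1)[1]
        if not worker.isdigit():
            continue  # not a worker archive under the assumed convention
        target = output_dir / archive.stem
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted[int(worker)] = target
    return extracted


if __name__ == "__main__":
    # Path taken from the example output directory used earlier in this tutorial.
    for worker, folder in extract_worker_archives("~/pytorch-example/output").items():
        print(f"Worker {worker}: {folder}")
```

Extracting each archive into a folder named after it keeps the four train.log files from overwriting one another.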

Each file contains a train.log file, as well as two checkpoint files: checkpoint.pth.tar, the last checkpoint generated, and model_best.pth.tar, the checkpoint with the best accuracy score. Open all four train.log files and scroll to the end of each file. The last line contains the final accuracy score for each of the model architectures. You will see something similar to the following at the end of each worker's file:

  • Worker 1 (resnet50): * Acc@1 39.700 Acc@5 75.700
  • Worker 2 (vgg16): * Acc@1 27.300 Acc@5 60.300
  • Worker 3 (wide_resnet50_2): * Acc@1 41.300 Acc@5 81.300
  • Worker 4 (alexnet): * Acc@1 16.500 Acc@5 55.700

The actual results you see can vary greatly from what is shown here. These results are for illustrative purposes only.

Given these results, wide_resnet50_2 seems to be the most promising model architecture for this problem. The next step would be to perform further hyperparameter tuning on this architecture. resnet50 could also be explored further, but vgg16 and alexnet could be rejected as candidates for now.
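The comparison above can also be done programmatically once the logs are extracted. The sketch below assumes the summary line format shown in the results list and uses the illustrative numbers from this tutorial as sample input; the names FINAL_RE, final_accuracy, and rank_by_top1 are hypothetical.

```python
import re

# Matches a summary line such as " * Acc@1 39.700 Acc@5 75.700".
FINAL_RE = re.compile(r"\*\s*Acc@1\s+([\d.]+)\s+Acc@5\s+([\d.]+)")


def final_accuracy(log_text):
    """Return (top1, top5) from the last summary line in a log, or None."""
    matches = FINAL_RE.findall(log_text)
    if not matches:
        return None
    top1, top5 = matches[-1]
    return float(top1), float(top5)


def rank_by_top1(results):
    """Return architecture names sorted by top-1 accuracy, best first."""
    return sorted(results, key=lambda arch: results[arch][0], reverse=True)


# Illustrative log endings copied from the results listed in this tutorial.
# In practice, read each worker's extracted train.log instead.
example_logs = {
    "resnet50": " * Acc@1 39.700 Acc@5 75.700",
    "vgg16": " * Acc@1 27.300 Acc@5 60.300",
    "wide_resnet50_2": " * Acc@1 41.300 Acc@5 81.300",
    "alexnet": " * Acc@1 16.500 Acc@5 55.700",
}

results = {arch: final_accuracy(text) for arch, text in example_logs.items()}
ranking = rank_by_top1(results)

if __name__ == "__main__":
    for arch in ranking:
        top1, top5 = results[arch]
        print(f"{arch}: Acc@1 {top1:.3f} Acc@5 {top5:.3f}")
```

Ranking by top-1 accuracy is one reasonable choice; with only two epochs of training, small gaps between architectures should be treated with caution.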

Cleaning Up

Once you are done with a job, click the Terminate button to remove the data, models, and output from the trainML platform.