Parallel Model Architecture Search With Training Jobs
This tutorial should cost less than 0.8 credits ($0.80) if you use the GTX 1060 instance type and the same training settings as the guide.
Some parts of this tutorial use advanced features that require the connection capability. Ensure that your environment meets the requirements and that you have installed all the prerequisites.
This tutorial uses the PyTorch example models to perform a parallel model architecture search on ImageNet.
Model Architecture Search
Using the same procedure as the getting started guide, you can use the PyTorch Examples Repository to test which computer vision model architecture is most promising.
Creating the Training Job
Log in to the trainML platform and click the Create a Training Job link on the Home screen or the Create button on the Training Jobs page to open a new job form. Enter a memorable name in the Job Name field, like PyTorch ImageNet Architecture Search. Select the cheapest available GPU Type and leave GPUs Per Worker set to 1.
In the Data section, select Public Dataset from the Dataset Type field, then select ImageNet from the Dataset field. This will automatically load the ImageNet dataset into the /opt/trainml/input directory of each job worker. Since this is a trainML-supplied dataset, you will incur no additional storage cost for using this data. For Output Type, select Local and specify an existing directory on your local computer, like ~/pytorch-example/output, in the Artifact output storage path field.
In the Model section, specify the URL of the PyTorch Examples repository: https://github.com/pytorch/examples.
The ImageNet example allows you to specify which model architecture to train with; you can see the available list here. This tutorial will compare resnet50, vgg16, wide_resnet50_2, and alexnet. To train all four models in parallel, select 4 from the Number of Workers field in the Workers section and ensure that Different Command for Each Worker is selected. Enter the following commands into the Command for Worker fields:
- Worker 1 (resnet50):
cd $TRAINML_OUTPUT_PATH && python $TRAINML_MODEL_PATH/imagenet/main.py --batch-size 32 --epochs 2 --arch resnet50 $TRAINML_DATA_PATH 2>&1 | tee $TRAINML_OUTPUT_PATH/train.log
- Worker 2 (vgg16):
cd $TRAINML_OUTPUT_PATH && python $TRAINML_MODEL_PATH/imagenet/main.py --batch-size 32 --epochs 2 --arch vgg16 --lr 0.01 $TRAINML_DATA_PATH 2>&1 | tee $TRAINML_OUTPUT_PATH/train.log
- Worker 3 (wide_resnet50_2):
cd $TRAINML_OUTPUT_PATH && python $TRAINML_MODEL_PATH/imagenet/main.py --batch-size 32 --epochs 2 --arch wide_resnet50_2 $TRAINML_DATA_PATH 2>&1 | tee $TRAINML_OUTPUT_PATH/train.log
- Worker 4 (alexnet):
cd $TRAINML_OUTPUT_PATH && python $TRAINML_MODEL_PATH/imagenet/main.py --batch-size 32 --epochs 2 --arch alexnet --lr 0.01 $TRAINML_DATA_PATH 2>&1 | tee $TRAINML_OUTPUT_PATH/train.log
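The four worker commands share a single template, differing only in the architecture name and an extra `--lr 0.01` for vgg16 and alexnet. A minimal sketch of that template (the `build_command` helper is hypothetical, written only to show the pattern; the `$TRAINML_*` variables are the preset ones from the commands above):

```shell
# Hypothetical helper: prints the worker command for a given architecture,
# with any extra flags (e.g. --lr 0.01) passed after the architecture name.
build_command() {
  local arch="$1"; shift
  local extra="$*"
  echo "cd \$TRAINML_OUTPUT_PATH && python \$TRAINML_MODEL_PATH/imagenet/main.py --batch-size 32 --epochs 2 --arch ${arch} ${extra} \$TRAINML_DATA_PATH 2>&1 | tee \$TRAINML_OUTPUT_PATH/train.log"
}

build_command resnet50
build_command vgg16 --lr 0.01
```

Seeing the commands this way makes it clear that only the `--arch` value (and the optional learning rate) distinguishes one worker from another.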
This makes use of the trainML preset environment variables to simplify specifying the data, model, and output directories. First, since the PyTorch example script saves its checkpoints to the current working directory, you need to set the working directory of the python process to TRAINML_OUTPUT_PATH in order for these checkpoints to be uploaded to your artifact storage location. To do this, the python command is prefixed with cd $TRAINML_OUTPUT_PATH && and the model file is called using its absolute path, $TRAINML_MODEL_PATH/imagenet/main.py. Second, since the PyTorch example does not create log output for TensorBoard or a similar framework (if you want to enable this, follow the guide here), you can save the script output to a file that will be uploaded to the artifact storage location by appending 2>&1 | tee $TRAINML_OUTPUT_PATH/train.log to the command. The tee command both saves the output to a file and continues to send it to standard out. If you redirect standard out without using tee, the file will be populated, but you will not see any output in the log stream viewer.
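The difference between tee and a plain redirect is easy to verify locally; a small sketch (the file names here are arbitrary):

```shell
# tee writes its input to a file AND passes it through to standard out:
echo "epoch 1 loss 2.31" | tee demo_tee.log        # line is printed and saved

# A plain redirect only writes the file; nothing reaches standard out:
echo "epoch 2 loss 1.87" > demo_redirect.log       # line is saved, not printed
```

This is why the worker commands use tee: the same output both lands in train.log for later analysis and appears live in the log stream viewer.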
The main part of the command follows the recommendations from the repository. The batch size is set to avoid running out of GPU memory, and the number of epochs is set low enough to show the accuracy difference between the models without using too many credits. The --arch option is what enables each worker to train a different model architecture.
Click Next to review your settings, then click Create to start the job.
Monitoring the Job Worker
To monitor the progress of the job, connect to it by following the instructions in the getting started guide. Once the workers initialize, you will begin to see output like the following:
Worker 1: Epoch: [0][200/791] Time 0.572 ( 0.222) Data 0.548 ( 0.176) Loss 3.0392e+00 (3.0217e+00) Acc@1 0.00 ( 5.16) Acc@5 28.12 ( 25.79)
Worker 1: Epoch: [0][210/791] Time 0.710 ( 0.227) Data 0.677 ( 0.181) Loss 3.0638e+00 (3.0218e+00) Acc@1 0.00 ( 5.15) Acc@5 12.50 ( 25.83)
Connect to the job using the connection utility so that the job workers can upload their results to your local computer.
When training completes, each worker outputs its final accuracy and loss metrics. If you return to the trainML dashboard, you will see that the job has stopped automatically.
Analyzing the Output
Depending on the size of the model artifact directory, it could take several minutes for the training output file to appear in the artifact storage location. The vgg16 model checkpoints, in particular, are large (almost 1 GB each), and the upload could take up to an hour.
Once the job terminates, check the output path you specified when creating the job. There should be four files with a naming convention of <job_name>_<worker_number>.zip, for example PyTorch_ImageNet_Architecture_Search.zip, PyTorch_ImageNet_Architecture_Search_2.zip, etc. Download these files and extract their contents.
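A short sketch for the extraction step, assuming the zips were downloaded to the tutorial's example output directory (overridable here via a hypothetical OUT_DIR variable). Python's built-in zipfile CLI is used so no separate unzip tool is needed:

```shell
# Extract each worker artifact into a folder named after the zip file.
out_dir="${OUT_DIR:-$HOME/pytorch-example/output}"   # assumed download location
for f in "$out_dir"/PyTorch_ImageNet_Architecture_Search*.zip; do
  [ -e "$f" ] || continue                  # skip cleanly if nothing matches yet
  python3 -m zipfile -e "$f" "${f%.zip}"   # "name.zip" -> folder "name"
done
```

Each resulting folder then holds that worker's train.log and checkpoint files.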
Each file contains a train.log file, as well as two checkpoint files: the last checkpoint generated (checkpoint.pth.tar) and the checkpoint with the best accuracy score (model_best.pth.tar). Open all four train.log files and scroll to the end of each file. The last line contains the final accuracy score for each model architecture. You will see something similar to the following at the end of each worker's file:
- Worker 1 (resnet50):
* Acc@1 39.700 Acc@5 75.700
- Worker 2 (vgg16):
* Acc@1 27.300 Acc@5 60.300
- Worker 3 (wide_resnet50_2):
* Acc@1 41.300 Acc@5 81.300
- Worker 4 (alexnet):
* Acc@1 16.500 Acc@5 55.700
The actual results you see can vary greatly from what is shown here. These results are for illustrative purposes only.
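Rather than opening each log by hand, the final summary line can be pulled out of all four files at once. A sketch, assuming each zip was extracted into a folder of the same name in the current directory:

```shell
# Print "folder: final Acc line" for each extracted worker artifact.
# The " * Acc@1" pattern matches the summary line, not the per-batch
# epoch lines (which lack the leading asterisk).
for log in PyTorch_ImageNet_Architecture_Search*/train.log; do
  [ -e "$log" ] || continue
  printf '%s: %s\n' "${log%/train.log}" "$(grep ' \* Acc@1' "$log" | tail -n 1)"
done
```

This gives a side-by-side view of the four architectures' final Acc@1 and Acc@5 scores.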
Given these results, wide_resnet50_2 seems to be the most promising model architecture for this problem. The next step would be to perform further hyperparameter tuning on this architecture. resnet50 could also be explored further, but vgg16 and alexnet could be rejected as candidates for now.
Cleaning Up
Once you are done with a job, click the Terminate button to remove the data, models, and output from the trainML platform.