Hyperparameter Tuning an Object Detection Model with MXNet on PASCAL VOC
This tutorial should cost less than 18 credits ($18) if you use the GTX 1060 or RTX 2060 Super instance type and the same settings as the guide. Tuning takes approximately 15 hours using 4 RTX 2060 Super GPUs or 30 hours using 4 GTX 1060 GPUs.
Some parts of this tutorial use advanced features that require the connection capability. Ensure that your environment meets the requirements and that you have installed all the prerequisites.
This tutorial uses the MXNet Faster-RCNN object detection example to perform a parallelized hyperparameter tuning job on PASCAL VOC using the hyperopt library.
Before beginning this tutorial, ensure you have created an account on the trainML platform.
Local Environment Setup
If you want to minimize the steps in this tutorial, you can clone our example code and skip to Running the Experiment.
To run this tutorial from scratch, create an empty git repository and a new Python environment (3.7 or later). Install the project dependencies:
pip install mxnet gluoncv hyperopt pymongo
or
pip install -r requirements.txt
if you are using our example code.
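If you are building the repository from scratch and want your own requirements.txt, it might contain something like the following. The version pins are assumptions for illustration only; use whatever versions work in your environment:
mxnet==1.6.0
gluoncv==0.7.0
hyperopt==0.2.4
pymongo==3.10.1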
If you are not using our code repository, add the MongoDB data directory (created later in the tutorial) to your .gitignore file:
/bin/bash -c 'cat << EOF >> .gitignore
mongodb/
EOF'
Hyperparameter Search Specification
If you are using our example code, skip to Running the Experiment.
Download version 0.7.0 of the MXNet Faster-RCNN object detection example script to the root of your code repository:
wget https://raw.githubusercontent.com/dmlc/gluon-cv/v0.7.0/scripts/detection/faster_rcnn/train_faster_rcnn.py
This file is our model code and exposes a variety of hyperparameters and configurable variables we can use in the tuning process. Only one change to this file is necessary. Replace lines 466-470:
if args.amp:
    with amp.scale_loss(total_loss, self._optimizer) as scaled_losses:
        autograd.backward(scaled_losses)
else:
    total_loss.backward()
with
total_loss.backward()
Create a new file called tune.py in the root of the repository. Add the following imports at the top:
import pickle
import time
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from hyperopt.mongoexp import MongoTrials
from types import SimpleNamespace
import sys
import mxnet as mx
from math import log
from gluoncv import utils as gutils
from gluoncv.model_zoo import get_model
from gluoncv.data.transforms.presets.rcnn import (
    FasterRCNNDefaultTrainTransform,
    FasterRCNNDefaultValTransform,
)
from train_faster_rcnn import get_dataset, get_dataloader, train
Objective Function
As per the hyperopt documentation, the first thing we need to do is define the objective function. The train_faster_rcnn.py script was designed to be run directly from the command line using command line arguments, so most of the initialization logic is stored in the if __name__ == '__main__': block. Since hyperopt requires the training process to be fully contained inside a callable function capable of returning a loss metric, most of this logic must be replicated in our objective function.
Define the objective function as the following:
def objective(space):
    sys.setrecursionlimit(1100)
    args = SimpleNamespace(**space)
    # fix seed for mxnet, numpy and python builtin random generator.
    gutils.random.seed(args.seed)
    ctx = [mx.gpu(int(i)) for i in args.gpus.split(",") if i.strip()]
    ctx = ctx if ctx else [mx.cpu()]
    # training data
    train_dataset, val_dataset, eval_metric = get_dataset(args.dataset, args)
    # network
    kwargs = {}
    module_list = []
    if args.use_fpn:
        module_list.append("fpn")
    if args.norm_layer is not None:
        module_list.append(args.norm_layer)
        if args.norm_layer == "syncbn":
            kwargs["num_devices"] = len(ctx)
    num_gpus = len(ctx)
    net_name = "_".join(("faster_rcnn", *module_list, args.network, args.dataset))
    net = get_model(
        net_name,
        pretrained_base=True,
        per_device_batch_size=args.batch_size // num_gpus,
        **kwargs
    )
    args.save_prefix += net_name
    for param in net.collect_params().values():
        if param._data is not None:
            continue
        param.initialize()
    net.collect_params().reset_ctx(ctx)
    batch_size = args.batch_size
    train_data, val_data = get_dataloader(
        net,
        train_dataset,
        val_dataset,
        FasterRCNNDefaultTrainTransform,
        FasterRCNNDefaultValTransform,
        batch_size,
        len(ctx),
        args,
    )
    # training
    train(net, train_data, val_data, eval_metric, batch_size, ctx, args)
    name, values = eval_metric.get()
    idx = name.index('mAP')
    return {
        "loss": 1 - values[idx],
        "status": STATUS_OK,
        "eval_time": time.time(),
    }
Almost all the lines from sys.setrecursionlimit(1100) to train(net, train_data, val_data, eval_metric, batch_size, ctx, args) are directly copied from the train_faster_rcnn.py script. Some lines have been removed because that functionality will not be used in this example. One notable change is that the second line args = parse_args() has been changed to args = SimpleNamespace(**space). This enables us to substitute a dictionary of options for the command line arguments the script is expecting.
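To see why this substitution works, note that SimpleNamespace turns dictionary keys into attributes, so the copied code can keep reading values like args.batch_size exactly as if they had come from argparse. A minimal sketch, using only a hypothetical subset of the real search space:
from types import SimpleNamespace

# A small, illustrative subset of the search space dictionary.
space_sample = {"batch_size": 2, "lr": 0.001, "gpus": "0"}

args = SimpleNamespace(**space_sample)
print(args.batch_size)  # 2 -- attribute access, just like argparse.Namespace
print(args.lr)          # 0.001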
The train_faster_rcnn.py script does not directly return any statistics about the model's performance, which hyperopt requires to perform the tuning. Instead, you can access any configured metrics through the eval_metric object. In this example, the object contains the average precision for each class as well as the Mean Average Precision (mAP). The lines following the train call obtain the mAP metric and set the training run's loss to one minus that value. Hyperopt is a minimizer, so we want it to minimize the model's imprecision.
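For example, if eval_metric.get() returned names and values like the following (the numbers are invented for illustration), the objective would locate the mAP entry and report 1 - mAP as the loss:
# Illustrative shape of eval_metric.get() output; the numbers are made up.
name = ["aeroplane", "bicycle", "person", "mAP"]
values = [0.61, 0.58, 0.72, 0.6054]

idx = name.index("mAP")
loss = 1 - values[idx]  # 0.3946 -- a lower loss means a higher mAP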
Defining the Search Space
The next step is to provide a list of hyperparameters for hyperopt to optimize over. In this case, since we are overriding the command line arguments the script is expecting, we have to include a long list of static parameters. Add the following to your tune.py file.
space = {
    'network': 'resnet50_v1b',
    'dataset': 'voc',
    'save_prefix': '',
    'horovod': False,
    'amp': False,
    'resume': False,
    'start_epoch': 0,
    'verbose': False,
    'custom_model': False,
    'kv_store': 'nccl',
    'log_interval': 100,
    'save_interval': 1,
    'val_interval': 1,
    'disable_hybridization': False,
    'static_alloc': False,
    'seed': 233,
    'mixup': False,
    'norm_layer': None,
    'use_fpn': False,
    'num_workers': 4,
    'gpus': '0',
    'executor_threads': 1,
    'epochs': 1,
    'batch_size': 2,
    'lr': 0.001,
    'lr_decay': 0.1,
    'lr_decay_epoch': '14,20',
    'lr_warmup': -1,
    'lr_warmup_factor': 1. / 3.,
    'momentum': hp.uniform('momentum', 0, 1),
    'wd': hp.loguniform('wd', log(1e-5), log(100)),
    'rpn_smoothl1_rho': 1. / 9.,
    'rcnn_smoothl1_rho': 1.,
}
If you are using a GTX 1060, change batch_size to 1.
With the above example, only the momentum and wd parameters are included in the hyperparameter tuning, because they are defined as hyperopt stochastic expressions. You can define additional parameters like rpn_smoothl1_rho or rcnn_smoothl1_rho similarly, as shown in the sketch below. The number of hyperparameters you tune will not change the duration of the experiment, but it can change the outcome.
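For example, to tune those two parameters as well, you could replace their static entries in the space dictionary with stochastic expressions. The ranges below are assumptions for illustration; adjust them before using them in a real experiment:
# Hypothetical search ranges -- adjust to taste for your own experiment.
'rpn_smoothl1_rho': hp.uniform('rpn_smoothl1_rho', 0.05, 0.2),
'rcnn_smoothl1_rho': hp.uniform('rcnn_smoothl1_rho', 0.5, 1.5),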
Defining the Experiment
The final piece of the tune.py code is to define a Trials database and actually run the minimization function. Since we want to perform a distributed, parallel hyperparameter tuning experiment, we can use MongoDB to enable hyperopt parallelization. Add the following to the bottom of your tune.py file.
if __name__ == '__main__':
    trials = MongoTrials('mongo://localhost:27017/hyperopt/jobs', exp_key='mxnet_pascal_voc_1')
    best = fmin(
        objective,
        space=space,
        algo=tpe.suggest,
        max_evals=50,
        trials=trials,
        max_queue_len=4
    )
    print(best)
The trials line indicates that we will be using a local MongoDB instance on port 27017 to coordinate the parallel workers. Per the hyperopt documentation, the collection you use must be called jobs. Using the exp_key allows you to run different experiments using the same MongoDB instance.
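For example, a later experiment could reuse the same database by changing only the exp_key (the key name below is a hypothetical example):
# Reuses the same MongoDB database; results are kept separate by exp_key.
trials = MongoTrials('mongo://localhost:27017/hyperopt/jobs', exp_key='mxnet_pascal_voc_2')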
You can modify the algo and max_evals arguments to the fmin function to use a different search pattern or to increase or decrease the duration of the experiment. The max_queue_len must be at least as large as the number of parallel workers you plan to use. If there are more workers than the queue length, the extra workers will not find available workloads and will terminate.
Commit and push your code to the remote repository.
Running the Experiment
Starting Local Resources
In a new terminal window, navigate to the root of the repository and run:
docker run -v $(pwd)/mongodb:/data/db -p 27017:27017 mongo
This will start a local MongoDB instance, saving its data in the mongodb folder from the Local Environment Setup. In another terminal window, from the root of the repository, run:
python tune.py
You should see a progress bar like the following:
0%| | 0/50 [00:00<?, ?trial/s, best loss=?]
no last_id found, re-trying
Hyperopt will wait indefinitely until workers begin to connect to it and process the jobs. Keep this window open and running until the experiment completes.
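If tune.py appears to hang without printing the message above, you can optionally confirm that the MongoDB container is reachable with a quick check using the pymongo package installed earlier. This is a minimal sketch, assuming the default port from the docker command:
from pymongo import MongoClient

# Connects to the local MongoDB container and lists the experiment database's collections.
client = MongoClient("mongodb://localhost:27017/", serverSelectionTimeoutMS=2000)
print(client.server_info()["version"])              # raises an error if the server is unreachable
print(client["hyperopt"].list_collection_names())   # should include "jobs" once trials are queued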
Starting Job Workers
Log in to the trainML platform and click the Create a Training Job link on the Home screen or the Create button on the Training Jobs page to open a new job form. Enter a memorable name for the Job Name, like MXNet Hyperopt Object Detection. Select the RTX 2060 Super GPU Type and leave the GPUs Per Worker as 1.
In the Data section, select Public Dataset from the Dataset Type field, then select PASCAL VOC from the Dataset field. This will automatically load the PASCAL VOC dataset into the /opt/trainml/input directory of each job worker. Since this is a trainML supplied dataset, you will incur no additional storage cost for using this data. Leave Output Type empty. In this case, we will not save any output from the workers themselves because the relevant information will be stored in MongoDB.
In the Model section, specify the URL of your code repository (or of the example code repository, if you are using it).
In the Workers section, select 4 from the Number of Workers field to align with our queue length setting. Since hyperopt manages what each worker does, you can select the Same Command for All Workers option. In the command field, enter
python pascal_voc.py --download-dir $TRAINML_DATA_PATH && hyperopt-mongo-worker --mongo=${TRAINML_CLIENT_IP}:27017/hyperopt --poll-interval=1
The first part of the command ensures that the workers copy the dataset into the directory MXNet is expecting. The second part runs the hyperopt mongo worker and configures it to connect to the MongoDB instance running on your local computer (by using the TRAINML_CLIENT_IP environment variable).
Click Next and Submit to start the job.
Monitoring and Reviewing Results
Connect to the job using the connection capability in a separate terminal window. Once you connect, the hyperopt workers will start and you will begin seeing output like the following:
Worker 1: Starting...
Worker 1: INFO:hyperopt.mongoexp:PROTOCOL mongo
Worker 1: INFO:hyperopt.mongoexp:USERNAME None
Worker 1: INFO:hyperopt.mongoexp:HOSTNAME 10.253.43.253
Worker 1: INFO:hyperopt.mongoexp:PORT 27017
Worker 1: INFO:hyperopt.mongoexp:PATH /hyperopt/jobs
Worker 1: INFO:hyperopt.mongoexp:AUTH DB None
Worker 1: INFO:hyperopt.mongoexp:DB hyperopt
Worker 1: INFO:hyperopt.mongoexp:COLLECTION jobs
[...]
Worker 1: INFO:root:[Epoch 0][Batch 99], Speed: 4.686 samples/sec, RPN_Conf=0.458,RPN_SmoothL1=0.072,RCNN_CrossEntropy=0.849,RCNN_SmoothL1=0.392,RPNAcc=0.911,RPNL1Loss=0.509,RCNNAcc=0.817,RCNNL1Loss=2.439
Worker 1: INFO:root:[Epoch 0][Batch 199], Speed: 4.701 samples/sec, RPN_Conf=0.326,RPN_SmoothL1=0.069,RCNN_CrossEntropy=0.751,RCNN_SmoothL1=0.403,RPNAcc=0.924,RPNL1Loss=0.510,RCNNAcc=0.824,RCNNL1Loss=2.403
If you are running 50 trials, this process will take several hours to complete.
You must keep the trainML connection utility, the python tune.py script, and the MongoDB docker container running for the full duration of the experiment. If any one of the three stops, the workers will no longer be able to report their results or find new workloads, and they will eventually terminate.
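While the experiment is running, you can optionally inspect intermediate results from another Python shell on your local machine by attaching to the same trials database. This is a sketch; it only reads results and does not queue any new jobs:
from hyperopt.mongoexp import MongoTrials

# Attach to the same experiment that tune.py created.
trials = MongoTrials('mongo://localhost:27017/hyperopt/jobs', exp_key='mxnet_pascal_voc_1')
finished = [l for l in trials.losses() if l is not None]  # unfinished trials report None
print(len(trials.trials), "trials recorded,", len(finished), "finished")
if finished:
    print("best loss so far:", min(finished))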
Once all 50 trials are evaluated, the final statement should print out the results of the experiment, for example:
best loss: 0.3945681984212632
{ 'momentum': 0.9068445825679892, 'wd': 0.00012796562296570383 }
Based on these results, you can now proceed with a full training run for your model using the best-performing hyperparameters.
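Note that fmin only returns the values of the parameters that were actually searched. If you want the complete configuration for a follow-up training run, hyperopt's space_eval helper can merge the tuned values back into the space definition. A minimal sketch, intended to run after fmin completes (for example, right after the print(best) line in tune.py):
from hyperopt import space_eval

# Combine the tuned values with the static parameters defined in `space`.
best_config = space_eval(space, best)
print(best_config['momentum'], best_config['wd'])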