
Hyperparameter Tuning an Object Detection Model with MXNet on PASCAL VOC

This tutorial should cost less than 18 credits ($18) if you use the GTX 1060 or RTX 2060 Super instance type and the same settings as the guide. Tuning takes approximately 15 hours using 4 RTX 2060 Super GPUs or 30 hours using 4 GTX 1060 GPUs.

info

Some parts of this tutorial use advanced features that require the connection capability. Ensure that your environment meets the requirements and that you have installed all the prerequisites.

This tutorial uses the MXNet Faster-RCNN object detection example to perform a parallelized hyperparameter tuning job on PASCAL VOC using the hyperopt library.

Before beginning this tutorial, ensure you have created an account on the trainML platform.

Local Environment Setup

tip

If you want to minimize the steps in this tutorial, you can clone our example code and skip to Running the Experiment.

To run this tutorial from scratch, create an empty git repository and a new Python environment (version 3.7 or later). Install the project dependencies:

pip install mxnet gluoncv hyperopt pymongo

or

pip install -r requirements.txt

if you are using our example code.
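
If you are creating the repository from scratch, the requirements file only needs the same four packages installed above; a minimal requirements.txt might look like this (version pins are up to you):

mxnet
gluoncv
hyperopt
pymongo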

If you are not using our code repository, add the mongodb data directory (used later by the local MongoDB container) to your .gitignore file:

/bin/bash -c 'cat <<EOF >> .gitignore
mongodb/
EOF'

Hyperparameter Search Specification

tip

If you are using our example code, skip to Running the Experiment.

Download version 0.7.0 of the MXNet Faster-RCNN object detection example to the root of your code repository:

wget https://raw.githubusercontent.com/dmlc/gluon-cv/v0.7.0/scripts/detection/faster_rcnn/train_faster_rcnn.py

This file is our model code and exposes a variety of hyperparameters and configurable variables we can use in the tuning process. Only one change to this file is necessary. Replace lines 466-470:

            if args.amp:
                with amp.scale_loss(total_loss, self._optimizer) as scaled_losses:
                    autograd.backward(scaled_losses)
            else:
                total_loss.backward()

with

            total_loss.backward()

Create a new file called tune.py in the root of the directory. Add the following imports at the top:

import pickle
import time
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from hyperopt.mongoexp import MongoTrials
from types import SimpleNamespace
import sys
import mxnet as mx
from math import log
from gluoncv import utils as gutils
from gluoncv.model_zoo import get_model
from gluoncv.data.transforms.presets.rcnn import (
    FasterRCNNDefaultTrainTransform,
    FasterRCNNDefaultValTransform,
)
from train_faster_rcnn import get_dataset, get_dataloader, train

Objective Function

As per the hyperopt documentation, the first thing we need to do is define the objective function. The train_faster_rcnn.py script was designed to be run directly from the command line using command line arguments, so most of the initialization logic is stored in the if __name__ == '__main__': block. Since hyperopt requires the training process to be fully contained inside a callable function capable of returning a loss metric, most of this logic must be replicated in our objective function.

Define the objective function as the following:

def objective(space):

    sys.setrecursionlimit(1100)
    args = SimpleNamespace(**space)
    # fix seed for mxnet, numpy and python builtin random generator.
    gutils.random.seed(args.seed)

    ctx = [mx.gpu(int(i)) for i in args.gpus.split(",") if i.strip()]
    ctx = ctx if ctx else [mx.cpu()]

    # training data
    train_dataset, val_dataset, eval_metric = get_dataset(args.dataset, args)

    # network
    kwargs = {}
    module_list = []
    if args.use_fpn:
        module_list.append("fpn")
    if args.norm_layer is not None:
        module_list.append(args.norm_layer)
        if args.norm_layer == "syncbn":
            kwargs["num_devices"] = len(ctx)

    num_gpus = len(ctx)
    net_name = "_".join(("faster_rcnn", *module_list, args.network, args.dataset))

    net = get_model(
        net_name,
        pretrained_base=True,
        per_device_batch_size=args.batch_size // num_gpus,
        **kwargs
    )
    args.save_prefix += net_name

    for param in net.collect_params().values():
        if param._data is not None:
            continue
        param.initialize()
    net.collect_params().reset_ctx(ctx)
    batch_size = args.batch_size
    train_data, val_data = get_dataloader(
        net,
        train_dataset,
        val_dataset,
        FasterRCNNDefaultTrainTransform,
        FasterRCNNDefaultValTransform,
        batch_size,
        len(ctx),
        args,
    )

    # training
    train(net, train_data, val_data, eval_metric, batch_size, ctx, args)
    name, values = eval_metric.get()
    idx = name.index('mAP')

    return {
        "loss": 1 - values[idx],
        "status": STATUS_OK,
        "eval_time": time.time(),
    }

Almost all the lines from the sys.setrecursionlimit(1100) to train(net, train_data, val_data, eval_metric, batch_size, ctx, args) are directly copied from the train_faster_rcnn.py script. Some lines have been removed because that functionality will not be used in this example. One notable change is that the second line args = parse_args() has been changed to args = SimpleNamespace(**space). This enables us to substitute a dictionary of options for the command line arguments the script is expecting.
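
To make the substitution concrete, here is a minimal sketch (with made-up keys) showing that SimpleNamespace exposes dictionary entries as attributes, just like the argparse.Namespace that parse_args() would return:

from types import SimpleNamespace

options = {"batch_size": 2, "lr": 0.001}
args = SimpleNamespace(**options)
print(args.batch_size)  # prints 2 -- attribute access, like an argparse.Namespace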

The train_faster_rcnn.py script does not return any statistics about the model's performance directly, which hyperopt requires to perform the tuning. Instead, you can access any configured metrics through the eval_metric object. In this example, the object contains the average precision for each class as well as the Mean Average Precision (mAP). The lines following the train call obtain the mAP metric and set the training run's loss at one minus the metric. Hyperopt is a minimizer, so we want it to minimize the model's imprecision.
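
As a hedged illustration of that calculation (the names and numbers below are placeholders, not real results), the metric returns parallel lists of names and values, with the per-class average precisions followed by the overall mAP:

# Placeholder values purely for illustration -- not actual results.
name = ['aeroplane', 'bicycle', 'tvmonitor', 'mAP']
values = [0.71, 0.65, 0.68, 0.62]
loss = 1 - values[name.index('mAP')]
print(loss)  # ~0.38 -- the value hyperopt will try to minimize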

Defining the Search Space

The next step is to provide a list of hyperparameters for hyperopt to optimize over. In this case, since we are overriding the command line arguments the script is expecting, we have to include a long list of static parameters. Add the following to your tune.py file.

space = {
    'network': 'resnet50_v1b',
    'dataset': 'voc',
    'save_prefix': '',
    'horovod': False,
    'amp': False,
    'resume': False,
    'start_epoch': 0,
    'verbose': False,
    'custom_model': False,
    'kv_store': 'nccl',
    'log_interval': 100,
    'save_interval': 1,
    'val_interval': 1,
    'disable_hybridization': False,
    'static_alloc': False,
    'seed': 233,
    'mixup': False,
    'norm_layer': None,
    'use_fpn': False,
    'num_workers': 4,
    'gpus': '0',
    'executor_threads': 1,
    'epochs': 1,
    'batch_size': 2,
    'lr': 0.001,
    'lr_decay': 0.1,
    'lr_decay_epoch': '14,20',
    'lr_warmup': -1,
    'lr_warmup_factor': 1. / 3.,
    'momentum': hp.uniform('momentum', 0, 1),
    'wd': hp.loguniform('wd', log(1e-5), log(100)),
    'rpn_smoothl1_rho': 1. / 9.,
    'rcnn_smoothl1_rho': 1.,
}

info

If you are using a GTX 1060, change batch_size to 1.

With the above example, only the momentum and wd parameters are being included in the hyperparameter tuning by defining them as hyperopt stochastic expressions. You can define additional parameters like rpn_smoothl1_rho or rcnn_smoothl1_rho similarly. The number of hyperparameters you tune will not change the duration of the experiment, but can change the outcome.
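
For example, to also tune the smooth-L1 parameters, you could replace their static values with stochastic expressions; the ranges below are illustrative assumptions only, applied to the space dictionary defined above:

# Illustrative ranges only -- adjust to your needs.
space['rpn_smoothl1_rho'] = hp.uniform('rpn_smoothl1_rho', 1. / 15., 1. / 5.)
space['rcnn_smoothl1_rho'] = hp.uniform('rcnn_smoothl1_rho', 0.5, 2.)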

Defining the Experiment

The final piece of the tune.py code is to define a Trials database and actually run the minimization function. Since we want to perform a distributed, parallel hyperparameter tuning experiment, we can use MongoDB to enable hyperopt parallelization.

Add the following to the bottom of your tune.py file.

if __name__ == '__main__':
    trials = MongoTrials('mongo://localhost:27017/hyperopt/jobs', exp_key='mxnet_pascal_voc_1')
    best = fmin(
        objective,
        space=space,
        algo=tpe.suggest,
        max_evals=50,
        trials=trials,
        max_queue_len=4
    )

    print(best)

The trials line indicates that we will be using a local MongoDB instance on port 27017 to coordinate the parallel workers. Per the hyperopt documentation, the collection you use must be called jobs. Using the exp_key allows you to run different experiments using the same MongoDB instance.

You can modify the algo and max_evals arguments to the fmin function to use a different search pattern or to increase or decrease the duration of the experiment. The max_queue_len must be at least as large as the number of parallel workers you plan to use. If there are more workers than the queue length, the extra workers will not find available workloads and will terminate.
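
Because MongoTrials shares the Trials interface, you can also inspect results during or after the run from a separate Python session. A minimal sketch, assuming the MongoDB container from Starting Local Resources is reachable on localhost:

from hyperopt.mongoexp import MongoTrials

trials = MongoTrials('mongo://localhost:27017/hyperopt/jobs', exp_key='mxnet_pascal_voc_1')
print(len(trials.trials))           # number of trials recorded so far
print(trials.best_trial['result'])  # result dict of the best completed trial (raises if none have finished)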

Commit and push your code to the remote repository.

Running the Experiment

Starting Local Resources

In a new terminal window, navigate to the root of the repository and run:

docker run -v $(pwd)/mongodb:/data/db -p 27017:27017 mongo

This will start a local MongoDB instance, saving its data in the mongodb folder that you added to .gitignore during the Local Environment Setup. In another terminal window at the root of the repository, run:

python tune.py

You should see a progress bar like the following:

  0%|                                    | 0/50 [00:00<?, ?trial/s, best loss=?
no last_id found, re-trying

Hyperopt will wait indefinitely until workers begin to connect to it and process the jobs. Keep this window open and running until the experiment completes.
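
If tune.py cannot connect to MongoDB at startup, you can verify that the container is reachable with a quick pymongo check (a sketch using the pymongo package installed during the Local Environment Setup):

from pymongo import MongoClient

client = MongoClient('localhost', 27017, serverSelectionTimeoutMS=2000)
print(client.server_info()['version'])  # raises ServerSelectionTimeoutError if the container is not running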

Starting Job Workers

Login to the trainML platform and click the Create a Training Job link on the Home screen or the Create button from the Training Jobs page to open a new job form. Enter a memorable name for the Job Name like MXNet Hyperopt Object Detection. Select the RTX 2060 Super GPU Type and leave the GPUs Per Worker as 1.

In the Data section, select Public Dataset from the Dataset Type field, then select PASCAL VOC from the Dataset field. This will automatically load the PASCAL VOC dataset into the /opt/trainml/input directory of each job worker. Since this is a trainML supplied dataset, you will incur no additional storage cost for using this data. Leave Output Type empty. In this case, we will not save any output from the workers themselves because the relevant information will be stored in MongoDB.

In the Model section, specify the URL of your code repository from the Local Environment Setup, or of our example code repository if you skipped those steps.

In the Workers section, select 4 from the Number of Workers field to align with our queue length setting. Since hyperopt manages what each worker does, you can specify the Same Command for All Workers option. In the command field, enter

python pascal_voc.py --download-dir $TRAINML_DATA_PATH && hyperopt-mongo-worker --mongo=${TRAINML_CLIENT_IP}:27017/hyperopt --poll-interval=1

The first part of the command ensures that the workers copy the dataset into the directory MXNet is expecting. The second part runs the hyperopt mongo worker and configures it to pull jobs from the MongoDB instance running on your local computer (by using the TRAINML_CLIENT_IP environment variable).

Click Next and Submit to start the job.

Monitoring and Reviewing Results

Connect to the job using the connection capability in a separate terminal window. Once you connect, the hyperopt workers will start and you will begin seeing output like the following:

Worker 1: Starting...
Worker 1: INFO:hyperopt.mongoexp:PROTOCOL mongo
Worker 1: INFO:hyperopt.mongoexp:USERNAME None
Worker 1: INFO:hyperopt.mongoexp:HOSTNAME 10.253.43.253
Worker 1: INFO:hyperopt.mongoexp:PORT 27017
Worker 1: INFO:hyperopt.mongoexp:PATH /hyperopt/jobs
Worker 1: INFO:hyperopt.mongoexp:AUTH DB None
Worker 1: INFO:hyperopt.mongoexp:DB hyperopt
Worker 1: INFO:hyperopt.mongoexp:COLLECTION jobs
[...]
Worker 1: INFO:root:[Epoch 0][Batch 99], Speed: 4.686 samples/sec, RPN_Conf=0.458,RPN_SmoothL1=0.072,RCNN_CrossEntropy=0.849,RCNN_SmoothL1=0.392,RPNAcc=0.911,RPNL1Loss=0.509,RCNNAcc=0.817,RCNNL1Loss=2.439
Worker 1: INFO:root:[Epoch 0][Batch 199], Speed: 4.701 samples/sec, RPN_Conf=0.326,RPN_SmoothL1=0.069,RCNN_CrossEntropy=0.751,RCNN_SmoothL1=0.403,RPNAcc=0.924,RPNL1Loss=0.510,RCNNAcc=0.824,RCNNL1Loss=2.403

If you are running 50 trials, this process will take several hours to complete.

Warning

You must keep the trainML connection utility, the python tune.py script, and the MongoDB docker container running for the full duration of the experiment. If any one of the three stops, the workers will no longer be able to report their results or find new workloads and will eventually terminate.

Once all 50 trials are evaluated, the final statement should print out the results of the experiment, for example:

best loss: 0.3945681984212632
{ 'momentum': 0.9068445825679892, 'wd': 0.00012796562296570383 }

Based on these results, you can now proceed with a full training run using these hyperparameters for your model.
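
One way to do that, sketched below, is to merge the returned values back into the static parameters and run a single, longer training pass; the epoch count is an assumption for illustration, and space, objective, and best are the objects defined earlier in tune.py:

# best is the dict returned by fmin, e.g. {'momentum': 0.9068..., 'wd': 0.000127...}
final_space = {**space, **best, 'epochs': 20}
result = objective(final_space)
print('final loss (1 - mAP):', result['loss'])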