Parallel Training Experiments with Notebooks

One of the most popular features of trainML Notebooks is the ability to copy them into new instances with only three clicks. This tutorial walks through an example of how to use the notebook copy feature to spin off training experiments to test different hyperparameters on the same model in parallel.

Create the Source Notebook

This tutorial uses the same data and model code as the Get Started With Notebooks tutorial. Refer to that tutorial for a more detailed walkthrough of creating and using notebooks. If you already have the notebook from that tutorial running, you can skip to the next section.

Navigate to the Notebook Dashboard and click the Create button. Input a memorable name as the job name and select an available GPU Type (the code in this tutorial assumes an RTX 3090). Expand the Data section and click Add Dataset. Select Public Dataset as the dataset type and select ImageNet. Expand the Model section. Keep git selected as the Model Type and specify the tutorial code git repository URL https://github.com/trainML/examples.git in the Model Code Location field to automatically download the tutorial's model code. Click Next to view a summary of the new notebook and click Create to start the notebook.

When the notebook reaches the running state, click the Open button to launch a new window to the Jupyter environment. Navigate to models/notebooks in the file explorer pane and double-click the pytorch-imagenet.ipynb file. Observe the code section with the header Hyperparameters. These are the settings we will experiment with in the next step. From the Run option in the menu bar, select Run All Cells to start training. Scroll down to the bottom of the notebook to see the output from the training loop.

Running a New Experiment

While the source job is training, navigate back to the Notebooks Dashboard. To fork the existing job into a new notebook, select the job from the dashboard and click the Copy button in the dashboard menu. The Copy Notebook dialog will appear. To copy the entire notebook, including any changes made to the model code or files added, select Full (Data and Configuration) as the Copy Type. You have an opportunity to change the GPU Type, GPU Count, or Disk Size before copying, but this is not necessary for this tutorial. Give the notebook a memorable name that reflects the experiment you will run with this notebook (e.g. suffix the job name with "vgg16") and click Copy.

Once the copy process completes, the notebook will automatically start. Note that during the copy process, the source notebook continued running and the training process was not interrupted. Navigate to models/notebooks in the file explorer pane. If the source notebook finished at least one epoch prior to the copy, you will see the checkpoint files in the directory. Open the same notebook (pytorch-imagenet.ipynb) and scroll down to the Hyperparameters section. This model accepts architecture as a hyperparameter, so change the arch variable to vgg16. Because this architecture requires considerably more GPU memory than the default resnet18, lower the batch size (256 is a good number for the RTX 3090).
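The hyperparameter change described above can be sketched as a small override on top of the source notebook's settings. This is only an illustration: the Hyperparameters class, the describe_changes helper, and the baseline batch size of 512 are assumptions made for this example, and the variable names in the actual notebook may differ. Only the arch values (resnet18, vgg16) and the batch size of 256 come from this tutorial.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Hyperparameters:
    """Hypothetical container for the settings edited in the notebook."""
    arch: str
    batch_size: int


# Settings in the source notebook. "resnet18" is the default named in this
# tutorial; the batch size of 512 is a placeholder assumption.
baseline = Hyperparameters(arch="resnet18", batch_size=512)

# Settings for the copied notebook: switch the architecture and lower the
# batch size, as suggested for the RTX 3090.
vgg16_experiment = replace(baseline, arch="vgg16", batch_size=256)


def describe_changes(base, experiment):
    """Return {field: (old, new)} for every field that differs."""
    changes = {}
    for field in fields(base):
        old = getattr(base, field.name)
        new = getattr(experiment, field.name)
        if old != new:
            changes[field.name] = (old, new)
    return changes


print(describe_changes(baseline, vgg16_experiment))
```

Recording what differs between the source notebook and each copy makes it easier to attribute differences in training results to a specific change.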

tip

If you get CUDA Out Of Memory errors, your batch size is too high. When adjusting the batch size inside a running notebook, be sure to restart the kernel after each trial to free the GPU memory from the previous attempt. The easiest way to do this is to select Restart Kernel and Run All Cells... from the Kernel menu.
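If the suggested batch size still does not fit, one simple way to choose the next value to try is to halve it on each attempt. The helper below is a minimal sketch of that idea. The halving strategy and the function name batch_size_candidates are assumptions for illustration, not something the tutorial code provides. It only lists candidate values; you would still set each one by hand and restart the kernel between trials.

```python
def batch_size_candidates(start, minimum=1):
    """Yield start, start // 2, start // 4, ... while the value >= minimum."""
    if minimum < 1:
        raise ValueError("minimum must be at least 1")
    if start < minimum:
        raise ValueError("start must be at least minimum")
    size = start
    while size >= minimum:
        yield size
        size //= 2


# Candidate batch sizes to try in order, one per kernel restart.
for size in batch_size_candidates(256, minimum=32):
    print(size)
```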

Once you are done editing the hyperparameters, run all cells to start training. Now you can observe the training progress of the vgg16 model and the resnet18 model in parallel.

Next Steps

You can repeat the process as many times as needed to try combinations of architectures or other hyperparameters. Once you are done, be sure to stop the notebooks to stop billing. The next tutorials focus on getting familiar with trainML Training Jobs.
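When trying several combinations, it can help to plan the copies and their names before creating them. The sketch below enumerates a small grid and builds a notebook name for each combination, following the earlier suggestion to suffix the job name with the experiment. The plan_experiments function, the naming scheme, and the example values are assumptions for illustration. Each notebook copy is still created through the Copy dialog described above.

```python
from itertools import product


def plan_experiments(base_name, archs, batch_sizes):
    """Return one planned notebook copy per (arch, batch_size) pair."""
    return [
        {
            "name": f"{base_name}-{arch}-bs{batch_size}",
            "arch": arch,
            "batch_size": batch_size,
        }
        for arch, batch_size in product(archs, batch_sizes)
    ]


# Hypothetical grid: two architectures and two batch sizes.
plan = plan_experiments("imagenet", ["resnet18", "vgg16"], [128, 256])
for experiment in plan:
    print(experiment["name"])
```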