Parallel Training Experiments with Notebooks
One of the most popular features of trainML Notebooks is the ability to copy them into new instances with only three clicks. This tutorial walks through an example of how to use the notebook copy feature to spin off training experiments to test different hyperparameters on the same model in parallel.
Create the Source Notebook
This tutorial uses the same data and model code as the Get Started With Notebooks tutorial. Refer to that tutorial for a more detailed walkthrough of creating and using notebooks. If you already have the notebook from that tutorial running, you can skip to the next section.
Navigate to the Notebooks Dashboard and click the
Create button. Enter a memorable name for the job and select an available GPU Type (the code in this tutorial assumes an
RTX 3090). Expand the
Data section and click
Add Dataset. Select
Public Dataset as the dataset type and select
ImageNet. Expand the
Model section. Keep
git selected as the
Model Type and specify the tutorial code git repository url
https://github.com/trainML/examples.git in the
Model Code Location field to automatically download the tutorial's model code. Click
Next to view a summary of the new notebook and click
Create to start the notebook.
When the notebook reaches the
running state, click the
Open button to launch a new window into the Jupyter environment. Navigate to the
models/notebooks directory in the file explorer pane and double-click the
pytorch-imagenet.ipynb file. Observe the code section with the header
Hyperparameters. These are the settings we will experiment with in the next step. From the
Run option in the menu bar, select
Run All Cells to start training. Scroll down to the bottom of the notebook to see the output from the training loop.
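The cell under that header is a handful of plain Python variables, which is what makes the copy-and-edit workflow below so convenient. A minimal sketch of what such a cell might look like (the variable names, defaults, and the `run_name` helper are assumptions for illustration, not the notebook's exact code):

```python
# Hypothetical hyperparameter cell; the actual names and defaults in
# pytorch-imagenet.ipynb may differ.
arch = "resnet18"   # model architecture; each notebook copy changes only this
batch_size = 256    # images per training step
lr = 0.1            # initial learning rate
epochs = 10         # full passes over the dataset

def run_name(arch, batch_size, lr):
    """Tag checkpoints and logs so parallel experiments stay distinguishable."""
    return f"{arch}-bs{batch_size}-lr{lr}"

print(run_name(arch, batch_size, lr))  # resnet18-bs256-lr0.1
```

Keeping every tunable value in one cell means each forked experiment needs only a one-line edit before rerunning.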
Running a New Experiment
While the source job is training, navigate back to the Notebooks Dashboard. To fork the existing job into a new notebook, select the job from the dashboard and click the
Copy button in the dashboard menu. The Copy Notebook dialog will appear. To copy the entire notebook, including any changes made to the model code or files added, select
Full (Data and Configuration) as the
Copy Type. You have an opportunity to change the
GPU Count, or
Disk Size before copying, but this is not necessary for this tutorial. Give the notebook a memorable name that reflects the experiment you will run with this notebook (e.g., suffix the job name with "vgg16") and click Copy.
Once the copy process completes the notebook will automatically start. Note that during the copy process, the source notebook continued running and the training process was not interrupted. Navigate to the
models/notebooks directory in the file explorer pane. If the source notebook finished at least one epoch prior to the copy, you will see the checkpoint files in the directory. Open the same notebook (
pytorch-imagenet.ipynb) and scroll down to the
Hyperparameters section. This model accepts architecture as a hyperparameter, so change the
arch variable to
vgg16. This architecture requires considerably more GPU memory than the default
resnet18, so lower the batch size (
256 is a good number for the RTX 3090).
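To see why the batch size must drop, compare the two architectures' rough sizes: resnet18 has about 11.7M parameters, while vgg16 has about 138M. A back-of-the-envelope sketch of the weight memory alone (activation memory, which scales with batch size, typically dominates and widens the gap further):

```python
def param_memory_mb(num_params, bytes_per_param=4):
    """Approximate fp32 weight memory in MB (ignores activations and gradients)."""
    return num_params * bytes_per_param / 2**20

# Approximate torchvision parameter counts
print(f"resnet18: {param_memory_mb(11_700_000):.0f} MB")   # ~45 MB
print(f"vgg16:    {param_memory_mb(138_000_000):.0f} MB")  # ~526 MB
```

Roughly a 12x difference before the optimizer state and per-image activations are even counted, which is why the same batch size that fits resnet18 overflows the GPU with vgg16.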
If you get CUDA Out of Memory errors, your batch size is too high. When adjusting the batch size inside a running notebook, be sure to restart the kernel after each trial to free the GPU memory from the previous attempt. The easiest way to do this is to open the
Kernel menu and select
Restart Kernel and Run All Cells....
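In a script (outside a notebook, where there is no kernel to restart between attempts) this trial-and-error can be automated. A hedged sketch; `train_fn` is a hypothetical callable standing in for one full training attempt, not part of the tutorial code:

```python
def train_with_backoff(train_fn, batch_size, min_batch=16):
    """Halve the batch size and retry whenever training hits a CUDA OOM error.

    train_fn(batch_size) is a hypothetical callable that runs training and
    raises a RuntimeError containing "out of memory" when the batch doesn't fit.
    """
    while batch_size >= min_batch:
        try:
            return train_fn(batch_size)
        except RuntimeError as e:
            if "out of memory" not in str(e):
                raise  # unrelated failure; don't mask it
            batch_size //= 2  # halve and retry
    raise RuntimeError(f"no batch size >= {min_batch} fit in GPU memory")

# Simulated run: pretend anything above 256 overflows GPU memory
def fake_train(bs):
    if bs > 256:
        raise RuntimeError("CUDA out of memory")
    return bs

print(train_with_backoff(fake_train, 1024))  # 256
```

Inside a notebook, restarting the kernel is still the reliable option, since allocations leaked by the failed attempt persist in the process until it exits.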
Once you are done editing the hyperparameters, run all cells to start training. Now you can observe the training progress of the
vgg16 model and the
resnet18 model in parallel.
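Because each copy is fully independent, each experiment also accumulates its own checkpoint files. A small sketch for listing them from a notebook cell (the `*.pth` / `*.pth.tar` patterns are an assumption about how the tutorial code names its checkpoints, based on common PyTorch conventions):

```python
from pathlib import Path

def find_checkpoints(directory):
    """List checkpoint filenames in a directory; returns [] if none exist yet."""
    d = Path(directory)
    if not d.is_dir():
        return []
    # *.pth and *.pth.tar are assumed naming conventions
    return sorted(p.name for p in d.glob("*.pth")) + \
           sorted(p.name for p in d.glob("*.pth.tar"))

# In either copied notebook this would point at models/notebooks
print(find_checkpoints("models/notebooks"))
```

Comparing each copy's checkpoint list is a quick way to confirm both experiments are making progress.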
You can repeat the process as many times as needed to try combinations of architectures or other hyperparameters. Once you are done, be sure to stop the notebooks to stop billing. The next tutorials focus on getting familiar with trainML Training Jobs.