Get Started With Notebooks
The easiest way to start a new project on the trainML platform is with a trainML Notebook. This tutorial walks through the process of starting your first notebook and using it to train a PyTorch model on ImageNet.
Creating the Notebook
Navigate to the Notebooks Dashboard and click the Create button. Enter a memorable job name and select an available GPU Type (the code in this tutorial assumes an RTX 3090). Expand the Data section and click Add Dataset. Select Public Dataset as the dataset type and select ImageNet. Expand the Model section. Keep git selected as the Model Type and enter the tutorial code repository URL, https://github.com/trainML/examples.git, in the Model Code Location field to automatically download the tutorial's model code. Click Next to view a summary of the new notebook, then click Create to start the notebook.
Using the Notebook
When the notebook reaches the running state, the Open button will appear. Click Open to open the Jupyter Lab environment. The file explorer in the left pane shows two folders, input and models. The input folder contains the ImageNet dataset you selected when creating the notebook. The models folder contains the contents of the git repository you specified when creating the notebook.
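To confirm from code that both folders were populated, a minimal sketch such as the following may help. The relative paths input and models are assumptions based on how the folders appear in the file explorer; adjust them to match where the folders actually sit relative to your working directory. The helper name list_top_level is hypothetical.

```python
from pathlib import Path


def list_top_level(folder):
    """Return the sorted names of entries directly inside folder.

    Directories get a trailing "/". A missing folder yields an empty list.
    """
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        entry.name + ("/" if entry.is_dir() else "") for entry in root.iterdir()
    )


if __name__ == "__main__":
    # Assumed locations; change these if the folders live elsewhere.
    for folder in ("input", "models"):
        print(folder, "->", list_top_level(folder))
```

An empty list for either folder suggests the path is wrong or the dataset or repository was not attached when the notebook was created.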
In the right pane, Jupyter Lab displays the Launcher window, which lets you create new notebooks and files, open a terminal, and so on. Click Terminal in the Other section to open a terminal window. The default working directory when opening a terminal is the models folder. If you run ls, you will see the root file structure of the git repository. Run nvidia-smi to see details about the attached GPU. For more information on using a Jupyter Notebook, refer to the project documentation.
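The same GPU check can be made from a notebook cell instead of the terminal. The sketch below is one possible approach, not part of the tutorial code: it calls nvidia-smi with its standard query flags and returns an empty list if the tool is unavailable or fails. The function name gpu_summary is hypothetical.

```python
import shutil
import subprocess


def gpu_summary():
    """Return one "name, total memory" line per visible GPU, or [] if none can be queried."""
    if shutil.which("nvidia-smi") is None:
        # nvidia-smi is not on the PATH, so there is nothing to query.
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        # The tool exists but could not talk to a driver or device.
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = gpu_summary()
    print(gpus if gpus else "No GPU reported by nvidia-smi")
```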
To start training the example model, open the example notebook by navigating to models/notebooks in the file explorer pane and double-clicking the pytorch-imagenet.ipynb file. This will open the notebook in the right pane. This notebook is an adapted version of the ImageNet training script provided in the PyTorch Examples repository. Scroll down to the second code section with the header Hyperparameters. Here you can see the default settings for training and modify them as needed.
From the Run option in the menu bar, select Run All Cells to start training. Scroll down to the bottom of the notebook to see the output from the training loop. Training will run for approximately 3 hours on an RTX 3090. This time can be shortened or lengthened by modifying the epochs hyperparameter.
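As a rough guide to how changing epochs affects run time, the sketch below scales the 3-hour figure linearly. It assumes every epoch takes about the same time, which ignores startup and data-caching effects. The notebook's default epoch count is not stated in this tutorial, so baseline_epochs must be read from the Hyperparameters cell; the value 10 in the example call is only a placeholder, and the function name is hypothetical.

```python
def estimated_hours(epochs, baseline_epochs, baseline_hours=3.0):
    """Estimate training time, assuming run time grows linearly with epochs.

    baseline_epochs is the notebook's default epoch count (read it from the
    Hyperparameters cell); baseline_hours is the approximate time quoted for
    an RTX 3090 at that default.
    """
    if baseline_epochs <= 0:
        raise ValueError("baseline_epochs must be positive")
    if epochs < 0:
        raise ValueError("epochs must not be negative")
    return baseline_hours * epochs / baseline_epochs


if __name__ == "__main__":
    # Placeholder default of 10 epochs; replace with the notebook's actual value.
    print(f"{estimated_hours(epochs=5, baseline_epochs=10):.1f} hours")
```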
If you are planning to continue with the Parallel Training Experiments with Notebooks tutorial, you can reuse this notebook while it is running.
Stopping and Terminating the Notebook
When training is complete, you will see the training artifacts (checkpoint.pth.tar and model_best.pth.tar) in the same folder as the notebook. To download these files to your local computer, right-click them and select Download from the menu. Once you're finished, stop the job by either clicking Stop from the trainML dashboard or shutting down the notebook server from its File menu. This will stop billing.
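Before stopping the job, it can be worth confirming that both artifacts were actually written. The following is a minimal sketch using only the two file names mentioned above; the helper name artifact_sizes is hypothetical, and the folder argument assumes the code runs from the same folder as the notebook.

```python
from pathlib import Path

# Artifact names as given in this tutorial.
ARTIFACTS = ("checkpoint.pth.tar", "model_best.pth.tar")


def artifact_sizes(folder=".", names=ARTIFACTS):
    """Map each artifact name to its size in bytes, or None if the file is missing."""
    root = Path(folder)
    sizes = {}
    for name in names:
        path = root / name
        sizes[name] = path.stat().st_size if path.is_file() else None
    return sizes


if __name__ == "__main__":
    for name, size in artifact_sizes().items():
        print(name, "missing" if size is None else f"{size} bytes")
```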
Closing the notebook window does not stop the notebook; it only disconnects you from the notebook server. You will be billed for a running notebook even if you are not connected to it.
Stopped notebooks can be restarted and will retain any modifications made to the environment in previous sessions. To restart a notebook, click the Restart button.
Stopped notebooks may be automatically purged after two weeks of non-use. Be sure to save your work before stopping a notebook.
When you are completely done with the tutorial, click the Terminate button. This will purge the notebook environment and all its data.