Running a Notebook Job

Notebook jobs are ideal for the initial trial runs of a model architecture, verifying your code runs without error, or for getting familiar with the trainML platform.

Starting a Job

Click the Start a Notebook link on the Home screen or the Create button on the Notebooks page to open a new job form, then configure the job:

- Enter a job name that will uniquely identify this job for you.
- Select the type of GPU you would like to use by clicking an available GPU card.
- Select how many GPUs you want attached to this notebook instance in the GPU Count field. A maximum of 4 GPUs per notebook instance is allowed. If any options in these fields are disabled, there are not enough GPUs of that type available to satisfy your request.
- Specify the amount of disk space to allocate for this job's working directory in the Disk Size field. Be sure to allocate enough space to complete the job, as this allocation cannot be changed once the job is created.

Additional job configuration options are available but are not required. Review the options here, configure them as desired, and click Next.

The job review page confirms the settings you have provided and displays the performance characteristics of the selected GPU type. You must have at least one hour's worth of credits to start a job. When you start the job, one hour's worth of credits is automatically deducted from your account. If you stop the job before an hour elapses, the credits for the unused time will be refunded.
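To make the proration above concrete, here is a minimal sketch of the refund arithmetic. The credit rate and per-minute granularity are illustrative assumptions, not published trainML billing rules:

```python
# Hypothetical illustration of prorated billing: one hour's worth of
# credits is deducted up front, and credits for unused time are
# refunded on stop. Per-minute granularity is an assumption.
def refund_on_stop(hourly_rate: float, minutes_used: int) -> float:
    """Credits refunded if the job is stopped before the hour elapses."""
    minutes_billed = min(minutes_used, 60)  # never refund beyond the first hour
    return hourly_rate * (60 - minutes_billed) / 60

# e.g. stopping a 1.0-credit/hour job after 15 minutes
print(refund_on_stop(1.0, 15))  # -> 0.75
```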

Using the Notebook

Once a job successfully starts, the dashboard should indicate that the job is in the running state. Click the Open button to access the notebook. The JupyterLab interface will open in a new tab. If you did not already download or attach a dataset or model code, start a terminal window to download code from a git repository or data from other cloud storage providers. You can also view the GPU status by running nvidia-smi in a terminal window.

tip

Uploading or downloading large data files through the Jupyter file upload and download features is not recommended, as bandwidth is rate limited. Instead, use the Jupyter terminal to transfer content directly to and from git or other cloud storage providers.

Unlike some other cloud services, the notebook server will continue to run if you disconnect from it, and will not terminate until you stop the job or shut down the notebook server from its File menu. For more information on using a Jupyter notebook, refer to the project documentation.

caution

If you are editing code directly in a notebook instance, be sure to download the code to your local computer or upload it to a code repository regularly. If a job fails, the local storage of the notebook server may not be recoverable.

If you disconnect while training is running inside a notebook cell, the output generated during the disconnection will not be captured when you reconnect. This is a limitation of Jupyter notebooks. To ensure you retain access to the training log output and can continue monitoring progress after a disconnection, we recommend writing the log messages to a file in addition to the console. See this issue and that issue for additional details.
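One way to keep log output alongside the console is Python's standard logging module with two handlers. A minimal sketch — the logger name and the train.log file name are arbitrary examples, not trainML conventions:

```python
import logging

# Send log records to both the console and a file so training output
# survives a Jupyter disconnection ("train.log" is an example name).
logger = logging.getLogger("train")
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())           # console output
logger.addHandler(logging.FileHandler("train.log"))  # persistent copy

for epoch in range(3):
    logger.info("epoch %d complete", epoch)
```

After reconnecting, the full history is still available in the file, for example with tail -f train.log in a Jupyter terminal.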

Stopping, Restarting, and Terminating a Job

When you are done actively using the instance, you can stop the job by either clicking Stop on the trainML dashboard or shutting down the notebook server from its File menu. This stops billing for the job.

Warning

Closing the notebook window does not stop the job; it only disconnects you from the notebook server. You will be billed for a running job even if you are not connected to it.

Stopped jobs can be restarted, and will retain any modifications to the environment made in previous sessions. If you want to restart a job, click the Restart button.

caution

Stopped jobs may be automatically purged after two weeks of non-use. Be sure to save your work before stopping a job.

If you are finished with a job, click the Terminate button. This will purge the job environment and all its data. If a job fails, it cannot be restarted, only terminated.