NVIDIA NGC Catalog Integration

trainML is making it even easier to run any GPU-enabled workload by allowing customers to use job images directly from NVIDIA's NGC Catalog.

NVIDIA's NGC Catalog provides enterprise-grade container images, including pre-trained models and industry-specific software packages. Once you configure your NGC API key as a trainML Third-Party Key, you can specify an NGC container image as the basis for any job type using the customer provided environment field on the job specification.

How It Works

Create an API key for your NGC account that has access to the images you wish to use. Once you have the API key, go to the trainML third-party key configuration page and select NVIDIA NGC from the Add menu under Third-Party Keys. Enter the API key in the NGC API Key field and click the check button to save it.

Go back to the NGC Catalog and find the pull command of the container you wish to run. For example, to run a specific version of the RAPIDS container, search the tags and copy the pull command, e.g. docker pull nvcr.io/nvidia/rapidsai/rapidsai:22.02-cuda11.4-runtime-ubuntu20.04.
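Only the image name from the pull command is needed in the next step, not the docker pull prefix. As a minimal sketch of that extraction (the function names here are hypothetical helpers for illustration, not part of any trainML or NGC tooling), the pull command can be reduced to the image name and split into registry, repository, and tag:

```python
from typing import Tuple


def image_from_pull_command(command: str) -> str:
    """Return the image reference from a `docker pull <image>` command."""
    parts = command.strip().split()
    if len(parts) != 3 or parts[:2] != ["docker", "pull"]:
        raise ValueError(f"not a docker pull command: {command!r}")
    return parts[2]


def split_image_reference(image: str) -> Tuple[str, str, str]:
    """Split `registry/repository:tag` into (registry, repository, tag).

    Assumes the reference carries an explicit registry and tag, as the
    pull commands copied from the NGC Catalog tag list do.
    """
    registry, _, remainder = image.partition("/")
    repository, _, tag = remainder.rpartition(":")
    if not registry or not repository or not tag:
        raise ValueError(f"expected registry/repository:tag, got {image!r}")
    return registry, repository, tag


pull_command = (
    "docker pull "
    "nvcr.io/nvidia/rapidsai/rapidsai:22.02-cuda11.4-runtime-ubuntu20.04"
)
image = image_from_pull_command(pull_command)
print(image)
print(split_image_reference(image))
```

The full image string, including the tag, is what goes into the Image field; the split is only useful for checking that the registry is nvcr.io and that a specific tag was copied.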

To start a Notebook using this container image, go to the Notebook Dashboard and click Create. Select the required resources, data, and model specifications, and expand the Environment section. Select Customer Provided as the Base Environment and paste the image name from the pull command (e.g. nvcr.io/nvidia/rapidsai/rapidsai:22.02-cuda11.4-runtime-ubuntu20.04) as the Image. Additionally, since most images in NGC do not have the jupyterlab package installed by default, you must add it to the pip package dependencies field to ensure the notebook starts properly. You can find more information about using customer provided job images here.
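The Environment settings described above can be summarized as a small data structure. This is only an illustrative sketch: the dictionary keys and the function name are assumptions chosen to mirror the form fields, not the actual trainML job specification schema. It shows the one rule that is easy to forget, namely that jupyterlab must be among the pip packages for a notebook job.

```python
import re
from typing import Dict, List, Optional


def build_notebook_environment(
    image: str, pip_packages: Optional[List[str]] = None
) -> Dict[str, object]:
    """Collect the Environment form values for a notebook on a custom image.

    The keys below are hypothetical labels for the form fields, not a
    documented trainML API payload.
    """
    packages = list(pip_packages or [])

    def package_name(spec: str) -> str:
        # Strip any version specifier, e.g. "jupyterlab==3.2.0" -> "jupyterlab".
        return re.split(r"[^A-Za-z0-9._-]", spec.strip(), maxsplit=1)[0].lower()

    # Most NGC images do not ship jupyterlab, so add it unless already listed.
    if "jupyterlab" not in {package_name(p) for p in packages}:
        packages.append("jupyterlab")

    return {
        "base_environment": "Customer Provided",
        "image": image,
        "pip_packages": packages,
    }


env = build_notebook_environment(
    "nvcr.io/nvidia/rapidsai/rapidsai:22.02-cuda11.4-runtime-ubuntu20.04",
    pip_packages=["pandas"],
)
print(env)
```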

caution

The disk size of a customer provided image counts towards the job's disk size quota (unlike trainML environments). Ensure you reserve enough disk space to accommodate the image size; CUDA layers alone can exceed 3 GB. If the image size is greater than the requested disk space, the job will fail.
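A quick way to sanity-check the disk reservation is to add the image size to the space the job itself needs and round up. The sketch below assumes you already know the uncompressed image size; the function names and the example figures are hypothetical, and the working-space term is an assumption added for illustration rather than a documented trainML rule.

```python
import math


def minimum_disk_gb(image_gb: float, working_gb: float) -> int:
    """Smallest whole-GB reservation that fits the image plus working space."""
    if image_gb < 0 or working_gb < 0:
        raise ValueError("sizes must be non-negative")
    return math.ceil(image_gb + working_gb)


def disk_is_sufficient(requested_gb: float, image_gb: float, working_gb: float) -> bool:
    """True if the requested disk covers the image and the working space."""
    return requested_gb >= minimum_disk_gb(image_gb, working_gb)


# Hypothetical figures: an 8.5 GB image and 10 GB of working space.
print(minimum_disk_gb(8.5, 10))          # smallest reservation that fits
print(disk_is_sufficient(10, 8.5, 10))   # a 10 GB request is too small
print(disk_is_sufficient(20, 8.5, 10))   # a 20 GB request fits
```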

Once you submit the job, the trainML platform will automatically download the container image using your NGC account credentials, install the additional required packages, and start the notebook.