Checkpoints

Checkpoints are a good way to store an immutable version of a model's weights for reuse in other jobs. Checkpoints incur storage charges only for their size and can be used by an unlimited number of jobs simultaneously. Storing checkpoints separately from models gives you the flexibility to update code without changing the weights, or to reuse the same code with different weights.

Public Checkpoints

Public checkpoints are a collection of popular public domain machine learning checkpoint files that are loaded by trainML. If you plan to use one of the checkpoints below in your model, select it in the job form as instructed below instead of provisioning worker storage and downloading it yourself.

Caution: If you attach a public checkpoint to your job, you accept the terms and conditions documented in any license, readme, or model card file present in the mounted checkpoint directory or at the associated URL listed below.

File-based Checkpoints:

Diffusers Compatible Checkpoints:

Language Model Checkpoints:

Speech Recognition Checkpoints:

If you would like a public checkpoint added, please contact us with a link to the checkpoint and a brief description of what you need it for.

Using a Public Checkpoint

Public checkpoints can be used by checking the Public checkbox in the Model section of the job form. Select the desired checkpoint from the list and create the job. Once the job is running you can access the checkpoint in the /opt/trainml/checkpoint directory, or using the TRAINML_CHECKPOINT_PATH environment variable.
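As a minimal sketch of how job code might locate the mounted checkpoint, the snippet below reads TRAINML_CHECKPOINT_PATH and falls back to /opt/trainml/checkpoint when the variable is not set. The helper names checkpoint_dir and list_checkpoint_files are hypothetical and not part of any trainML API. The fallback behavior is also an assumption made for illustration.

```python
import os
from pathlib import Path

# Default mount point named in the documentation above.
DEFAULT_CHECKPOINT_DIR = "/opt/trainml/checkpoint"


def checkpoint_dir(environ=None) -> Path:
    """Return the checkpoint directory, preferring TRAINML_CHECKPOINT_PATH.

    Hypothetical helper: falling back to the default mount point when the
    variable is missing is an assumption, not documented behavior.
    """
    env = os.environ if environ is None else environ
    return Path(env.get("TRAINML_CHECKPOINT_PATH", DEFAULT_CHECKPOINT_DIR))


def list_checkpoint_files(root: Path) -> list[str]:
    """List files under root as sorted relative paths; empty if root is missing."""
    if not root.is_dir():
        return []
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    root = checkpoint_dir()
    print(f"Checkpoint directory: {root}")
    for name in list_checkpoint_files(root):
        print(name)
```

Reading the environment variable rather than hard-coding the directory keeps the same script usable outside a job, for example in a local test where the variable points at a temporary folder.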

Private Checkpoints

Private checkpoints enable you to load a checkpoint once and reuse that checkpoint on any future job or multiple jobs at the same time while only incurring storage charges once based on the size of the checkpoint. Private checkpoints are also included in the 50 GB free storage allotment. Checkpoints are immutable to prevent unexpected data changes impacting jobs. If you need to revise a checkpoint, you must create a new one and remove the old one. The maximum size of any checkpoint is 500 GB, but you can have unlimited checkpoints.
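Before populating a checkpoint from a local directory, it may help to compare the directory size with the 500 GB limit. The sketch below is one way to do that. It assumes "GB" means 10^9 bytes, which the text above does not specify, and the function names are hypothetical.

```python
import os

# Assumption: 1 GB = 10**9 bytes. The documentation does not say whether
# the 500 GB limit is measured in decimal or binary units.
MAX_CHECKPOINT_BYTES = 500 * 10**9


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_checkpoint_limit(size_bytes: int, limit_bytes: int = MAX_CHECKPOINT_BYTES) -> bool:
    """Return True when size_bytes does not exceed the assumed limit."""
    return size_bytes <= limit_bytes
```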

Creating a Checkpoint

Checkpoints can be created from three different sources: external, notebooks, and training/inference job output.

External Checkpoint Source

To create a checkpoint from an external source, navigate to the Checkpoints Dashboard from the side navigation and click the Create button. Specify the name of the new checkpoint in the Name field and then select the Source Type of the location from which to populate the new checkpoint:

  • AWS: Select this option if the checkpoint data resides on Amazon S3.
  • Azure: Select this option if the checkpoint data resides on Azure Blob Storage.
  • Hugging Face: Select this option if the checkpoint data resides in a Hugging Face repository.
  • GCP: Select this option if the checkpoint data resides on Google Cloud Storage.
  • Git: Select this option if the checkpoint data resides in a git repository.
  • Kaggle: Select this option if the checkpoint data is from a Kaggle Competition, Dataset, or Kernel.
  • Local: Select this option if the checkpoint data resides on your local computer. You will be required to connect to the checkpoint for this option to work. Jobs using the local storage options will wait indefinitely for you to connect.
  • Regional Datastore: Select this option to mount the checkpoint directly from a Regional Datastore in an existing CloudBender region.
  • Wasabi: Select this option if the checkpoint data resides on Wasabi Storage.
  • Web: Select this option if the checkpoint data resides on a publicly accessible HTTP or FTP server.

Specify the path of the checkpoint data within the storage type specified in the Path field. If you specify a compressed file (zip, tar, tar.gz, or bz2), the file will be automatically extracted. If you specify a directory path (ending in /), it will run a sync starting from the path provided, downloading all files and subdirectories from the provided path. Valid paths for each Source Type are the following:

  • AWS: Must begin with s3://.
  • Azure: Must begin with https://.
  • Hugging Face: Must be in the format <namespace>/<repo>.
  • GCP: Must begin with gs://.
  • Git: Both http and ssh git repository formats are supported. To use the ssh format, you must configure a git ssh key.
  • Kaggle: Must be the short name of the competition, dataset, or kernel compatible with the Kaggle API.
  • Local: Must begin with / (absolute path), ~/ (home directory relative), or $ (environment variable path). Relative paths (using ./) are not supported.
  • Regional Datastore: Must begin with / (absolute path).
  • Wasabi: Must begin with s3://.
  • Web: Must begin with http://, https://, ftp://, or ftps://.

Source-specific fields:

  • Type (Kaggle Only): The type of Kaggle data you are specifying: Competition, Dataset, or Kernel (Notebook).
  • Endpoint (Wasabi Only): The service URL of the Wasabi bucket you are using.
  • Path (Regional Datastore Only): The subdirectory inside the regional datastore to load the data from. Use / to load the entire datastore.
  • Branch (Hugging Face Only): The branch to download (if not the default).
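The path formats listed above for each Source Type can be checked before submitting the form. The following is a rough client-side sketch only. The source-type keys ("aws", "huggingface", and so on) are hypothetical labels, Git and Kaggle are left out because their rules are not simple prefixes, and the platform's own validation may differ.

```python
# Prefix rules transcribed from the list of valid paths above.
# The dictionary keys are hypothetical labels, not trainML identifiers.
SOURCE_PREFIXES = {
    "aws": ("s3://",),
    "azure": ("https://",),
    "gcp": ("gs://",),
    "local": ("/", "~/", "$"),
    "regional": ("/",),
    "wasabi": ("s3://",),
    "web": ("http://", "https://", "ftp://", "ftps://"),
}

# Archive types the text says are extracted automatically.
ARCHIVE_SUFFIXES = (".zip", ".tar", ".tar.gz", ".bz2")


def is_valid_path(source_type: str, path: str) -> bool:
    """Check a path against the documented format for its source type."""
    key = source_type.lower()
    if key == "huggingface":
        # Must look like <namespace>/<repo>: exactly two non-empty parts.
        parts = path.split("/")
        return len(parts) == 2 and all(parts)
    prefixes = SOURCE_PREFIXES.get(key)
    if prefixes is None:
        raise ValueError(f"no prefix rule sketched for source type: {source_type}")
    return path.startswith(prefixes)


def handling(path: str) -> str:
    """Classify a path as 'sync' (directory), 'extract' (archive), or 'other'."""
    if path.endswith("/"):
        return "sync"
    if path.lower().endswith(ARCHIVE_SUFFIXES):
        return "extract"
    # The documentation does not describe this case, so it is only labeled.
    return "other"
```

For example, is_valid_path("local", "./weights") returns False because relative paths are not supported, and handling("s3://my-bucket/weights/") returns "sync".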

Click Create to start populating the checkpoint. If you selected any option except Local, the checkpoint download will take place automatically and the checkpoint will change to a state of ready when it is complete. If you selected Local, you must connect to the checkpoint by selecting the checkpoint and clicking the Connect button to proceed with the data population.

Notebooks

To create a checkpoint from an existing notebook, select the notebook from the Notebook Dashboard and click Copy. The Copy button is only enabled when a single notebook is selected and that notebook is either running or stopped. Select Save to trainML as the Copy Type. Select Checkpoint from the Type dropdown and enter the name for the new checkpoint in the New Checkpoint Name field. You have the option to copy either the /opt/trainml/models folder or the /opt/trainml/output folder. Select which folder you wish to copy from the Save Directory dropdown and click Copy to begin the copy process. You will be automatically navigated to the Checkpoints Dashboard, where you can monitor the progress of the checkpoint creation.

Training/Inference Job Output

Training or inference jobs can be configured to send their output to a trainML checkpoint instead of an external source. To create a checkpoint from a job, select trainML as the Output Type and checkpoint as the Output URI in the data section of the job form. When each worker in the job finishes, it saves the entire directory structure of /opt/trainml/output to a new checkpoint named Job - <job name> if there is one worker, or Job - <job name> Worker <worker number> if there are multiple workers.
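To find these checkpoints later, it can be useful to compute the names a job is expected to produce. The helper below is a hypothetical sketch of the naming pattern described above. It assumes worker numbers start at 1, which the text does not state, so the starting number is exposed as a parameter.

```python
def expected_checkpoint_names(
    job_name: str, worker_count: int, first_worker: int = 1
) -> list[str]:
    """Return the checkpoint names a job's output is expected to create.

    Assumption: workers are numbered consecutively from first_worker
    (default 1). Adjust first_worker if the platform numbers differently.
    """
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    if worker_count == 1:
        # Single-worker jobs get no worker suffix.
        return [f"Job - {job_name}"]
    return [
        f"Job - {job_name} Worker {n}"
        for n in range(first_worker, first_worker + worker_count)
    ]
```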

Using a Checkpoint

Checkpoints can be used by selecting the checkpoint from the dropdown field in the Model section of the job form. Select the desired checkpoint from the list and create the job. Once the job is running you can access the checkpoint in the /opt/trainml/checkpoint directory, or using the TRAINML_CHECKPOINT_PATH environment variable.

To add a checkpoint using the Python SDK, include it in the checkpoints array in the model dictionary:

await trainml.jobs.create(
    ...
    model=dict(
        ...
        checkpoints=[
            dict(id="stable-diffusion-v2-1", public=True),
            "my-checkpoint",
        ],
    )
)
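When a job attaches several checkpoints, the list can be assembled programmatically. The sketch below builds only the checkpoints value in the shape shown in the example above (a dictionary with public=True for public checkpoints, a plain string for private ones). The helper build_checkpoints is hypothetical and does not call the SDK.

```python
def build_checkpoints(private=(), public=()):
    """Build a checkpoints list in the shape used by the example above.

    Hypothetical helper: public checkpoints become dict(id=..., public=True)
    and private checkpoints are passed through as plain strings.
    """
    entries = [dict(id=checkpoint_id, public=True) for checkpoint_id in public]
    entries.extend(private)
    return entries


# Reproduces the checkpoints list from the example above.
model = dict(
    checkpoints=build_checkpoints(
        private=["my-checkpoint"],
        public=["stable-diffusion-v2-1"],
    )
)
```

The resulting model dictionary still needs whatever other keys the job requires before it is passed to the job creation call.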

To add a checkpoint using the CLI, use the --checkpoint or --public-checkpoint flags:

trainml job create inference <...> \
    --public-checkpoint stable-diffusion-v2-1 \
    --checkpoint my-checkpoint \
    <...>

Removing a Checkpoint

Checkpoints can only be removed once all jobs that are configured to use them are finished. To remove a checkpoint, select the checkpoint and click the Delete button. Because this action is permanent, you will be prompted to confirm before the checkpoint is deleted.