Checkpoints
Checkpoints store an immutable version of a model's weights for reuse in other jobs. Checkpoints only incur storage charges for their size and can be used by an unlimited number of jobs simultaneously. Storing checkpoints separately from models gives you the flexibility to make code updates without changing the weights, or to reuse the same code with different weights.
Public Checkpoints
Public checkpoints are a collection of popular, publicly available machine learning checkpoint files that are loaded by trainML. If you plan to use one of the checkpoints below in your model, select it in the job form as instructed below instead of provisioning worker storage and downloading it yourself.
If you attach a public checkpoint to your job, you accept the terms and conditions documented in any license, readme, or model card file present in the mounted checkpoint directory or at the associated URL listed below.
File-based Checkpoints:
stable-diffusion-v1-4: https://huggingface.co/CompVis/stable-diffusion-v-1-4-original
stable-diffusion-v1-5: https://huggingface.co/runwayml/stable-diffusion-v1-5
stable-diffusion-v2-depth: https://huggingface.co/stabilityai/stable-diffusion-2-depth
stable-diffusion-v2-inpainting: https://huggingface.co/stabilityai/stable-diffusion-2-inpainting
stable-diffusion-v2-upscaler: https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler
stable-diffusion-v2-1: https://huggingface.co/stabilityai/stable-diffusion-2-1
Diffusers Compatible Checkpoints:
stable-diffusion-v1-4-diffuser: https://huggingface.co/CompVis/stable-diffusion-v1-4
stable-diffusion-v1-5-diffuser: https://huggingface.co/runwayml/stable-diffusion-v1-5
stable-diffusion-v2-diffuser: https://huggingface.co/stabilityai/stable-diffusion-2
stable-diffusion-v2-1-diffuser: https://huggingface.co/stabilityai/stable-diffusion-2-1
Language Model Checkpoints:
bloomz-7b1: https://huggingface.co/bigscience/bloomz-7b1
gpt-j-6b: https://huggingface.co/EleutherAI/gpt-j-6B
gpt-j-6b-float16: https://huggingface.co/EleutherAI/gpt-j-6B/tree/float16
gpt-neox-20b: https://huggingface.co/EleutherAI/gpt-neox-20b
Speech Recognition Checkpoints:
whisper-large-v2: https://huggingface.co/openai/whisper-large-v2
whisper-medium: https://huggingface.co/openai/whisper-medium
whisper-small: https://huggingface.co/openai/whisper-small
whisper-base: https://huggingface.co/openai/whisper-base
whisper-tiny: https://huggingface.co/openai/whisper-tiny
If you would like a public checkpoint added, please contact us with a link to the checkpoint and a brief description of what you need it for.
Using a Public Checkpoint
Public checkpoints can be used by checking the Public checkbox in the Model section of the job form. Select the desired checkpoint from the list and create the job. Once the job is running, you can access the checkpoint in the /opt/trainml/checkpoint directory or through the TRAINML_CHECKPOINT_PATH environment variable.
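For example, a job script can locate the mounted files through the environment variable. A minimal sketch, assuming the stable-diffusion-v2-1-diffuser public checkpoint is attached and the diffusers library is installed in the job environment:

import os

from diffusers import StableDiffusionPipeline

# trainML mounts the checkpoint at /opt/trainml/checkpoint and exposes the
# same location through the TRAINML_CHECKPOINT_PATH environment variable.
checkpoint_dir = os.environ.get("TRAINML_CHECKPOINT_PATH", "/opt/trainml/checkpoint")

# Load the diffusers-format weights directly from the mounted directory.
pipe = StableDiffusionPipeline.from_pretrained(checkpoint_dir)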
Private Checkpoints
Private checkpoints enable you to load a checkpoint once and reuse it in any future job, or in multiple jobs at the same time, while incurring storage charges only once based on the size of the checkpoint. Private checkpoints are also included in the 50 GB free storage allotment. Checkpoints are immutable to prevent unexpected data changes from impacting jobs. If you need to revise a checkpoint, you must create a new one and remove the old one. The maximum size of any checkpoint is 500 GB, but you can have unlimited checkpoints.
Creating a Checkpoint
Checkpoints can be created from three different sources: external, notebooks, and training/inference job output.
External Checkpoint Source
To create a checkpoint from an external source, navigate to the Checkpoints Dashboard from the side navigation and click the Create button. Specify the name of the new checkpoint in the Name field and then select the Source Type of the location from which to populate the new checkpoint:
AWS: Select this option if the checkpoint data resides on Amazon S3.
Azure: Select this option if the checkpoint data resides on Azure Blob Storage.
Hugging Face: Select this option if the checkpoint data resides in a Hugging Face repository.
GCP: Select this option if the checkpoint data resides on Google Cloud Storage.
Git: Select this option if the checkpoint data resides in a git repository.
Kaggle: Select this option if the checkpoint data is from a Kaggle Competition, Dataset, or Kernel.
Local: Select this option if the checkpoint data resides on your local computer. You will be required to connect to the checkpoint for this option to work. Jobs using the local storage option will wait indefinitely for you to connect.
Regional Datastore: Select this option to mount the checkpoint directly from a Regional Datastore in an existing CloudBender region.
Wasabi: Select this option if the checkpoint data resides on Wasabi Storage.
Web: Select this option if the checkpoint data resides on a publicly accessible HTTP or FTP server.
Specify the path of the checkpoint data within the selected storage type in the Path field. If you specify a compressed file (zip, tar, tar.gz, or bz2), the file will be automatically extracted. If you specify a directory path (ending in /), a sync will run starting from the path provided, downloading all files and subdirectories. Valid paths for each Source Type are the following:
AWS: Must begin with s3://.
Azure: Must begin with https://.
Hugging Face: Must be in the format <namespace>/<repo>.
GCP: Must begin with gs://.
Git: Both http and ssh git repository formats are supported. To use the ssh format, you must configure a git ssh key.
Kaggle: Must be the short name of the competition, dataset, or kernel compatible with the Kaggle API.
Local: Must begin with / (absolute path), ~/ (home directory relative), or $ (environment variable path). Relative paths (using ./) are not supported.
Regional Datastore: Must begin with / (absolute path).
Wasabi: Must begin with s3://.
Web: Must begin with http://, https://, ftp://, or ftps://.
Source-Specific Fields
Type (Kaggle Only): The type of Kaggle data you are specifying: Competition, Dataset, or Kernel (Notebook).
Endpoint (Wasabi Only): The service URL of the Wasabi bucket you are using.
Path (Regional Datastore Only): The subdirectory inside the regional datastore to load the data from. Use / to load the entire datastore.
Branch (Hugging Face Only): The branch to download (if not the default).
Click Create to start populating the checkpoint. If you selected any option except Local, the checkpoint download will take place automatically and the checkpoint will change to a state of ready when it is complete. If you selected Local, you must connect to the checkpoint by selecting it and clicking the Connect button to proceed with the data population.
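Checkpoints can also be created programmatically with the Python SDK. The sketch below is illustrative only: it assumes a checkpoints.create method whose argument names (name, source_type, source_uri) mirror the form fields above and a wait_for helper for the ready state; check the SDK reference for the exact signature.

# Hypothetical sketch: argument names mirror the job form fields and may
# differ in your SDK version.
checkpoint = await trainml.checkpoints.create(
    name="my-checkpoint",
    source_type="aws",
    source_uri="s3://my-bucket/checkpoints/model-weights/",
)
# Block until the download finishes and the checkpoint reaches "ready".
await checkpoint.wait_for("ready")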
Notebooks
To create a checkpoint from an existing notebook, select the notebook from the Notebook Dashboard and click Copy. The Copy button is only enabled when a single notebook is selected and that notebook is either running or stopped. Select Save to trainML as the Copy Type. Select Checkpoint from the Type dropdown and enter the name for the new checkpoint in the New Checkpoint Name field. You have the option to copy either the /opt/trainml/models folder or the /opt/trainml/output folder. Select which folder you wish to copy from the Save Directory dropdown and click Copy to begin the copy process. You will be automatically navigated to the Checkpoints Dashboard, where you can monitor the progress of the checkpoint creation.
Training/Inference Job Output
Training or inference jobs can be configured to send their output to a trainML checkpoint instead of an external destination. To create a checkpoint from a job, select trainML as the Output Type and checkpoint as the Output URI in the data section of the job form. Once each worker in the job finishes, it will save the entire directory structure of /opt/trainml/output to a new checkpoint named Job - <job name> if there is one worker, or Job - <job name> Worker <worker number> if there are multiple workers.
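In the Python SDK, this corresponds to the data section of the job request. A minimal sketch, assuming the output_type and output_uri keys mirror the form fields above:

await trainml.jobs.create(
    ...
    data=dict(
        ...
        # Send each worker's /opt/trainml/output to a new trainML
        # checkpoint instead of an external destination.
        output_type="trainml",
        output_uri="checkpoint",
    ),
)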
Using a Checkpoint
Checkpoints can be used by selecting the checkpoint from the dropdown field in the Model section of the job form. Select the desired checkpoint from the list and create the job. Once the job is running, you can access the checkpoint in the /opt/trainml/checkpoint directory or through the TRAINML_CHECKPOINT_PATH environment variable.
To add a checkpoint using the Python SDK, include it in the checkpoints array of the model dictionary:
await trainml.jobs.create(
    ...
    model=dict(
        ...
        checkpoints=[
            dict(id="stable-diffusion-v2-1", public=True),
            "my-checkpoint",
        ],
    ),
)
To add a checkpoint using the CLI, use the --checkpoint or --public-checkpoint flags:
trainml job create inference <...> \
--public-checkpoint stable-diffusion-v2-1 \
--checkpoint my-checkpoint \
<...>
Removing a Checkpoint
Checkpoints can only be removed once all jobs that are configured to use them are finished. To remove a checkpoint, select the checkpoint and click the Delete button. Since this action is permanent, you will be prompted to confirm prior to deleting.
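Removal is also possible outside the web interface; a hedged example, assuming the CLI follows the same noun-verb pattern as the job commands above (verify the subcommand name in the CLI reference):

trainml checkpoint remove my-checkpoint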