
Reusable Checkpoints Make Models Even More Flexible


trainML now supports the creation and use of Checkpoints to store immutable versions of large model weight files.

How It Works

You load a checkpoint once and can then reuse it on any future job, or on multiple jobs at the same time, while incurring storage charges only once, based on the size of the checkpoint. Private checkpoints are also included in the 50 GB free storage allotment. Checkpoints are immutable, which prevents unexpected data changes from impacting jobs. If you need to revise a checkpoint, you must create a new one and remove the old one. The maximum size of any single checkpoint is 500 GB, but you can have an unlimited number of checkpoints.
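If the weights are sitting on a local disk before you move them to their source location, a quick size check against the 500 GB limit can save a failed attempt. The following is only a sketch: the helper names are hypothetical, and it assumes "GB" means 10**9 bytes, which this post does not specify.

```python
import os

# Assumption: "GB" is taken as 10**9 bytes. The post does not say whether the
# platform measures in decimal or binary units.
MAX_CHECKPOINT_BYTES = 500 * 10**9


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def fits_in_one_checkpoint(size_bytes: int) -> bool:
    """Return True if the data is within the per-checkpoint size limit."""
    return size_bytes <= MAX_CHECKPOINT_BYTES
```

Data that exceeds the limit would have to be split across more than one checkpoint, since the number of checkpoints is not capped.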

Using the Web Platform

To create a checkpoint, navigate to the Checkpoints Dashboard from the side navigation and click the Create button. Specify the name of the new checkpoint in the Name field and then select the Source Type of the location from which to populate the new checkpoint.

In the Path field, specify the path of the checkpoint data within the selected storage type. If you specify a compressed file (zip, tar, tar.gz, or bz2), the file will be automatically extracted. If you specify a directory path (ending in /), the platform will run a sync starting from that path, downloading all files and subdirectories beneath it.
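The path rules above can be summarized in a few lines of Python. This is an illustration of the described behavior, not trainML's actual implementation; the function name and the returned labels are hypothetical, and the bucket name in the usage lines is a placeholder.

```python
# Illustrative only: mirrors the rules described above, not trainML's own code.
ARCHIVE_SUFFIXES = (".zip", ".tar", ".tar.gz", ".bz2")


def describe_path_handling(path: str) -> str:
    """Return how a checkpoint source path is treated under the rules above."""
    if path.endswith("/"):
        # Directory path: all files and subdirectories are synced.
        return "sync"
    if path.lower().endswith(ARCHIVE_SUFFIXES):
        # Compressed file: automatically extracted.
        return "extract"
    # Any other path is not covered by the text above.
    return "unspecified"


print(describe_path_handling("s3://example-bucket/models/model.zip"))
print(describe_path_handling("s3://example-bucket/models/"))
```

The trailing slash check comes first because it is the only signal that distinguishes a directory from a file in a plain path string.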

To use a checkpoint, select it from the checkpoint dropdown field in the Model section of the job form, then create the job. Once the job is running, you can access the checkpoint in the /opt/trainml/checkpoint directory or through the TRAINML_CHECKPOINT_PATH environment variable.
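Inside a running job, a training script might locate the checkpoint files along these lines. This is a minimal sketch: the helper names are hypothetical, and falling back to the documented directory when the environment variable is unset is a design choice of this example, not documented platform behavior.

```python
import os
from pathlib import Path

# Directory named in the text above, used here as a fallback.
DEFAULT_CHECKPOINT_DIR = "/opt/trainml/checkpoint"


def checkpoint_dir() -> Path:
    """Resolve the checkpoint directory from the environment, else the default."""
    return Path(os.environ.get("TRAINML_CHECKPOINT_PATH", DEFAULT_CHECKPOINT_DIR))


def list_checkpoint_files(root: Path) -> list[str]:
    """Return the sorted relative paths of all files under root."""
    if not root.is_dir():
        return []
    return sorted(str(p.relative_to(root)) for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    root = checkpoint_dir()
    print(f"Checkpoint directory: {root}")
    for name in list_checkpoint_files(root):
        print(name)
```

Reading the environment variable rather than hard-coding the directory keeps the script usable outside a job, for example by pointing the variable at a local copy of the weights.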

Using the SDK

To create a checkpoint using the trainML SDK, use syntax similar to the following:

checkpoint = await trainml.checkpoints.create(
    name="s3-checkpoint",
    source_type="aws",
    source_uri="s3://<model bucket>/<model path>/model.zip",
)

To use a checkpoint in a job, use the following syntax:

job = await trainml.jobs.create(
    name="Checkpoint Job",
    type="training",
    ...
    model=dict(
        source_type="git",
        source_uri="https://github.com/trainML/examples.git",
        checkpoints=[
            "s3-checkpoint",
        ],
    ),
    ...
)