trainML now supports the creation and use of Checkpoints to store immutable versions of large model weight files.
How It Works
Checkpoints enable you to load a checkpoint once and reuse it on any future job, or on multiple jobs at the same time, while incurring storage charges only once, based on the size of the checkpoint. Private checkpoints are also included in the 50 GB free storage allotment. Checkpoints are immutable to prevent unexpected data changes from impacting jobs. To revise a checkpoint, you must create a new one and remove the old one. The maximum size of any checkpoint is 500 GB, but there is no limit on the number of checkpoints you can have.
Using the Web Platform
To create a checkpoint, navigate to the Checkpoints Dashboard from the side navigation and click the Create button. Specify the name of the new checkpoint in the Name field and then select the Source Type of the location from which to populate the new checkpoint.
In the Path field, specify the path of the checkpoint data within the selected storage type. If you specify a compressed file (zip, tar, tar.gz, or bz2), the file will be automatically extracted. If you specify a directory path (ending in /), a sync runs starting from that path, downloading all files and subdirectories beneath it.
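The path rules above can be sketched as a small local pre-check. This is only an illustration of the documented behavior, not the platform's implementation: the helper name `classify_source_path` and the "other" category are hypothetical, and the text does not say how a path that is neither an archive nor a directory is handled.

```python
# Archive extensions the documentation lists as automatically extracted.
ARCHIVE_SUFFIXES = (".zip", ".tar", ".tar.gz", ".bz2")


def classify_source_path(path: str) -> str:
    """Hypothetical helper: predict how a checkpoint source path is treated.

    Returns "directory" for paths ending in "/", "archive" for the listed
    compressed file types, and "other" for anything else (behavior for
    that case is not described in the documentation).
    """
    if path.endswith("/"):
        return "directory"  # synced recursively from this path
    filename = path.rsplit("/", 1)[-1].lower()
    if filename.endswith(ARCHIVE_SUFFIXES):
        return "archive"  # extracted automatically
    return "other"
```

A check like this can catch a missing trailing slash on a directory path before the checkpoint is created.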
To use a checkpoint, select it from the checkpoint dropdown field of the job form and create the job. Once the job is running, you can access the checkpoint in the /opt/trainml/checkpoint directory, or through the TRAINML_CHECKPOINT_PATH environment variable.
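Inside a running job, the checkpoint location can be resolved from the environment variable, with the documented directory as a fallback. The following is a minimal sketch; the helper names `checkpoint_dir` and `list_checkpoint_files` are illustrative and not part of the trainML SDK.

```python
import os
from pathlib import Path
from typing import List

# Directory named in the documentation, used when the variable is unset.
DEFAULT_CHECKPOINT_DIR = "/opt/trainml/checkpoint"


def checkpoint_dir() -> Path:
    """Return the checkpoint directory for the current job."""
    return Path(os.environ.get("TRAINML_CHECKPOINT_PATH", DEFAULT_CHECKPOINT_DIR))


def list_checkpoint_files(root: Path) -> List[str]:
    """List all files under root as sorted, slash-separated relative paths."""
    if not root.is_dir():
        return []
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )
```

Calling `list_checkpoint_files(checkpoint_dir())` at the start of a training script is a quick way to confirm the expected weight files are present before loading them.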
Using the SDK
To create a checkpoint using the trainML SDK, use syntax similar to the following:
checkpoint = await trainml.checkpoints.create(
    name="s3-checkpoint",
    source_type="aws",
    source_uri="s3://<model bucket>/<model path>/model.zip",
)
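Because the call above uses await, it has to run inside a coroutine. The sketch below wraps it in one, assuming the caller passes in an already authenticated SDK client object (called `client` here). The wrapper name `create_s3_checkpoint` is hypothetical, and only the `checkpoints.create` call and its three arguments come from the snippet above.

```python
import asyncio


async def create_s3_checkpoint(client, name: str, source_uri: str):
    """Create a checkpoint from an S3 source via the given SDK client.

    `client` is assumed to expose `checkpoints.create` as shown in the
    documentation; how the client is constructed is not covered here.
    """
    return await client.checkpoints.create(
        name=name,
        source_type="aws",
        source_uri=source_uri,
    )
```

From synchronous code, the coroutine can be driven with `asyncio.run(create_s3_checkpoint(client, "s3-checkpoint", uri))`, where `uri` is the S3 location of the model archive.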
To use a checkpoint in a job, use the following syntax:
job = await trainml.jobs.create(
    name="Checkpoint Job",
    type="training",
    ...
    model=dict(
        source_type="git",
        source_uri="https://github.com/trainML/examples.git",
        checkpoints=[
            "s3-checkpoint",
        ],
    ),
    ...
)