Create Checkpoints and Datasets from Job Outputs

Checkpoints and Datasets are now supported output destinations for trainML Training and Inference jobs.

How It Works

Previously, the trainML output type did not accept the output_uri property. This property can now be set to model (the default if not provided), dataset, or checkpoint. There is also a new output_options field called save_model. It defaults to True when output_uri is model and to False when output_uri is dataset or checkpoint. Currently, save_model can only be changed through the trainML SDK.
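The defaulting rules above can be sketched as a small helper. This is only an illustration: build_output_settings is a hypothetical function, not part of the trainML SDK. It mirrors the documented defaults and returns the output-related keys in the shape used by the SDK example later in this post.

```python
# Hypothetical helper (not part of the trainML SDK): mirrors the documented
# defaults for output_uri and save_model.
VALID_OUTPUT_URIS = ("model", "dataset", "checkpoint")


def build_output_settings(output_uri=None, save_model=None):
    """Return the output-related keys for a job's data dict."""
    # output_uri falls back to "model" when it is not provided.
    uri = "model" if output_uri is None else output_uri
    if uri not in VALID_OUTPUT_URIS:
        raise ValueError(f"output_uri must be one of {VALID_OUTPUT_URIS}, got {uri!r}")
    # save_model defaults to True for "model" and False otherwise.
    if save_model is None:
        save_model = uri == "model"
    return dict(
        output_type="trainml",
        output_uri=uri,
        output_options=dict(save_model=save_model),
    )


if __name__ == "__main__":
    print(build_output_settings())
    print(build_output_settings("checkpoint"))
```

Passing save_model explicitly overrides the default, which corresponds to the SDK-only override described above.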

When save_model is set to True, the TRAINML_OUTPUT_PATH environment variable is set to /opt/trainml/models instead of /opt/trainml/output, and the contents of /opt/trainml/models are uploaded to the output destination. When it is set to False, TRAINML_OUTPUT_PATH remains set to /opt/trainml/output and only that directory is uploaded to the output destination.
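Because the upload directory depends on save_model, a training script is usually safer reading TRAINML_OUTPUT_PATH than hard-coding either path. The following is a minimal sketch of that pattern. The helper names resolve_output_dir and save_artifact are hypothetical, and the fallback to /opt/trainml/output when the variable is unset is an assumption made so the script can also run outside a job.

```python
import os
from pathlib import Path


def resolve_output_dir(default="/opt/trainml/output"):
    """Return the directory that will be uploaded when the job finishes."""
    # TRAINML_OUTPUT_PATH is set by the platform; the default is an assumed
    # fallback for running the script outside of a trainML job.
    return Path(os.environ.get("TRAINML_OUTPUT_PATH", default))


def save_artifact(name, payload, output_dir=None):
    """Write bytes to a file inside the output directory and return its path."""
    target_dir = Path(output_dir) if output_dir is not None else resolve_output_dir()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)
    return target
```

With this pattern the same script works whether the job writes to a model, a dataset, or a checkpoint, since only the environment variable changes.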

Using the Web Platform

To create a dataset or checkpoint from a job, select trainML as the Output Type and the desired entity type in the data section of the job form. Once each worker in the job finishes, it will save the entire directory structure of /opt/trainml/output to a new dataset or checkpoint. The new entity is named Job - <job name> if there is one worker, or Job - <job name> Worker <worker number> if there are multiple workers.
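To locate the resulting entities later, it can help to compute the names a job is expected to produce. The sketch below follows the naming pattern described above. expected_entity_names is a hypothetical helper, and numbering workers from 1 is an assumption, so the starting number is exposed as a parameter.

```python
def expected_entity_names(job_name, worker_count, first_worker_number=1):
    """Return the dataset or checkpoint names a job is expected to create."""
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    # A single-worker job produces one entity without a worker suffix.
    if worker_count == 1:
        return [f"Job - {job_name}"]
    # Multi-worker jobs produce one entity per worker.
    # Assumption: worker numbers start at first_worker_number.
    return [
        f"Job - {job_name} Worker {number}"
        for number in range(first_worker_number, first_worker_number + worker_count)
    ]
```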

Using the SDK

To save a training job's output to a checkpoint instead of a model, use the following syntax:

job = await trainml.jobs.create(
    "Training Checkpoint Output",
    type="training",
    ...
    data=dict(
        ...
        output_type="trainml",
        output_uri="checkpoint",
        output_options=dict(save_model=False),
    ),
    ...
)