Checkpoints and Datasets are now supported output destinations for trainML Training and Inference jobs.
How It Works
Previously, the trainML output type did not accept the output_uri property. This property can now be specified as model (the default if not provided), dataset, or checkpoint. Additionally, there is a new output_options field called save_model. This field is set to True by default when output_uri is model and False when it is dataset or checkpoint. Currently, this field can only be changed when using the trainML SDK.
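The default rules above can be summarized in a few lines of Python. This is only an illustrative sketch: resolve_save_model and VALID_OUTPUT_URIS are hypothetical names invented for this example and are not part of the trainML SDK.

```python
from typing import Optional

# Accepted output_uri values for the trainML output type, as described above.
VALID_OUTPUT_URIS = ("model", "dataset", "checkpoint")


def resolve_save_model(output_uri: Optional[str] = None,
                       save_model: Optional[bool] = None) -> bool:
    """Hypothetical helper: return the effective save_model value."""
    uri = output_uri or "model"  # "model" is the default when not provided
    if uri not in VALID_OUTPUT_URIS:
        raise ValueError(f"unsupported output_uri: {uri!r}")
    if save_model is not None:
        return save_model  # an explicit setting (SDK only) takes precedence
    # Default: True for model output, False for dataset or checkpoint.
    return uri == "model"
```

Calling resolve_save_model() with no arguments mirrors a job that omits both fields, while resolve_save_model("checkpoint") mirrors the checkpoint default.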
When save_model is set to True, the TRAINML_OUTPUT_PATH environment variable is set to /opt/trainml/models instead of /opt/trainml/output, and the contents of /opt/trainml/models are uploaded to the output destination. When it is set to False, TRAINML_OUTPUT_PATH remains set to /opt/trainml/output and only that directory is uploaded to the output destination.
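Because the upload directory depends on save_model, a training script is probably safest reading TRAINML_OUTPUT_PATH rather than hard-coding either path. The sketch below assumes that approach; save_results, the metrics.json filename, and the local ./output fallback are illustrative choices, not platform behavior.

```python
import json
import os
from pathlib import Path


def save_results(metrics: dict, filename: str = "metrics.json") -> Path:
    """Hypothetical helper: write results where the job will upload them."""
    # TRAINML_OUTPUT_PATH is set by the platform inside a job. The "./output"
    # fallback is an assumption so the sketch also runs outside a job.
    output_dir = Path(os.environ.get("TRAINML_OUTPUT_PATH", "./output"))
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / filename
    target.write_text(json.dumps(metrics, indent=2))
    return target
```

Written this way, the same script works unchanged whether the job saves to a model, a dataset, or a checkpoint.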
Using the Web Platform
To create a dataset or checkpoint from a job, select trainML as the Output Type and choose the desired entity type in the data section of the job form. Once each worker in the job finishes, it saves the entire directory structure of /opt/trainml/output to a new dataset or checkpoint named Job - <job name> if there is one worker, or Job - <job name> Worker <worker number> if there are multiple workers.
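To locate the resulting entity later, it can help to reproduce the naming rule in code. The function below is a hypothetical helper written for this example, not an SDK call, and it takes the worker number as an argument because the text does not say how workers are numbered.

```python
def output_entity_name(job_name: str, worker_number: int, worker_count: int) -> str:
    """Hypothetical helper: name of the dataset or checkpoint a worker creates."""
    base = f"Job - {job_name}"
    if worker_count == 1:
        # Single-worker jobs use the job name alone.
        return base
    # Multi-worker jobs append the worker number.
    return f"{base} Worker {worker_number}"
```

For example, output_entity_name("resnet", 2, 3) returns "Job - resnet Worker 2".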
Using the SDK
To save a training job's output to a checkpoint instead of a model, use the following syntax:
job = await trainml.jobs.create(
    "Training Checkpoint Output",
    type="training",
    ...
    data=dict(
        ...
        output_type="trainml",
        output_uri="checkpoint",
        output_options=dict(save_model=False),
    ),
    ...
)
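Saving to a dataset should differ only in the output_uri value. As a sketch, the output-related keys of the data argument can be built with a small function; trainml_output_keys is a hypothetical name, and the other data keys elided above still have to be supplied by the caller.

```python
def trainml_output_keys(entity: str) -> dict:
    """Hypothetical helper: output-related keys for the job's data argument."""
    if entity not in ("dataset", "checkpoint"):
        raise ValueError(f"expected 'dataset' or 'checkpoint', got {entity!r}")
    return dict(
        output_type="trainml",
        output_uri=entity,
        # Matches the default for dataset and checkpoint outputs.
        output_options=dict(save_model=False),
    )


# Only the output-related keys; merge with the rest of the job's data settings.
dataset_output = trainml_output_keys("dataset")
```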