trainML provides an open-source library for programmatic access to trainML resources. It includes a Python module as well as a series of shell commands.
If you install the module in a virtual environment, the shell commands are only available when that virtual environment is activated.
Install the trainML PyPI package into a Python virtual environment using Python 3.8 or above:
pip install trainml
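To confirm that the active environment meets the requirement above, a quick check like the following can help. It is a minimal sketch using only the standard library; the function name environment_report is hypothetical and is not part of the trainML SDK.

```python
import importlib.util
import sys


def environment_report(min_version=(3, 8), module_name="trainml"):
    """Report whether the interpreter is new enough and the module is importable.

    environment_report is a hypothetical helper for illustration only.
    """
    return {
        # sys.version_info compares element-wise against a plain tuple.
        "python_ok": sys.version_info >= min_version,
        # find_spec returns None when a top-level module cannot be found.
        "module_installed": importlib.util.find_spec(module_name) is not None,
    }
```

Calling environment_report() inside the activated virtual environment should show both values as True once the package is installed.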
In order to use the SDK, you must generate API keys for your trainML account and supply them to the SDK. To create new API keys, go to the account settings page and click the Create button in the API Keys section. This will automatically download a credentials.json file. This file can only be generated once per API key.
Treat this file as a password, as anyone with access to your API key will have the ability to create and control resources in your trainML account.
You can deactivate any API key by clicking the
The easiest way to authenticate is to place the downloaded credentials file in the .trainml folder of your home directory and ensure that only you have access to it. From the directory where the credentials.json file was downloaded, run the following commands:
mkdir -p ~/.trainml
mv credentials.json ~/.trainml/credentials.json
chmod 600 ~/.trainml/credentials.json
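To verify that the credentials file ended up in the expected place with owner-only permissions, something like the following could be used. It is a sketch that assumes a POSIX system; check_credentials_file is a hypothetical helper, not an SDK function, and it only inspects the file's location and mode, never its contents.

```python
import os
import stat
from pathlib import Path


def check_credentials_file(path=None):
    """Return a list of problems found; an empty list means the file looks fine.

    check_credentials_file is a hypothetical helper for illustration only.
    """
    # Default to the location described above; pass a path to check elsewhere.
    if path is None:
        path = Path.home() / ".trainml" / "credentials.json"
    path = Path(path)

    problems = []
    if not path.is_file():
        problems.append(f"{path} does not exist")
        return problems

    mode = stat.S_IMODE(path.stat().st_mode)
    # Any group or other permission bit means someone else may access the key.
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(f"{path} has mode {oct(mode)}; expected 0o600")
    return problems
```

An empty list from check_credentials_file() suggests the chmod 600 step took effect.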
For more ways to configure authentication, review the readme file on GitHub.
The trainML SDK uses the asyncio library to ease the concurrent execution of long-running tasks. The following example creates a dataset from an S3 bucket and immediately runs a training job on that dataset:
from trainml.trainml import TrainML
import asyncio

trainml_client = TrainML()

# Create the dataset
dataset = asyncio.run(
    trainml_client.datasets.create(
        name="Example Dataset",
        source_type="aws",
        source_uri="s3://trainml-examples/data/cifar10",
    )
)
print(dataset)

# Watch the log output, attach will return when data transfer is complete
asyncio.run(dataset.attach())

# Create the job
job = asyncio.run(
    trainml_client.jobs.create(
        name="Example Training Job",
        type="training",
        gpu_type="GTX 1060",
        gpu_count=1,
        disk_size=10,
        workers=[
            "PYTHONPATH=$PYTHONPATH:$TRAINML_MODEL_PATH python -m official.vision.image_classification.resnet_cifar_main --num_gpus=1 --data_dir=$TRAINML_DATA_PATH --model_dir=$TRAINML_OUTPUT_PATH --enable_checkpoint_and_export=True --train_epochs=10 --batch_size=1024",
        ],
        data=dict(
            datasets=[dict(id=dataset.id, type="existing")],
            output_uri="s3://trainml-examples/output/resnet_cifar10",
            output_type="aws",
        ),
        model=dict(git_uri="firstname.lastname@example.org:trainML/test-private.git"),
    )
)
print(job)

# Watch the log output, attach will return when the training job stops
asyncio.run(job.attach())

# Cleanup job and dataset
asyncio.run(job.remove())
asyncio.run(dataset.remove())
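The example above runs each step one after another. To illustrate what asyncio offers for concurrent work, the sketch below waits on two long-running operations at once with asyncio.gather. The coroutine fake_transfer is a hypothetical stand-in that only sleeps; it is not a trainML API, and whether a given SDK call is suitable for this pattern is an assumption to verify against the SDK documentation.

```python
import asyncio


async def fake_transfer(name, seconds):
    """Hypothetical stand-in for a long-running call (not a trainML API)."""
    await asyncio.sleep(seconds)  # simulates waiting on a remote operation
    return f"{name} ready"


async def main():
    # gather() schedules both coroutines and waits for both to finish.
    # Results come back in the order the coroutines were passed in,
    # regardless of which one finishes first.
    return await asyncio.gather(
        fake_transfer("dataset-a", 0.2),
        fake_transfer("dataset-b", 0.1),
    )


def run_demo():
    # A single asyncio.run() call drives the whole concurrent workflow.
    return asyncio.run(main())
```

With this structure, the total wait is roughly the longest individual wait rather than the sum of both.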
For more examples of how to use the SDK to create, monitor, and remove trainML resources, review the examples provided here
Command Line Interface
The command line interface is rooted in the trainml command. To see the available options, run:
To list all jobs:
trainml job list
To list all datasets:
trainml dataset list
To connect to a job that requires the connection capability:
trainml job connect <job ID or name>
To watch the realtime job logs:
trainml job attach <job ID or name>
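If these commands need to be driven from a script, one option is to call the CLI through Python's subprocess module. The following is a minimal sketch; trainml_command and run_trainml are hypothetical helper names, only the subcommands shown above are assumed to exist, and the format of the CLI's output is not assumed.

```python
import subprocess


def trainml_command(*args):
    """Build the argument list for a trainml CLI call, e.g. ("job", "list")."""
    # Passing a list (rather than one shell string) keeps job names that
    # contain spaces intact as a single argument.
    return ["trainml", *args]


def run_trainml(*args):
    """Run the command and return its stdout as text.

    Raises subprocess.CalledProcessError if the CLI exits with a non-zero
    status, and FileNotFoundError if the trainml command is not on PATH.
    """
    result = subprocess.run(
        trainml_command(*args), capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, run_trainml("job", "list") would execute the same command as trainml job list, provided the virtual environment containing the CLI is active.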
For more details on how to use the CLI to create, monitor, and remove trainML resources, review the readme here