Implementing an Automated Model Training CI/CD Pipeline
This tutorial shows how to initiate trainML training jobs using GitHub Actions as part of a model CI/CD pipeline.
The model training code is sourced from the PyTorch Examples repository.
Prerequisites
Before beginning this example, ensure that you have satisfied the following prerequisites.
- A valid trainML account with a non-zero credit balance
- A GitHub account
GitHub Setup
On GitHub, go to the example repository and fork it using the Fork button in the upper right. Once the fork is complete, navigate to the Settings tab of the new repository and click Secrets. Create two new repository secrets (through the New Repository Secret button in the upper right) called TRAINML_USER and TRAINML_KEY. Set their values to the user and key properties in the credentials.json file of a trainML API key for your trainML account.
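As a minimal sketch of where these two values come from, the snippet below parses the text of a credentials.json file and maps its user and key properties to the two secret names. The function name extract_secrets and the sample values are hypothetical; only the user and key property names and the secret names come from this tutorial.

```python
import json


def extract_secrets(credentials_text):
    """Map the contents of credentials.json to the two GitHub secret names."""
    credentials = json.loads(credentials_text)
    # "user" and "key" are the property names described in this tutorial.
    return {
        "TRAINML_USER": credentials["user"],
        "TRAINML_KEY": credentials["key"],
    }


# Placeholder values for illustration only, not real credentials.
sample = '{"user": "example-user", "key": "example-key"}'
secrets = extract_secrets(sample)
print(sorted(secrets))
```

The values themselves are only ever pasted into the GitHub secrets form; they should not be committed to the repository.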
Go to the Actions tab of your repository and click the button to enable workflows.
Executing the Workflow
To activate the workflow, make a minor change to the README.md file and commit your changes to the master branch. On the Actions tab, you will see a new workflow run. When the workflow run completes, expand the Create Training Job step to see the log output from the training job creation.
Navigate to the Training Job Dashboard. You will see a new job called Git Automated Training - <commit hash>. Click the View button to observe the training progress.
When the training job finishes, navigate to the Models Dashboard. Here you will see the saved model with both the code and the training artifacts from the specific commit that originally initiated the workflow. This model can now be used for subsequent inference jobs, or examined with a notebook.
Understanding the Process
The .github/workflows and scripts folders contain the files that facilitate this process. The rest of the repository is the model code itself. The .github/workflows folder contains the yml files that define the GitHub workflows, and the scripts folder contains the files that are run by the steps in the workflow. The files in the scripts folder use the trainML SDK to provision resources on the trainML platform.
run-model-training.yml
This file defines the GitHub Workflow. The contents are the following:
name: Run Model Training
on: [push]
jobs:
  Create-Training-Job:
    if: ${{ github.ref == 'refs/heads/master' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: 3.8
      - name: Install Dependencies
        run: |
          python -m pip install --upgrade pip
          pip install trainml
      - name: Create Training Job
        run: python scripts/run_training.py
        env:
          TRAINML_USER: ${{ secrets.TRAINML_USER }}
          TRAINML_KEY: ${{ secrets.TRAINML_KEY }}
This file directs GitHub to run this workflow when a change is pushed to the repository. The workflow consists of a single job called Create-Training-Job that will only run if the push is to the master branch. This job has four steps. The first pulls the repository and checks out the commit, the next two set up the environment and install the dependencies, and the last runs the run_training.py script from the scripts folder. The last step uses GitHub Secrets to make the trainML API key available as environment variables. Because these exact environment variable names are used, the trainML SDK picks them up implicitly for authentication.
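Because authentication depends on these two variables being present, it can help to fail early with a clear message when a secret was never configured. The following is a minimal sketch using only the standard library; the helper name missing_credentials is hypothetical and is not part of the trainML SDK. It treats an empty value as missing, since GitHub substitutes an empty string for a secret that has not been set.

```python
import os

REQUIRED_VARIABLES = ("TRAINML_USER", "TRAINML_KEY")


def missing_credentials(environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


# Illustration with a plain dict standing in for os.environ.
print(missing_credentials({"TRAINML_USER": "example-user"}))  # prints ['TRAINML_KEY']

# Report on the real environment without stopping the script.
missing = missing_credentials(os.environ)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

A script such as run_training.py could call this check before creating the client and exit with a non-zero status if the returned list is not empty.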
run_training.py
This file gets called as the last step in the GitHub Workflow and actually creates the training job for the commit that activated the workflow. The contents are the following:
from trainml import TrainML
import asyncio
import os

trainml_client = TrainML()


async def create_job(build_serial):
    print(build_serial)
    job = await trainml_client.jobs.create(
        name=f"Git Automated Training - {build_serial}",
        type="training",
        gpu_type="RTX 2080 Ti",
        gpu_count=1,
        disk_size=10,
        worker_commands=[
            f"git checkout {build_serial} && python main.py --epochs 2 --batch-size 64 --arch resnet50 $TRAINML_DATA_PATH 2>&1 | tee train.log"
        ],
        data=dict(
            datasets=[dict(id="ImageNet", type="public")],
            output_type="trainml",
        ),
        model=dict(
            source_type="git",
            source_uri=f"{os.environ.get('GITHUB_SERVER_URL')}/{os.environ.get('GITHUB_REPOSITORY')}.git",
        ),
    )
    return job


if __name__ == "__main__":
    job = asyncio.run(create_job(os.environ.get("GITHUB_SHA")))
    # Job information should be saved in a persistent datastore to pull for status and verify completion
    print(job)
The create_job function creates a new job with the name Git Automated Training - {build_serial}, where build_serial is read from the GITHUB_SHA environment variable. The worker command explicitly checks out the specific commit that triggered the training job, which avoids a race condition between job creation and new commits being pushed to the master branch. The rest of the command simply runs the model training code with the desired parameters.
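One detail worth noting is that os.environ.get("GITHUB_SHA") returns None when the script runs outside GitHub Actions, which would produce a "git checkout None" command. The sketch below builds the same worker command but validates the commit first. The helper name build_worker_command and the 40-character hexadecimal check are assumptions made for illustration (they fit the SHA-1 commit IDs GitHub currently provides); the command string itself is the one used in run_training.py.

```python
import re

# Assumption: GITHUB_SHA is a full 40-character lowercase SHA-1 hash.
SHA_PATTERN = re.compile(r"[0-9a-f]{40}")


def build_worker_command(build_serial):
    """Return the worker command pinned to one commit, or raise ValueError."""
    if not build_serial or not SHA_PATTERN.fullmatch(build_serial):
        raise ValueError(f"Not a full commit SHA: {build_serial!r}")
    return (
        f"git checkout {build_serial} && "
        "python main.py --epochs 2 --batch-size 64 --arch resnet50 "
        "$TRAINML_DATA_PATH 2>&1 | tee train.log"
    )


# Placeholder commit hash for illustration only.
example_sha = "0123456789abcdef0123456789abcdef01234567"
print(build_worker_command(example_sha))
```

Validating the value also guards against passing an unexpected string into a shell command.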
Since the purpose of this process is to evaluate the training results of new model code updates, this file hard-codes the dataset (in this case, ImageNet) so that all commits are trained on the same data. The model definition is specified using boilerplate code that automatically sets it to the git repository that is calling this workflow.
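To make the repository lookup concrete, here is a small sketch of how the source_uri is assembled from the GITHUB_SERVER_URL and GITHUB_REPOSITORY variables that GitHub Actions sets for every run. The helper name build_source_uri and the example owner and repository names are hypothetical.

```python
def build_source_uri(environ):
    """Build the git clone URL of the repository that triggered the workflow."""
    # Indexing (instead of .get) raises KeyError if a variable is missing,
    # rather than silently producing a URL that contains "None".
    server_url = environ["GITHUB_SERVER_URL"]
    repository = environ["GITHUB_REPOSITORY"]
    return f"{server_url}/{repository}.git"


# Placeholder values in the shape GitHub Actions provides.
example_environ = {
    "GITHUB_SERVER_URL": "https://github.com",
    "GITHUB_REPOSITORY": "example-owner/example-repo",
}
print(build_source_uri(example_environ))
# prints https://github.com/example-owner/example-repo.git
```

Because the URL is derived from the environment, the same script works unchanged in any fork of the repository.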
Because the output_type is set to trainml, the results of this training run will be saved as a trainML Model whose name is the job name prefixed by "Job - ", so the model name will also include the build serial.
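If a later pipeline step needs to find the saved model, the expected name can be derived from the commit hash alone. The sketch below assumes the prefix is exactly "Job - " followed by the job name, as described above; the function name expected_model_name is hypothetical, and the exact spacing should be checked against the Models Dashboard.

```python
def expected_model_name(build_serial):
    """Return the model name this tutorial's job is expected to produce."""
    job_name = f"Git Automated Training - {build_serial}"
    # Assumption: the platform prefixes the job name with "Job - ".
    return f"Job - {job_name}"


# Placeholder commit hash for illustration only.
print(expected_model_name("0123456789abcdef0123456789abcdef01234567"))
```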