trainML Documentation

Skip the Cloud Data Transfers with Local Storage

June 17, 2020

trainML

trainML training jobs can now run on data directly from your local computer and upload their results back without using any cloud intermediary. If you already have the data set on your local computer and want to avoid the repetitive cycle of uploading and downloading from cloud storage, this storage type is for you.

How It Works

The Local option is now available in both the Input Data Type and Output Type fields. When this option is selected, specify the storage path as the location on your local computer that you want the data copied from or to. The path must be specified as an absolute path (starting with /, like /home/username/data), a home directory relative path (starting with ~/, like ~/data), or an environment variable based path (starting with $, like $HOME/data, where the HOME environment variable on your local computer is set to /home/username). Relative paths (starting with ./) are not supported.
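As a quick illustration of the three supported forms, the snippet below shows how a home-relative path and an environment-variable path resolve to the same absolute location (the username and directory names are examples only, not required by the platform):

```shell
# Illustration of the three supported storage path forms; the username
# and directory names are examples. Tilde and $HOME resolve identically.
DATA_ABS="/home/username/data"   # absolute path
DATA_HOME=~/data                 # home-directory relative path
DATA_ENV="$HOME/data"            # environment-variable based path
echo "$DATA_HOME"
echo "$DATA_ENV"
# Relative paths like ./data are rejected by the job form.
```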

Downloading and running the connection utility is mandatory for local storage types. Jobs will wait indefinitely for you to connect before downloading data or uploading their results. Additionally, the storage path must exist in the same environment in which you run the connection utility. For example, if you run the connection utility inside a Linux virtual machine (VM), the storage path must be the path of the data inside the VM, not on the host computer.

When using the Local storage type for input data, the contents of the specified directory will be recursively copied to the trainML data path (accessible with the TRAINML_DATA_PATH environment variable) for the workers to access. No automatic extraction of archives will occur, so ensure that the data is already unarchived on your local computer. Additionally, we recommend that you use an isolated path for the input data, which contains only the data you need for this training job and nothing else. Copying more data than necessary will needlessly delay the workers from starting. The speed of this process will be primarily limited by your internet connection's upstream bandwidth; however, the data download duration only costs your patience, not your credits.
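For example, rather than pointing the job at a large, mixed-purpose directory, you can stage just the files the job needs into a dedicated directory and use that as the storage path. A small sketch (the directory and file names are illustrative; mktemp stands in for real paths so the snippet runs anywhere):

```shell
# Stage only the files the job needs into an isolated directory so the
# upload to the workers stays as small as possible. mktemp stands in
# for real dataset and input directories; the names are illustrative.
SRC=$(mktemp -d)                     # your full, mixed-purpose data dir
touch "$SRC/data_batch_1.bin" "$SRC/scratch-notes.txt"
INPUT=$(mktemp -d)                   # isolated dir to use as the storage path
cp "$SRC"/*.bin "$INPUT"/            # copy only the training files
ls "$INPUT"                          # data_batch_1.bin only; the notes stay behind
```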

When using the Local storage type for output artifacts, the contents of the trainML output path (accessible with the TRAINML_OUTPUT_PATH environment variable) will be zipped with a naming convention of JobName_WorkerNumber_Date_Time.zip and uploaded to the specified local directory.
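Because of this naming convention, you can locate a specific worker's artifact by globbing on the job name and worker number. A hedged sketch (the job name and timestamp are made up, and python3 -m zipfile is used here only so the example runs without the zip/unzip tools installed):

```shell
# Sketch: create and extract an artifact zip following the
# JobName_WorkerNumber_Date_Time.zip convention. The job name and
# timestamp are illustrative, not produced by the platform.
OUT=$(mktemp -d)
touch "$OUT/model.ckpt"
ZIP="$OUT/Local_Storage_Example_1_2020-06-17_12-00.zip"
python3 -m zipfile -c "$ZIP" "$OUT/model.ckpt"   # build a sample artifact
ls "$OUT"/Local_Storage_Example_1_*.zip           # glob by job name and worker
python3 -m zipfile -e "$ZIP" "$OUT/extracted"     # extract the contents
ls "$OUT/extracted"
```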

As with downloading, workers will wait indefinitely for you to connect before uploading their results. However, you continue to be billed while workers are uploading. We recommend that you stay connected for the entire duration of any job that uses the local storage option.

Example Training Job

As an example of how to use this storage type, the following adapts the TensorFlow CIFAR-10 Image Classification Tutorial to use the local storage type. The model preparation step is identical, so if you have already performed that, you can reuse that repository, or simply use ours.

Data Preparation

The data preparation step is significantly simplified: the only requirement is to download the data to your local computer. To run this example from the root of your home directory, run the following commands:

# Clone the example repository and create the input and output directories
cd ~
git clone -b r2.1.0 https://github.com/trainML/tensorflow-example.git
cd tensorflow-example
mkdir data
mkdir output

# Download CIFAR-10, then remove the archive and flatten the directory
python official/r1/resnet/cifar10_download_and_extract.py --data_dir=$(pwd)/data
cd data
rm cifar-10-binary.tar.gz
mv cifar-10-batches-bin/* .
cd ..

Model Training

At this point, you should have a data folder inside the root of the code repository that contains the data set, and an output folder that is currently empty. Log in to the trainML platform and click the Create Training Job button to open a new job form. Populate the fields with the following values:

  • Job Name: Local Storage Example
  • GPU Class: small
  • GPU Count: 1
  • Job Environment: Deep Learning - Python 3.7
  • Model Code Location: -b r2.1.0 https://github.com/trainML/tensorflow-example.git
  • Input Data Type: Local
  • Input data storage path: ~/tensorflow-example/data
  • Output Type: Local
  • Artifact output storage path: ~/tensorflow-example/output
  • Number of Workers: 1
  • Command for Worker 1: PYTHONPATH=$PYTHONPATH:$TRAINML_MODEL_PATH python -m official.vision.image_classification.resnet_cifar_main --num_gpus=1 --data_dir=$TRAINML_DATA_PATH --model_dir=$TRAINML_OUTPUT_PATH --enable_checkpoint_and_export=True --train_epochs=10 --batch_size=1024

Click Next and then Create on the job form review page to start the job. When the job is in the running state, it will wait for you to connect before it tries to download the data set. Connect to the job by following the instructions in the getting started guide. Once you connect, the log viewer will show the data downloading to the workers. Once that completes, you will see the output from the worker itself during training, and finally the zipping and uploading of the results.

After the job completes, navigate to the ~/tensorflow-example/output directory on your local computer. You will see a file with the format JobName_WorkerNumber_Date_Time.zip. Unzip it, and you can continue with the analyzing-the-output step of the tutorial.
