trainML Documentation
Datasets

Datasets are a great option for reducing the storage requirements of jobs and for reusing data across many jobs. Public datasets are completely free to use, and private datasets incur storage charges only once, based on their own size, no matter how many jobs they are used on (trainML provider only).

Public Datasets

Public datasets are a collection of popular public domain machine learning datasets that are loaded and maintained by trainML. If you are planning to use one of the below datasets in your model, be sure to select it in the job form as instructed below instead of provisioning worker storage and downloading it yourself.

Image Classification

  • CIFAR-10: https://www.cs.toronto.edu/~kriz/cifar.html
  • CIFAR-100: https://www.cs.toronto.edu/~kriz/cifar.html
  • ImageNet: http://www.image-net.org/

Text Processing

  • MultiNLI: https://cims.nyu.edu/~sbowman/multinli/
  • SNLI: https://nlp.stanford.edu/projects/snli/
  • WikiText-103: https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/

Object Detection/Segmentation

  • COCO: https://cocodataset.org/#home
  • PASCAL VOC: http://host.robots.ox.ac.uk/pascal/VOC/

If you would like a public dataset added, please contact us with a link to the dataset and a brief description of what you need it for.

Using a Public Dataset

Public datasets can be used by selecting Public Dataset from the Dataset Type field in the Data section of the job form. Select the desired dataset from the list and create the job. Once the job is running you can access the dataset in the /opt/trainml/input directory, or using the TRAINML_DATA_PATH environment variable.
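Inside a running job, the dataset location can be resolved in a few lines of Python. This is a minimal sketch based on the behavior described above: the dataset is mounted at /opt/trainml/input, and the TRAINML_DATA_PATH environment variable points to the same location.

```python
import os

def dataset_path() -> str:
    """Return the dataset mount point inside a running trainML job.

    Prefers the TRAINML_DATA_PATH environment variable when set,
    falling back to the documented default of /opt/trainml/input.
    """
    return os.environ.get("TRAINML_DATA_PATH", "/opt/trainml/input")
```

For example, `os.listdir(dataset_path())` would list the top-level files of the attached dataset.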

Private Datasets

Private datasets enable you to load a dataset once and reuse it on any future job, or on multiple jobs at the same time, while incurring storage charges only once based on the size of the dataset (trainML provider only). Private datasets are also included in the 50 GB free storage option. Datasets are immutable to prevent unexpected data changes from impacting jobs. If you need to revise a dataset, you must create a new one and remove the old one. The maximum size of any single dataset is 500 GB, but you can have an unlimited number of datasets.

Creating a Dataset

From the Datasets section, click the Create button. Specify the name of the new dataset in the Name field and then select the Source Type of the data to populate the new dataset:

  • Local: Select this option if the data resides on your local computer. You will be required to connect to the dataset for this option to work. Jobs using the local storage options will wait indefinitely for you to connect.
  • HTTP: Select this option if the data resides on a publicly accessible HTTP or FTP server.
  • AWS: Select this option if the data resides on Amazon S3.
  • GCP: Select this option if the data resides on Google Cloud Storage.
  • Kaggle: Select this option if the data is from a Kaggle Competition or Dataset.

In the Path field, specify the path of the data within the selected storage type. If you specify a compressed file (zip, tar, tar.gz, or bz2), the file will be downloaded and automatically extracted before any worker starts. If you specify a directory path (ending in /), a sync will run from that path, downloading all files and subdirectories beneath it. Valid paths for each Source Type are the following:

  • Local: Must begin with / (absolute path), ~/ (home directory relative), or $ (environment variable path). Relative paths (using ./) are not supported.
  • HTTP: Must begin with http://, https://, ftp://, or ftps://.
  • AWS: Must begin with s3://.
  • GCP: Must begin with gs://.
  • Kaggle: Must be the short name of the competition or dataset, as used by the Kaggle API.
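The path rules above can be mirrored client-side before submitting the form. The helper below is a hypothetical sketch (not part of the trainML tooling) that validates a path against its Source Type and reports whether it will be treated as an archive to extract, a directory to sync, or a single file:

```python
# Hypothetical validator mirroring the documented path rules.
VALID_PREFIXES = {
    "Local": ("/", "~/", "$"),
    "HTTP": ("http://", "https://", "ftp://", "ftps://"),
    "AWS": ("s3://",),
    "GCP": ("gs://",),
}
ARCHIVE_SUFFIXES = (".zip", ".tar", ".tar.gz", ".bz2")

def check_source_path(source_type: str, path: str) -> str:
    """Return 'archive', 'sync', or 'file' for a valid path; raise otherwise."""
    prefixes = VALID_PREFIXES.get(source_type)  # Kaggle has no prefix rule
    if prefixes is not None and not path.startswith(prefixes):
        raise ValueError(f"{source_type} paths must begin with one of {prefixes}")
    if path.endswith(ARCHIVE_SUFFIXES):
        return "archive"  # downloaded and extracted before workers start
    if path.endswith("/"):
        return "sync"     # recursive download from the given prefix
    return "file"
```

For example, `check_source_path("AWS", "s3://my-bucket/data.tar.gz")` would report an archive that gets extracted automatically.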

Click Create to start populating the dataset. If you selected any option except Local, the dataset download will take place automatically and the dataset will change to a state of ready when it is complete. If you selected Local, you must connect to the dataset by selecting it and clicking the Connect button to proceed with the data population.

Note on Kaggle Data

Kaggle sources require you to specify whether the data will be populated from a Kaggle competition or a dataset. In the Type field, select Competition if you are downloading the data for a competition you have entered, or Dataset if you are downloading any other public or personal dataset.

You can only download competition data if you have already read and accepted the competition rules on the Kaggle website.

For the Path field, you must specify the short name Kaggle uses for the competition or dataset. The two easiest ways to find this short name are:

  1. The URL path of the competition or dataset you wish to download. For example, if you are viewing this dataset on the 2020 US Election in your web browser, the URL in your address bar is https://www.kaggle.com/unanimad/us-election-2020. To download this dataset into trainML, specify unanimad/us-election-2020 in the Path field — that is, the URL component after www.kaggle.com/. If you are viewing the Mechanisms of Action competition in your web browser, the URL in your address bar is https://www.kaggle.com/c/lish-moa. To download this competition's data into trainML, specify lish-moa in the Path field — that is, the URL component after www.kaggle.com/c/.
  2. The API command shown in the Kaggle web interface. For datasets, click the triple-dot button on the far right side of the Kaggle Dataset menu bar, next to the New Notebook button, and use the Copy API command button. Clicking this for the dataset on the 2020 US Election copies kaggle datasets download -d unanimad/us-election-2020 to your clipboard; to download this dataset into trainML, specify unanimad/us-election-2020 in the Path field — that is, the command component after download -d. For a competition, click the Data tab on the Kaggle Competition menu bar; right above the Data Explorer, it lists the API command to download the dataset. For the Mechanisms of Action competition, you will see kaggle competitions download -c lish-moa; to download this competition's data into trainML, specify lish-moa in the Path field — that is, the command component after download -c.
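The URL-based rule above can be expressed as a small helper. This is a hypothetical sketch (not part of trainML or the Kaggle API) that derives the Type and Path field values from a Kaggle URL:

```python
from urllib.parse import urlparse

def kaggle_short_name(url: str):
    """Derive the (Type, Path) values for the trainML form from a Kaggle URL.

    Competition URLs look like https://www.kaggle.com/c/<short-name>;
    dataset URLs look like https://www.kaggle.com/<owner>/<dataset>.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "c":
        return "Competition", parts[1]
    if len(parts) >= 2:
        return "Dataset", "/".join(parts[:2])
    raise ValueError(f"unrecognized Kaggle URL: {url}")
```

For instance, `kaggle_short_name("https://www.kaggle.com/c/lish-moa")` yields the Competition type with the short name lish-moa.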

Using a Private Dataset

Private datasets can be used by selecting My Dataset from the Dataset Type field in the Data section of the job form. Select the desired dataset from the list and create the job. Once the job is running you can access the dataset in the /opt/trainml/input directory, or using the TRAINML_DATA_PATH environment variable.

Removing a Dataset

Datasets can only be removed once all jobs configured to use them are terminated. To remove a dataset, ensure that its Active Jobs column shows zero, select the dataset, and click the Delete button. Because this action is permanent, you will be prompted to confirm before deletion.

Copyright © 2022 trainML, LLC, All rights reserved