Making Datasets More Flexible and Expanding Environment Options

Persistent Datasets just got even better. Not only can you use the same dataset across many jobs in parallel at no additional charge, now you can attach multiple datasets to a single job for free. If that wasn't enough, you can now dynamically change the datasets attached to any notebook job as your needs evolve through the model development process. Additionally, more options have been added for job base environments, allowing you to save time and storage quota by using specific versions of popular frameworks.

Datasets

Adding Multiple Datasets

When creating a new notebook or training job, click the Add Dataset button in the Data section of the job form. To add a public dataset, select Public Dataset from the Dataset Type field and then select the dataset from the Dataset field. To add a private dataset, select My Dataset from the Dataset Type field and then select the dataset from the Dataset field. Keep clicking the Add Dataset button until all required datasets have been selected.

If you add a single dataset, the dataset will be mounted to /opt/trainml/input inside the job workers. If you add more than one dataset, each dataset will be mounted into its own directory inside the /opt/trainml/input directory. The directory name will be the name of the dataset, all lower case with spaces converted to underscores. For example, the PASCAL VOC dataset will be mounted to /opt/trainml/input/pascal_voc if it is one of multiple datasets selected.
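The mounting rule above can be sketched in a few lines of Python. This is only an illustration of the naming convention as described, not trainML's implementation: the helper names `dataset_dir_name` and `mount_paths` are hypothetical, and the handling of characters other than spaces is an assumption (they are left unchanged).

```python
from pathlib import PurePosixPath

# Root directory where datasets are mounted inside the job workers.
INPUT_ROOT = PurePosixPath("/opt/trainml/input")


def dataset_dir_name(dataset_name):
    """Lowercase the dataset name and convert spaces to underscores.

    Assumption: characters other than spaces are left as they are.
    """
    return dataset_name.lower().replace(" ", "_")


def mount_paths(dataset_names):
    """Map each attached dataset name to the path where it should appear.

    A single dataset is mounted directly at the input root; multiple
    datasets each get their own subdirectory under the input root.
    """
    if len(dataset_names) == 1:
        return {dataset_names[0]: str(INPUT_ROOT)}
    return {
        name: str(INPUT_ROOT / dataset_dir_name(name))
        for name in dataset_names
    }
```

For example, `mount_paths(["PASCAL VOC"])` maps the dataset to `/opt/trainml/input`, while attaching it alongside a second dataset maps it to `/opt/trainml/input/pascal_voc`. Code that must work in both cases can check whether the expected subdirectory exists before falling back to the input root.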

As before, there is no additional charge to attach a dataset to a job. You only incur storage charges once for each private dataset, regardless of how many jobs it is used in. Public datasets are always free to use.

Changing Datasets For Existing Jobs

To edit the datasets for existing notebook jobs, the job must first be in the stopped status. Once the job is stopped, select it from the list and click the Edit button. Expand the Data section of the edit form to see the currently enabled datasets. Edit this list in the same manner you did to create the job and click Confirm. The job may enter the status of updating while it prepares your datasets. Once the job status returns to stopped, you can restart the job and the new datasets will be attached.

New Environment Options

Job environments determine the software that is preinstalled in the operating environment for the job workers. In addition to our pre-built deep learning environments, you can now select environments with specific versions of popular deep learning frameworks. Selecting the correct environment saves you setup time and minimizes the amount of space required for each worker. The space required by the base environment is free and does not count towards your storage quota, but modifications to the environment do. If your model needs a specific version of PyTorch or Tensorflow, be sure to select the version-specific environment, since downgrading PyTorch or Tensorflow can consume a large amount of space (10GB+).

The Base Environment option in the Environment section of the job form has been expanded into three separate fields:

  1. Python Version: The Python version of the conda environment that forms the base of the environment. All base environments contain a wide variety of popular data science, machine learning, and GPU-acceleration libraries. Only Python 3.6 and 3.7 environments are currently available.

  2. Framework: The primary deep learning framework to be used. If you do not have specific version requirements for your model, select Deep Learning. Otherwise, select the major framework you intend to use to see the available versions.

     • Deep Learning: All supported frameworks (Tensorflow, PyTorch, and MXNet) are installed at the latest version compatible with the selected Python version.
     • PyTorch: Select this option if your model code requires a specific version of PyTorch.
     • Tensorflow: Select this option if your model code requires a specific version of Tensorflow.
     • MXNet: Select this option if your model code requires a specific version of MXNet.

  3. Framework Version: The version of the major framework selected.

These new options are available as environment selections for both notebook and training jobs, and you can set any of these new environments as the default for each job type on the Account Settings page.
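Once a job is running, it can be useful to confirm that the selected environment provides the Python and framework versions your model expects before you change any packages. The following is a minimal sketch that uses only the standard library. The helper names `installed_version` and `environment_report` are hypothetical, and the sketch assumes the frameworks are importable as `torch`, `tensorflow`, and `mxnet`.

```python
import importlib
import sys


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def environment_report(frameworks=("torch", "tensorflow", "mxnet")):
    """Collect the Python version and the version of each framework found."""
    report = {"python": "{}.{}".format(*sys.version_info[:2])}
    for name in frameworks:
        report[name] = installed_version(name)
    return report
```

Calling `environment_report()` in a notebook cell returns a dictionary in which a value of `None` means the framework is not installed in that environment. If the reported version differs from what your model needs, selecting the matching version-specific environment avoids the storage cost of reinstalling the framework yourself.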