
Web (HTTP/FTP) Data Downloads Plus Auto-Extraction of Archives

2 min read

The HTTP input data type is now available for trainML training jobs. This option is ideal for publicly available datasets hosted on public HTTP or FTP servers. If you were previously using wget/curl in your training scripts to download data, this option is for you. Additionally, if you specify a path to an archive as the input storage path, the archive will automatically be extracted before being attached to the workers.

How It Works

Just select HTTP from the Input Data Type field on the job form, and specify the URL of the file or directory you want to download. For example, to download the PASCAL VOC 2012 dataset, enter http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar into the Input data storage path field. When the job starts, it will download that file prior to starting any job workers. Because the URL ends in .tar, the archive will also be automatically extracted into the trainML data path (accessible via the TRAINML_DATA_PATH environment variable) for the workers to access.
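
Once the workers start, your training script can read the extracted data directly from the trainML data path. Here is a minimal sketch of what that might look like; the VOCdevkit/VOC2012 directory layout is an assumption about how the PASCAL VOC archive unpacks, not something the platform guarantees:

```python
import os

# TRAINML_DATA_PATH points at the directory where the downloaded
# (and, for archives, extracted) input data is available to each worker.
data_path = os.environ["TRAINML_DATA_PATH"]

# Inspect what landed in the data path after download and extraction.
for entry in sorted(os.listdir(data_path)):
    print(entry)

# Hypothetical layout: the VOC 2012 tar typically extracts to VOCdevkit/VOC2012.
voc_root = os.path.join(data_path, "VOCdevkit", "VOC2012")
if os.path.isdir(voc_root):
    print("Found VOC 2012 data at", voc_root)
```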

The Web/HTTP storage type supports HTTP, HTTPS, FTP, and FTPS URLs. However, using an FTP URL that requires a password is not recommended, since the only way to supply credentials is to embed them in the URL itself.
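
For reference, embedded FTP credentials use the standard userinfo URL form, for example ftp://username:password@ftp.example.com/data/dataset.tar (the host, path, and credentials here are placeholders). The password appears in plain text anywhere that URL is stored or displayed, which is why this approach is discouraged.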

Automatic Archive Extraction

Both the Web and AWS storage types now support automatic extraction of archives when used as the input data type. If you specify a URL that ends in .tar, .zip, .tar.gz, or .tar.bz2, the downloaded file will automatically be extracted into the trainML data path and then deleted. In the case of AWS, this means the job performs an s3 cp rather than an s3 sync, which can save you API requests. If you specify a URL that ends in /, the AWS storage driver will do an s3 sync from the specified path, while the web storage driver will perform a one-level recursive fetch of that directory on the web/FTP server and then extract any files that end in .tar, .zip, .tar.gz, or .tar.bz2.
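
Conceptually, the extraction step behaves like the sketch below. This is not the actual trainML storage driver, just an illustration of the documented behavior; the file and directory names are placeholders:

```python
import os
import tarfile
import zipfile

def extract_if_archive(file_path: str, data_path: str) -> None:
    """Extract a downloaded archive into the data path, then delete it.

    Mirrors the documented behavior: files ending in .tar, .zip, .tar.gz,
    or .tar.bz2 are unpacked; anything else is left in place untouched.
    """
    if file_path.endswith(".zip"):
        with zipfile.ZipFile(file_path) as zf:
            zf.extractall(data_path)
    elif file_path.endswith((".tar", ".tar.gz", ".tar.bz2")):
        # tarfile's "r:*" mode auto-detects gzip/bzip2 compression
        with tarfile.open(file_path, "r:*") as tf:
            tf.extractall(data_path)
    else:
        return  # not a recognized archive, leave the downloaded file as-is

    os.remove(file_path)  # the archive is deleted after extraction

# Illustrative usage (paths are placeholders, not real trainML internals):
# extract_if_archive("/tmp/VOCtrainval_11-May-2012.tar", os.environ["TRAINML_DATA_PATH"])
```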