Creating a dataset with multiple files

You can add multiple source files to a dataset in the following ways:

  • Add each file one by one

  • Add multiple files at the same time through an archive file

  • Add multiple files via an FTP server

Note that files are limited to 240 MB. If a file is too large, try compressing it. For more information on compressed and uncompressed file formats, see Supported file formats.
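As a rough illustration of the size limit, the following sketch checks a local file against the limit and writes a gzip-compressed copy if needed. It assumes that 240 MB means 240 × 1000 × 1000 bytes and that gzip is an accepted compression format; neither is confirmed here, so check Supported file formats first. The names fits_upload_limit and gzip_file are hypothetical helpers, not part of the platform.

```python
import gzip
import shutil
from pathlib import Path

# Assumption: "240 MB" is read as 240 * 1000 * 1000 bytes.
# The platform's exact definition of the limit may differ.
MAX_UPLOAD_BYTES = 240 * 1000 * 1000


def fits_upload_limit(path, limit=MAX_UPLOAD_BYTES):
    """Return True if the file at `path` is no larger than `limit` bytes."""
    return Path(path).stat().st_size <= limit


def gzip_file(path):
    """Write a gzip-compressed copy next to `path` and return the new path."""
    source = Path(path)
    target = source.with_name(source.name + ".gz")
    with source.open("rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

A typical use would be to call fits_upload_limit on the original file, and only if it returns False, call gzip_file and check the compressed copy the same way.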

Sourcing files one by one

With this method, you add files to the platform one at a time, and each added file becomes its own source. Because a new source is created for each file, the files can have different formats.

For more information, see Supported file formats.

When uploading files one by one, the first file added determines the data schema. If subsequent files contain fields that do not match those in the first file, those fields are ignored by the platform.

  1. Create a dataset using your first file as a source

  2. From the Sources tab of the dataset, click the Add a source button

  3. Add the next file

Be careful when deleting files from a dataset created with multiple files, especially those with different data schemas. If the first file is deleted, the whole dataset will appear as empty.
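Before adding files one by one, it can help to see which fields in the later files have no counterpart in the first file. The sketch below is a local check only, assuming UTF-8 CSV files with a header row and assuming that fields are matched by exact column title; the platform's actual matching rules may differ. The function names are hypothetical.

```python
import csv


def read_header(path, delimiter=","):
    """Return the first row of a CSV file as a list of field names."""
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle, delimiter=delimiter), [])


def unmatched_fields(first_file, other_files, delimiter=","):
    """Map each later file to the fields it has that the first file lacks."""
    reference = set(read_header(first_file, delimiter))
    report = {}
    for path in other_files:
        # Assumption: a field "matches" when its title is identical.
        extra = [name for name in read_header(path, delimiter)
                 if name not in reference]
        if extra:
            report[path] = extra
    return report
```

An empty result means every field in the later files also exists in the first file; column order is not compared.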

Sourcing multiple files within an archive

With this method, you add several files at the same time through an archive file, which creates a single source for all the added files. Because only one source is created for all the files, they must all have the same format.

For more information, see Supported file formats.

With this method, the platform chooses the file with the oldest modification time to determine the data schema.
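To anticipate which file would set the schema, you can inspect the timestamps stored in an archive. The sketch below assumes a ZIP archive and assumes that the platform relies on the modification times recorded inside it, which is not confirmed here. The function name oldest_member is hypothetical.

```python
import zipfile


def oldest_member(archive_path):
    """Return the name of the file entry with the earliest stored timestamp.

    Raises ValueError if the archive contains no file entries.
    """
    with zipfile.ZipFile(archive_path) as archive:
        # Skip directory entries; only files can define a schema.
        files = [info for info in archive.infolist() if not info.is_dir()]
        # date_time is a (year, month, day, hour, minute, second) tuple,
        # so tuples compare in chronological order.
        return min(files, key=lambda info: info.date_time).filename
```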

  1. Create an archive file with the files to add to the same dataset.

  2. In Catalog > Datasets, click on the New dataset button.

  3. Add the archive file as a source, using one of the three available methods under the Retrieve a file section. For more information, see Retrieving a file from your computer, a URL, or an FTP server.

  4. From the preview of the first 20 records that opens, configure the source.

  5. Configure the dataset information or use the prefilled values.
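Step 1 above can be scripted. The following is a minimal sketch that packs files into a ZIP archive and refuses to mix file extensions, using the extension as a rough proxy for "same format". It assumes ZIP is an accepted archive format (see Supported file formats); build_archive is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def build_archive(files, archive_path):
    """Pack `files` into a ZIP archive, refusing mixed file extensions."""
    paths = [Path(f) for f in files]
    # Assumption: the file extension is a good enough proxy for the format.
    extensions = {p.suffix.lower() for p in paths}
    if len(extensions) > 1:
        raise ValueError(f"Mixed formats in one archive: {sorted(extensions)}")
    with zipfile.ZipFile(archive_path, "w",
                         compression=zipfile.ZIP_DEFLATED) as archive:
        for p in paths:
            # Store each file at the archive root under its own name.
            archive.write(p, arcname=p.name)
    return Path(archive_path)
```

The resulting archive can then be added as a source in step 3.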

Sourcing multiple files stored on an FTP server

This method consists of connecting the platform to the directory of an FTP server in order to retrieve all the files contained in this directory.

All the files in the directory must have the same format and schema (for example, CSV files with the same column titles). Note also that if the URL points to a directory containing a compressed file, that file is imported into the platform as is and is not unzipped.
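If you stage the files locally before placing them on the FTP server, a quick check such as the following sketch can flag files that break these constraints. It assumes UTF-8 CSV files with a header row, and it uses a hypothetical list of extensions to detect compressed files; adjust both to your own case.

```python
import csv
from pathlib import Path

# Assumption: these extensions stand in for "compressed file".
COMPRESSED_SUFFIXES = {".zip", ".gz", ".bz2", ".7z"}


def check_directory(directory, delimiter=","):
    """Return a list of problems found among the files in `directory`."""
    problems = []
    reference = None  # header of the first CSV file encountered
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix in COMPRESSED_SUFFIXES:
            problems.append(f"{path.name}: compressed file, would not be unzipped")
            continue
        if suffix != ".csv":
            problems.append(f"{path.name}: not a CSV file")
            continue
        with path.open(newline="", encoding="utf-8") as handle:
            header = next(csv.reader(handle, delimiter=delimiter), [])
        if reference is None:
            reference = header
        elif header != reference:
            problems.append(f"{path.name}: header differs from the first CSV file")
    return problems
```

An empty list means all files in the directory are CSV files sharing the same column titles in the same order.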

  1. In Catalog > Datasets, click on the New dataset button.

  2. In the wizard that opens, select From an FTP server under the Retrieve a file section.

  3. Configure your FTP connection.

    • FTPS servers are supported for this method (for example, ftps://login:password@example.org/my_directory/my_dataset).

    • When synchronizing from a remote FTP location, Opendatasoft keeps a persistent cache and does not automatically prune files that are missing from the remote directory. To clean up the cache, click Clean cache to the right of the resource.

  4. From the preview of the first 20 records that opens, configure the source.

  5. Configure the dataset information or use the prefilled values.
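The FTPS URL shown in step 3 can be assembled programmatically. The sketch below percent-encodes the login and password so that characters such as "@" or ":" do not break the URL structure. Whether the platform accepts percent-encoded credentials is an assumption to verify; ftps_url is a hypothetical helper name.

```python
from urllib.parse import quote


def ftps_url(host, directory, login, password):
    """Build an ftps:// URL with percent-encoded credentials and path."""
    # safe="" forces reserved characters such as "@" and ":" to be encoded.
    userinfo = f"{quote(login, safe='')}:{quote(password, safe='')}"
    # Keep "/" unescaped so the directory hierarchy is preserved.
    path = quote(directory.strip("/"), safe="/")
    return f"ftps://{userinfo}@{host}/{path}"


print(ftps_url("example.org", "my_directory/my_dataset", "login", "password"))
# prints "ftps://login:password@example.org/my_directory/my_dataset"
```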

Note that when you upload files to this FTP folder, only the data from new files will be taken into account and loaded.