Join datasets processor

Edited

This processor allows you to join two datasets together based on a field or fields common to both datasets.

Note that this processor is no longer limited by the number of records in the distant dataset! You can now easily cross-reference your own datasets with large distant datasets, such as France’s SIREN (a directory of ID codes for every business in the country).

For example, imagine you want to enrich dataset A with data from dataset B. If both datasets have a field in common, it's possible to join them together, and thus be able to combine other data from dataset B with those in dataset A.

Note that if you use this processor to join your dataset with another that contains geoshapes, this can considerably increase how much memory your dataset uses. If you require geoshapes to produce a map but want to avoid this issue, remember that you can create your map in Studio, where Opendatasoft provides you with options for geographic data.

Configuring the processor

Set the parameters of the Join datasets processor, according to the information in the table below.

Label

Description

Mandatory

Dataset

Dataset used for the join. It can be selected from the portal's datasets or from Opendatasoft's network.

Yes

Local keys

Local field for the records you wish to use to join with the other dataset. More than one key can be specified.

Yes

Remote keys

Remote field used to identify the matching records in the remote dataset. More than one key can be specified.

Yes

Retrieve all fields on publish

Check this box to retrieve all the fields from the remote dataset

No

Output fields

List of fields to retrieve from the remote dataset

Yes

Output fields prefix

You can add a prefix to be added to the output fields. This allows you to make several joins with the same distant dataset and retrieve the same distant fields. See below for an example.

No

Case sensitive

Set to retrieve the field name without accents and in lower case format. Selected by defaut.

No

Multivalued and Separator

In some cases, the remote dataset may contain more than one row that matches the local key. In that case, you can decide to have them be listed in a single line, separated by the symbol you decide. By default, this is "/".

No

Reprocess all records on schedule

Triggers a full re-processing of all records upon publication if the remote dataset has changed (if not selected, the incremental mode is used by default).
This option ensures that all records are joined with the most recent version of the remote dataset, which is practical when the remote dataset is regularly updated.

This option is only available if a scheduler has been set for the dataset.

No

Example of a simple join

Dataset A: list of taxi stations in Paris (dataset name: paris_taxis_stations).

station_id

station_name

station_address

1

Tour Eiffel

69 quai Branly, 75007 Paris

2

Rennes - Montparnasse

1 place du dix huit Juin 1940, 75006 Paris

Dataset B: number of taxis waiting per station in Paris.

station_id

number

1

10

2

15

Dataset A after being enriched with a field from dataset B, using the Join dataset processor:

station_id

number

station_name

station_address

1

10

Tour Eiffel

69 quai Branly, 75007 Paris

2

15

Rennes - Montparnasse

1 place du dix huit Juin 1940, 75006 Paris

Example of a join with "Multivalued" parameter set

Dataset A contains two rows for the first station:

station_id

station_name

station_address

1

Tour Eiffel

69 quai Branly, 75007 Paris

1

Quai Branly

69 quai Branly, 75007 Paris

2

Rennes - Montparnasse

1 place du dix huit Juin 1940, 75006 Paris

If "Multivalued" is set (with "|" set as separator), the join will result in:

station_id

number

station_name

station_address

1

10

Tour Eiffel|Quai Branly

69 quai Branly, 75007 Paris|69 quai Branly, 75007 Paris

2

15

Rennes - Montparnasse

1 place du dix huit Juin 1940, 75006 Paris

However, if "Multivalued" is not set, the join will result in:

station_id

number

station_name

station_address

1

10

Tour Eiffel

69 quai Branly, 75007 Paris

1

10

Quai Branly

69 quai Branly, 75007 Paris

2

15

Rennes - Montparnasse

1 place du dix huit Juin 1940, 75006 Paris

An example of using the Output fields prefix option

You may wish to make several joins in a single dataset based on the same field. In the example below, we have a dataset listing the terminus stations (start and end) for each Paris Metro line.

line_id

station_name_start

station_name_end

4

Bagneux

Porte de Clignancourt

Let's imagine that for each Metro line, I wish to enrich my dataset with each terminus station's address. I have access to a second dataset with the addresses for each station in the entire network.

station_id

station_name

station_address

3

Bagneux

1 rue du Verdun, 92220 Bagneux

5

Porte de Clignancourt

83 Bd Ornano, 75018 Paris

To retrieve the address of the start station, we create a join with the second dataset using "station_name" and "station_name_start" as the join keys, with "station_address" as the output field.

But when I create a second join to retrieve the address of the end station, I will get the same output field "station_address"—and this will overwrite the previous one. Instead, I can use the "Output fields prefix" option to distinguish the two consecutive output fields.

In the first join I add the prefix "START_", and in the second I add "END_". In this way, I can perform two otherwise identical joins, and distinguish them in my final dataset.

line_id

station_name_start

station_name_end

START_station_address

END_station_address

4

Bagneux

Porte de Clignancourt

69 quai Branly, 75007 Paris

1 place du dix huit Juin 1940, 75006 Paris

Can I join fields with different types?

Yes. Ideally, you should match field types to avoid any unexpected behavior. However, integers, decimals, and text fields containing numerical values can be joined together.

Note that in the past, text fields did not take into account zeros at the beginning of a value, so that the text value "01262" would match with the integer value "1262". This is no longer the case.

When a text field value begins with a zero, the zero is taken into account in a join with a numerical value.