Join datasets processor

This processor allows you to join two datasets together based on a field or fields common to both datasets.

Note that this processor is no longer limited by the number of records in the distant dataset! You can now easily cross-reference your own datasets with large distant datasets, such as France’s SIREN (a directory of ID codes for every business in the country).

For example, imagine you want to enrich dataset A with data from dataset B. If both datasets have a field in common, it's possible to join them together, and thus be able to combine other data from dataset B with those in dataset A.

Note that if you use this processor to join your dataset with another that contains geoshapes, this can considerably increase how much memory your dataset uses. If you require geoshapes to produce a map but want to avoid this issue, remember that you can create your map in Studio, where Opendatasoft provides you with options for geographic data.

Configuring the processor

Set the parameters of the Join datasets processor, according to the information in the table below.

Label	Description	Mandatory
Dataset	Dataset used for the join. It can be selected from the portal's datasets or from Opendatasoft's network.	Yes
Local keys	Local field for the records you wish to use to join with the other dataset. More than one key can be specified.	Yes
Remote keys	Remote field used to identify the matching records in the remote dataset. More than one key can be specified.	Yes
Retrieve all fields on publish	Check this box to retrieve all the fields from the remote dataset	No
Output fields	List of fields to retrieve from the remote dataset	Yes
Output fields prefix	You can add a prefix to be added to the output fields. This allows you to make several joins with the same distant dataset and retrieve the same distant fields. See below for an example.	No
Case sensitive	Determines how the join keys are compared. Checked by default. Uncheck this parameter to compare the join keys records in lower case format. But we recommend disabling case sensitive comparisons only when join keys are text fields that may contain mixed case data (e.g., names, descriptions). Case insensitive mode is slower and more resource-intensive, and is not needed for numeric values or standardized codes.	No
Multivalued and Separator	In some cases, the remote dataset may contain more than one row that matches the local key. In that case, you can decide to have them be listed in a single line, separated by the symbol you decide. By default, this is "/".	No
Reprocess all records on schedule	Triggers a full re-processing of all records upon publication if the remote dataset has changed (if not selected, the incremental mode is used by default). This option ensures that all records are joined with the most recent version of the remote dataset, which is practical when the remote dataset is regularly updated. This option is only available if a scheduler has been set for the dataset.	No

Example of a simple join

Dataset A: list of taxi stations in Paris (dataset name: paris_taxis_stations).

station_id	station_name	station_address
1	Tour Eiffel	69 quai Branly, 75007 Paris
2	Rennes - Montparnasse	1 place du dix huit Juin 1940, 75006 Paris

Dataset B: number of taxis waiting per station in Paris.

station_id	number
1	10
2	15

Dataset A after being enriched with a field from dataset B, using the Join dataset processor:

station_id	number	station_name	station_address
1	10	Tour Eiffel	69 quai Branly, 75007 Paris
2	15	Rennes - Montparnasse	1 place du dix huit Juin 1940, 75006 Paris

Example of a join with "Multivalued" parameter set

Dataset A contains two rows for the first station:

station_id	station_name	station_address
1	Tour Eiffel	69 quai Branly, 75007 Paris
1	Quai Branly	69 quai Branly, 75007 Paris
2	Rennes - Montparnasse	1 place du dix huit Juin 1940, 75006 Paris

If "Multivalued" is set (with "|" set as separator), the join will result in:

station_id	number	station_name	station_address
1	10	Tour Eiffel\|Quai Branly	69 quai Branly, 75007 Paris\|69 quai Branly, 75007 Paris
2	15	Rennes - Montparnasse	1 place du dix huit Juin 1940, 75006 Paris

However, if "Multivalued" is not set, the join will result in:

station_id	number	station_name	station_address
1	10	Tour Eiffel	69 quai Branly, 75007 Paris
1	10	Quai Branly	69 quai Branly, 75007 Paris
2	15	Rennes - Montparnasse	1 place du dix huit Juin 1940, 75006 Paris

An example of using the Output fields prefix option

You may wish to make several joins in a single dataset based on the same field. In the example below, we have a dataset listing the terminus stations (start and end) for each Paris Metro line.

line_id	station_name_start	station_name_end
4	Bagneux	Porte de Clignancourt

Let's imagine that for each Metro line, I wish to enrich my dataset with each terminus station's address. I have access to a second dataset with the addresses for each station in the entire network.

station_id	station_name	station_address
3	Bagneux	1 rue du Verdun, 92220 Bagneux
5	Porte de Clignancourt	83 Bd Ornano, 75018 Paris

To retrieve the address of the start station, we create a join with the second dataset using "station_name" and "station_name_start" as the join keys, with "station_address" as the output field.

But when I create a second join to retrieve the address of the end station, I will get the same output field "station_address"—and this will overwrite the previous one. Instead, I can use the "Output fields prefix" option to distinguish the two consecutive output fields.

In the first join I add the prefix "START_", and in the second I add "END_". In this way, I can perform two otherwise identical joins, and distinguish them in my final dataset.

line_id	station_name_start	station_name_end	START_station_address	END_station_address
4	Bagneux	Porte de Clignancourt	69 quai Branly, 75007 Paris	1 place du dix huit Juin 1940, 75006 Paris

Can I join fields with different types?

Yes. Ideally, you should match field types to avoid any unexpected behavior. However, integers, decimals, and text fields containing numerical values can be joined together.

Note that in the past, text fields did not take into account zeros at the beginning of a value, so that the text value "01262" would match with the integer value "1262". This is no longer the case.
When a text field value begins with a zero, the zero is taken into account in a join with a numerical value.