Join datasets processor
This processor allows you to join two datasets together based on a field or fields common to both datasets.
Note that this processor is no longer limited by the number of records in the distant dataset! You can now easily cross-reference your own datasets with large distant datasets, such as France’s SIREN (a directory of ID codes for every business in the country).
For example, imagine you want to enrich dataset A with data from dataset B. If both datasets have a field in common, it's possible to join them together, and thus be able to combine other data from dataset B with those in dataset A.
Note that if you use this processor to join your dataset with another that contains geoshapes, this can considerably increase how much memory your dataset uses. If you require geoshapes to produce a map but want to avoid this issue, remember that you can create your map in Studio, where Opendatasoft provides you with options for geographic data.
Configuring the processor
Set the parameters of the Join datasets processor, according to the information in the table below.
Label | Description | Mandatory |
Dataset | Dataset used for the join. It can be selected from the portal's datasets or from Opendatasoft's network. | Yes |
Local keys | Local field for the records you wish to use to join with the other dataset. More than one key can be specified. | Yes |
Remote keys | Remote field used to identify the matching records in the remote dataset. More than one key can be specified. | Yes |
Retrieve all fields on publish | Check this box to retrieve all the fields from the remote dataset | No |
Output fields | List of fields to retrieve from the remote dataset | Yes |
Output fields prefix | You can add a prefix to be added to the output fields. This allows you to make several joins with the same distant dataset and retrieve the same distant fields. See below for an example. | No |
Case sensitive | Set to retrieve the field name without accents and in lower case format. Selected by defaut. | No |
Multivalued and Separator | In some cases, the remote dataset may contain more than one row that matches the local key. In that case, you can decide to have them be listed in a single line, separated by the symbol you decide. By default, this is "/". | No |
Reprocess all records on schedule | Triggers a full re-processing of all records upon publication if the remote dataset has changed (if not selected, the incremental mode is used by default).
| No |
Example of a simple join
Dataset A: list of taxi stations in Paris (dataset name: paris_taxis_stations).
station_id | station_name | station_address |
1 | Tour Eiffel | 69 quai Branly, 75007 Paris |
2 | Rennes - Montparnasse | 1 place du dix huit Juin 1940, 75006 Paris |
Dataset B: number of taxis waiting per station in Paris.
station_id | number |
1 | 10 |
2 | 15 |
Dataset A after being enriched with a field from dataset B, using the Join dataset processor:
station_id | number | station_name | station_address |
1 | 10 | Tour Eiffel | 69 quai Branly, 75007 Paris |
2 | 15 | Rennes - Montparnasse | 1 place du dix huit Juin 1940, 75006 Paris |
Example of a join with "Multivalued" parameter set
Dataset A contains two rows for the first station:
station_id | station_name | station_address |
1 | Tour Eiffel | 69 quai Branly, 75007 Paris |
1 | Quai Branly | 69 quai Branly, 75007 Paris |
2 | Rennes - Montparnasse | 1 place du dix huit Juin 1940, 75006 Paris |
If "Multivalued" is set (with "|
" set as separator), the join will result in:
station_id | number | station_name | station_address |
1 | 10 | Tour Eiffel|Quai Branly | 69 quai Branly, 75007 Paris|69 quai Branly, 75007 Paris |
2 | 15 | Rennes - Montparnasse | 1 place du dix huit Juin 1940, 75006 Paris |
However, if "Multivalued" is not set, the join will result in:
station_id | number | station_name | station_address |
1 | 10 | Tour Eiffel | 69 quai Branly, 75007 Paris |
1 | 10 | Quai Branly | 69 quai Branly, 75007 Paris |
2 | 15 | Rennes - Montparnasse | 1 place du dix huit Juin 1940, 75006 Paris |
An example of using the Output fields prefix option
You may wish to make several joins in a single dataset based on the same field. In the example below, we have a dataset listing the terminus stations (start and end) for each Paris Metro line.
line_id | station_name_start | station_name_end |
4 | Bagneux | Porte de Clignancourt |
Let's imagine that for each Metro line, I wish to enrich my dataset with each terminus station's address. I have access to a second dataset with the addresses for each station in the entire network.
station_id | station_name | station_address |
3 | Bagneux | 1 rue du Verdun, 92220 Bagneux |
5 | Porte de Clignancourt | 83 Bd Ornano, 75018 Paris |
To retrieve the address of the start station, we create a join with the second dataset using "station_name" and "station_name_start" as the join keys, with "station_address" as the output field.
But when I create a second join to retrieve the address of the end station, I will get the same output field "station_address"—and this will overwrite the previous one. Instead, I can use the "Output fields prefix" option to distinguish the two consecutive output fields.
In the first join I add the prefix "START_", and in the second I add "END_". In this way, I can perform two otherwise identical joins, and distinguish them in my final dataset.
line_id | station_name_start | station_name_end | START_station_address | END_station_address |
4 | Bagneux | Porte de Clignancourt | 69 quai Branly, 75007 Paris | 1 place du dix huit Juin 1940, 75006 Paris |
Can I join fields with different types?
Yes. Ideally, you should match field types to avoid any unexpected behavior. However, integers, decimals, and text fields containing numerical values can be joined together.
Note that in the past, text fields did not take into account zeros at the beginning of a value, so that the text value "01262" would match with the integer value "1262". This is no longer the case.
When a text field value begins with a zero, the zero is taken into account in a join with a numerical value.