Join datasets processor
This processor allows you to join two datasets together—assuming they have a common field to be joined.
For example, imagine you want to enrich dataset A with data from dataset B. If datasets A and B have a field in common, it's possible to join these datasets together, and thus be able to use the data from dataset B with those in dataset A.
When applying the join dataset processor to a scheduled dataset, the join is only reapplied when the original resource has updated. If a scheduled dataset A is joined to dataset B, and dataset B has updated but dataset A has not, the scheduled processing will not pull the new values from B.
By default, the Join dataset processor can only be used with remote datasets that have fewer than 100,000 records.Note also that if you use this processor to join your dataset with another that contains geoshapes, this can considerably increase how much memory your dataset uses. If you need geoshapes to produce a map, an alternative is to create your map in Studio, where Opendatasoft can provide you with the necessary geographic data.
Setting the processor
To set the parameters of the Join datasets processor, follow the indications from the table below.
Label | Description | Mandatory |
Dataset | Dataset used for the join. It can be selected from the portal's datasets or from Opendatasoft's network. | Yes |
Local Key | Local field that will be used to identify the corresponding records in the remote dataset. More than one key can be specified. | Yes |
Remote Key | Remote field corresponding to the local key. It can be a list. | Yes |
Output Fields | List of fields to retrieve from the remote dataset | Yes |
Retrieve All Fields | Set to retrieve all the fields from the remote dataset | No |
Case Sensitive | Set to retrieve the field name without accents and in lower case format | No |
One line | In some specific cases, the remote dataset may contain more than one row matching the local key. In which case, you may want to either collapse duplicates (that is, generate a single row which will contain multi-valued fields) or not. | No |
Separator | To fill if One line is set. Specifies the character to use to separate values in the generated field. | No |
Republish all records upon update | Activates scheduling on the dataset, if it is not already activated (see Keeping data up to date). Once scheduling is configured, this option allows an automatic republish when the remote dataset is updated.
| No |
Examples
Simple join
Dataset A: list of taxi stations in Paris (dataset name: paris_taxis_stations).
station_id | station_name | station_address |
1 | Tour Eiffel | 69 quai Branly, 75007 Paris |
2 | Rennes - Montparnasse | 1 place du dix huit Juin 1940, 75006 Paris |
Dataset B: number of taxis waiting per station in Paris.
station_id | number |
1 | 10 |
2 | 15 |
Dataset A after being enriched with a field from dataset B, using the Join dataset processor:
station_id | number | station_name | station_address |
1 | 10 | Tour Eiffel | 69 quai Branly, 75007 Paris |
2 | 15 | Rennes - Montparnasse | 1 place du dix huit Juin 1940, 75006 Paris |
Join with "One line" parameter set
The example datasets used in the following example are the same as in the previous one.
Dataset A contains two rows for the first station:
station_id | station_name | station_address |
1 | Tour Eiffel | 69 quai Branly, 75007 Paris |
1 | Quai Branly | 69 quai Branly, 75007 Paris |
2 | Rennes - Montparnasse | 1 place du dix huit Juin 1940, 75006 Paris |
If One line is set (with |
set as separator), the Join will result in:
station_id | number | station_name | station_address |
1 | 10 | Tour Eiffel|Quai Branly | 69 quai Branly, 75007 Paris|69 quai Branly, 75007 Paris |
2 | 15 | Rennes - Montparnasse | 1 place du dix huit Juin 1940, 75006 Paris |
However, if One line is not set, the Join will result in:
station_id | number | station_name | station_address |
1 | 10 | Tour Eiffel | 69 quai Branly, 75007 Paris |
1 | 10 | Quai Branly | 69 quai Branly, 75007 Paris |
2 | 15 | Rennes - Montparnasse | 1 place du dix huit Juin 1940, 75006 Paris |
Join with different field types
Integers, decimals and text field containing numerical values can be joined together.
Dataset A: With decimal fields and text fields containing numerical values.
insee_code (text) | bloom_competition_result (decimal) |
01262 | 2.0 |
90010 | 4.0 |
57355 | 2.0 |
Dataset B: With integer fields.
bloom_ranks (integer) |
2 |
Dataset A after using Join dataset: the matching between values 2
and 2.0
is successful despite the type difference.
insee_code (text) | bloom_competition_result (decimal) |
01262 | 2 |
57355 | 2 |
It is possible to a second Join dataset processor, such as this dataset C:
city (text) | insee_code (integer) | postal_code (text) |
Montluel | 1262 | 01120 |
Belfort | 90010 | 90000 |
Kalhausen | 57355 | 57412 |
Dataset A after being joined with datasets B and C:
insee_code (text) | bloom_competition_result (decimal) | city (text) | postal_code (text) |
01262 | 2 | Montluel | 01120 |
57355 | 2 | Kalhausen | 57412 |
Even though the insee_code was not in the same type, the matching happened. The matching worked even for the value 1262
in the first dataset (note the absence of leading 0, due to it being an integer value), that matched against the value 01262
in the second dataset. While most column types can be retrieved by using the Join dataset processor, file type columns do not yield the actual resource through the processor, but instead yield the identifier of the underlying resource.