Defining a dataset schema
Updated by Anthony Pépin
Each dataset follows a schema. Fully defining and configuring a dataset schema means:
- Changing a dataset field label
- Adding a description to a dataset field
- Choosing a type for each field
- Setting up relevant fields as facets to define filters for the dataset
- Configuring options for each field, depending on their type and whether they are set up as facets or not
- Ordering the fields in the dataset
- Discarding fields from the dataset
1) Changing a label
The Opendatasoft platform retrieves the default field labels found in the source dataset. However, it is possible to change each dataset field label.
To change a label:
- In the preview area of the Processing tab, select the current label of the field of your choice.
- Type a new label. It can contain special characters.
- Click outside the label area or press Enter for the changes to be taken into account.
Remember that in general you should avoid changing your technical identifiers. See the note below under "Name" for more information.
2) Adding a description to a field
Descriptions can be added to dataset fields for more context or precision.
To add a description to a field:
- In the preview area of the Processing tab, click in the Description text box of the field of your choice.
- Type the related description.
- Click outside the Description text box for the description to be taken into account.
Field descriptions will then be displayed in the front office:
- In the "Data schema" section of the Information tab of the dataset
- In the tooltip of their related fields, in the Table tab
3) Choosing a field type
Fields are characterized by types. Depending on the chosen field type, the platform will process and display records in a specific way.
To choose a type for a field:
- In the preview area of the Processing tab, click on the field type list box of the field of your choice.
- Choose the right type for this field.
There are eight different types: date, datetime, decimal, integer, geopoint, geoshape, text, and file.
Type | Description |
Text | Field values are textual data. |
Integer | Field values are integer numbers. Note that if a value contains a decimal, only the whole number is retained (the decimal value is removed). For example, if the value is 1.9, the resulting integer is 1. |
Decimal | Field values are decimal numbers. Valid separators for the decimal part are |
DateTime | Field values are a combination of a date and a time. The ideal format is the ISO 8601 format, which is The platform will try to guess as accurately as possible the input datetime format. However, in case of bad detection or ambiguity, use the Normalize Date processor to define the parsing format of the datetime field. By default, time records are in the UTC timezone. To change the timezone, use the Set Timezone processor. |
Date | Field values are dates. The ideal format is the ISO 8601 format, which is The platform will try to guess as accurately as possible the input date format. However, in case of bad detection or ambiguity, use the Normalize Date processor to define the parsing format of the date field. |
Geopoint | Field values are a single geographical location expressed in the format If your dataset contains two fields, latitude and longitude, use the Create GeoPoint processor to create a valid geo point field. |
Geoshape | Field values are geographical shapes expressed in GeoJSON. For example: {"type": "LineString", Feature collections are not supported. |
IP address | Field values are IP addresses in the usual IPv4 format, such as 192.0.2.22 (four one- to three-digit numbers separated by periods: 0.0.0.0 to 255.255.255.255). The option is available to "Anonymize IP address," which converts the last number in the address to a 0. Our example address above would therefore become 192.0.2.0. Note that this does not technically anonymize the address, but only renders it less specific. You remain responsible for the availability of personally identifiable data you publish. Note that while you could feasibly use a text field to store IP addresses, doing so would not allow you to "anonymize" the addresses in the way indicated above, and actions that process IP addresses via the API (for example, aggregation by IP address and distinct counts) are faster when the IP address type is used. |
File | This field type is only available in cases where field values are files sourced with one of the available methods to create a dataset with media files (with the File processor, through an archive file, or with a specific extractor), creating a field which default type is file. |
4) Setting up fields as facets
Facets define the filters of a dataset, which are displayed on the left of the dataset's visualization in the front office. These filters have several purposes: they allow the users to find specific, precise records into a dataset, but they also allow the creation of charts afterward (if a field's records are something other than numbers, they will be usable in the Chart Builder and in the Analyze view only if they are set up as facets).
To set up a field as a facet:
- In the preview area of the Processing tab, choose the field you want to set up as a facet. Choose a field relevant enough to become a filter.
- Click the funnel button.
5) Configuring field options
- In the preview area of the Processing tab, click the wheel button of the field of your choice.
- Configuration options change depending on the type of the field and on whether the field is set up as a facet or not. Follow the right instructions depending on the chosen configuration option.
Configuration options available for every field
Configuration option | Description |
Name | Technical identifier of the field. In contrary to the label, the technical identifier does not have an esthetic purpose and thus cannot contain special characters, including spaces. Technical identifiers can be used, for example, when creating a custom tooltip with HTML. Changing the technical identifier of a field could break reuses of the related dataset (custom tooltip, custom tab, or pages). It can break processors that make reference to that identifier. It could also be a problem if the source of the dataset is (regularly) updated: when replacing a source with a newer one, the platform checks the technical identifier of the fields of both sources in order to find a match between the two -then replacing the old data with the new ones. If technical identifiers are not the same anymore, the dataset cannot be updated. To change the technical identifier of a field:
|
Unique ID | Each record is uniquely identified by its identifier, which is by default computed as the fingerprint of all the record's field values. If the Unique ID option is activated for a field, records with the same identifier (or value) are deleted for only the last/oldest one to stay in the dataset. It is most useful for real-time datasets to make sure that instead of adding new records every time the dataset is updated, new values replace the old ones. To activate the Unique ID option of a field: Tick the "Unique ID" box. |
Configuration options for date & datetime fields
Configuration option | Description |
Precision | Degree of precision of the date or datetime field. To define a degree of precision:
For date fields:
For datetime fields:
As for the datetime precisions, the full datetime (hour and minutes) is displayed in the dataset. The difference is in the Analyze view and in the Chart Builder where the degree of precision is available to configure the chart. |
Configuration options for decimals & integer fields
Configuration option | Description |
Unit | Unit of the field, to display along with the decimal or integer value. To define a unit for a field:
|
Enforce the number of decimals to display | The chosen number of decimals for all values of the field. To define a number of decimals for a field:
|
Configuration options for text fields
Configuration option | Description |
Sortable | By default, only numerical fields (decimals and integers) are sortable. This option activates sorting on textual fields. In the Table view of the dataset in the front office, it is then possible to sort text fields alphabetically. To make a field sortable: Tick the "Sortable" box. |
Multivalued | This option is for multivalued records separated by one same separator. Example: France,UK,USA When set up as a facet, each of the field's records values appears as a separate entry in the filters section. When clicking on one of the entry, all the other entries which are not related (meaning the entries which never appear in the same record as part of a multivalues combination) automatically disappear -only the related entries remain as available filter entries. To activate the multivalued option:
|
Hierarchical | This option is for multivalued records, separated by one same separator and which have a hierarchical relation. Example: France/Ile-de-France/Paris When set up as a facet, each first value of each record's multivalues combination appears as a separate entry in the filters section. When clicking on one entry, all second-level values related to that entry appear, and so on. Example: After clicking on the filter entry France, the related second-level entry Ile-de-France appears. After clicking on Ile-de-France, the related third-level entry Paris appears. To activate the hierarchical option:
|
Display only for specific languages | This option is only available for domains that support more than one language. It allows to choose whether a field should be displayed in only specific languages. A common usage would be when your dataset has separate columns for the same information—job titles, for example—in different languages. This way, you can ensure that only the relevant version of that column's data appears in a given language. The configuration of this option affects neither the data itself nor the API. It only affects visualizations in the front office of the platform. To make a field displayed for specific languages only:
|
Configuration options for facets
Configuration option | Description |
Sort facet by | Chosen sorting of the field's entries in the filters section of the front office. To chose a sorting order for a facetted field:
|
Allow multiple selection in filters | By default, after clicking on a filter entry, the others disappear. With the multiple selection option, it is possible to select several filter entries. To allow multiple selection in filters: Tick the "Allow multiple selection in filters" box. |
6) Ordering fields in datasets
Dataset fields can be reordered in their dataset. It can have two kinds of impact:
- In the filters section of the dataset, in the front office. Changing the order of the fields in the dataset also changes the order of the filters. The first facetted field of the dataset becomes the first displayed filter, and so on.
- In case the dataset contains geo shape fields. The Map view cannot display more than one layer of geo shapes. The geo shapes layer displayed by default must thus be defined. To do so, the field containing the geo shapes to be displayed by default on the map must the ordered before all other geo shapes fields.
To reorder a field in a dataset:
- In the preview area of the Processing tab, click on the reorder button of the field you want to reorder in the dataset.
- While maintaining the click on the Reorder button, drag the field to its new position in the dataset.
- Once the field in its new position in the dataset, stop maintaining the click.
7) Discarding fields from datasets
Dataset fields can be discarded from the dataset. It does not mean that the field is completely removed from the dataset but only deleted from the output. This is why, once the dataset is published, the discarded field will not be displayed in any visualization, and if the dataset is exported, the discarded field will not be in the export.
To discard a field from a dataset: Click on the trash button of the field you want to discard from the dataset.
Since discarded fields are not completely removed from the dataset, they can be restored at any time.
To restore a discarded field from a dataset:
- In the preview area of the Processing tab, swipe to the right to go to the last fields of the datasets.
- The discarded fields of the dataset appear at the very end of the dataset. They look like blank, grey columns named by their technical identifiers. Find those you want to restore.
- Click on the reorder button.