Defining a dataset schema

Every dataset has a schema, which defines the kind of data your dataset contains and how it's organized.

When you create a dataset from an existing data source, the platform does its best to interpret and automatically identify the schema. But it's up to you to make sure your dataset's schema is complete and well organized.

This is done on a dataset's Schema tab.

Editing your schema

On the Schema tab, you'll see the list of fields in your dataset.

To edit a field, roll over it and click the pencil icon. (Afterwards, don't forget to click Apply in the lower-right to save any changes you make.)

This will open an interface that allows you to fully define and configure your dataset schema:

  1. Using identifiable labels, and IDs

  2. Adding a description to a dataset field

  3. Defining the appropriate type for each field

You can also reorder the fields by dragging and dropping them using the grip icon on the left, or by using the up and down arrows.

Dataset fields can be "deleted" from the dataset by clicking on the trash can icon.

This does not mean that the field is completely removed from the dataset but only removed from the output. Once the dataset is published, the deleted field will not be displayed in any visualization, and if the dataset is exported, the deleted field will not be in the export.

Discarded fields appear, grayed-out, at the end of the schema.

To restore a discarded field from a dataset, roll over it and click on the circular arrow icon.

1) Using identifiable labels, and IDs

When it can, the platform retrieves the field labels from a source dataset.

Nevertheless, we encourage you to take the time to write explicit, well-worded labels. Labels are what is visible in the portal, so to make sure a wider audience can understand the data, use simple terms rather than business-specific vocabulary where possible.

To change a label, enter the correct value under "Field."

Note that the field's label and its technical ID are not the same thing. In general, you should avoid changing your technical IDs. If you do change them, note that they should not contain any special characters.

Changing the technical identifier of a field could break reuses of the related dataset (custom tooltip, custom tab, or pages). It can also break processors that reference that identifier. It could also be a problem if the source of the dataset is regularly updated: when replacing a source with a newer one, the platform checks the technical identifiers of the fields of both sources in order to find a match between the two. If the technical identifiers are no longer the same, the dataset is not updated.

"Unique ID" toggle: Each record is uniquely identified by its identifier, which by default is computed as the fingerprint of all the record's field values. If the Unique ID option is activated for a field, records that share the same value for that field are treated as duplicates, and only the last one stays in the dataset. This is most useful for real-time datasets: instead of adding new records every time the dataset is updated, new values replace the old ones.

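To make the Unique ID behavior more concrete, here is a minimal Python sketch. It is not the platform's implementation: the SHA-1 fingerprint, the `record_id` and `deduplicate` helpers, and the sample records are all assumptions made for illustration, and it assumes that the most recently processed record is the one that stays.

```python
import hashlib
import json


def record_id(record, unique_fields=None):
    """Return an illustrative identifier for a record.

    Without unique fields, the identifier is a fingerprint of every value.
    With unique fields, only those fields contribute to the identifier.
    """
    if unique_fields:
        basis = {name: record[name] for name in unique_fields}
    else:
        basis = record
    payload = json.dumps(basis, sort_keys=True, default=str)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def deduplicate(records, unique_fields=None):
    """Keep one record per identifier; later records replace earlier ones."""
    kept = {}
    for record in records:
        kept[record_id(record, unique_fields)] = record
    return list(kept.values())


# Hypothetical real-time readings: station A reports twice.
readings = [
    {"station": "A", "temp": 12.1},
    {"station": "B", "temp": 9.4},
    {"station": "A", "temp": 13.0},
]

without_unique = deduplicate(readings)                            # 3 records
with_unique = deduplicate(readings, unique_fields=["station"])    # 2 records
```

With no unique field, the two readings from station A differ in at least one value, so both are kept. With `station` as the unique field, the newer reading replaces the older one.
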
2) Adding a description

Descriptions can be added to dataset fields for more context or information.

To add a description, enter it under "Description (optional)."

3) Choosing a field type

Fields are characterized by types. Depending on the chosen field type, the platform will process and display its records in a specific way.

To choose a type, select it from the list under "Type." There are ten different types: text, integer, boolean, double, datetime, date, geo point, geo shape, IP address, and file.

Depending on the type, you're able to further define the field.

Type

Description

Text

Field values are textual data.

Two toggles allow you to specify if the values are multi-valued or hierarchical:

The "multivalued" option is for fields whose values hold several items separated by the same separator. Example: France,UK,USA When set up as a filter, each individual value appears as a separate entry in the filters section. When you click one of the entries, all unrelated entries (meaning the entries that never appear in the same record as the selected one) automatically disappear; only the related entries remain available as filter entries.

The "hierarchical" option is for multivalued fields whose values are separated by the same separator and have a hierarchical relation. Example: France/Ile-de-France/Paris When used as a filter, the first value of each combination appears as a separate entry in the filters section. When you click an entry, all second-level values related to that entry appear, and so on. Example: After clicking on the filter entry France, the related second-level entry Ile-de-France appears. After clicking on Ile-de-France, the related third-level entry Paris appears.
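
The two text options can be pictured with a short Python sketch. The helper names (`split_multivalued`, `hierarchy_levels`, `related_entries`) and the sample records are hypothetical; this only illustrates the splitting and filtering logic described above, not how the platform implements it.

```python
def split_multivalued(value, separator=","):
    """Split a multivalued text value into its individual entries."""
    return [part.strip() for part in value.split(separator) if part.strip()]


def hierarchy_levels(value, separator="/"):
    """Return the cumulative path for each level of a hierarchical value."""
    parts = split_multivalued(value, separator)
    return [separator.join(parts[: i + 1]) for i in range(len(parts))]


def related_entries(records, selected, separator=","):
    """Return the filter entries that appear in a record alongside `selected`."""
    related = set()
    for raw in records:
        values = split_multivalued(raw, separator)
        if selected in values:
            related.update(values)
    return related


countries = split_multivalued("France,UK,USA")
levels = hierarchy_levels("France/Ile-de-France/Paris")

# Hypothetical field values from three records.
still_visible = related_entries(["France,UK", "UK,USA", "Spain"], "France")
```

Here `still_visible` contains only France and UK: USA and Spain never share a record with France, so they would disappear from the filter.
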

Integer

Field values are integer numbers. Note that if a value contains a decimal, only the whole number is retained (the decimal value is removed). For example, if the value is 1.9, the resulting integer is 1.

You may define the unit from the list.
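
As a quick illustration of the truncation rule, the following Python sketch drops the decimal part rather than rounding. The `to_integer` helper is hypothetical, and the handling of negative values (truncation toward zero) is an assumption, since only the positive case is described above.

```python
import math


def to_integer(raw):
    """Keep only the whole-number part of a value; no rounding is applied."""
    return math.trunc(float(raw))


truncated = to_integer("1.9")   # the decimal part is dropped, not rounded up
```
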

Boolean

A true or false value.

Double

For decimal figures. Valid separators for the decimal part are . or ,. To display a fixed number of decimal places, toggle "Enforce number of decimals to display."

You may define the unit from the list.
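
The sketch below shows one way to read a decimal value written with either separator and then display it with a fixed number of decimals. The `parse_double` and `display` helpers are hypothetical, and the sketch assumes values contain no thousands separators.

```python
def parse_double(raw):
    """Parse a decimal written with either '.' or ',' as the decimal separator."""
    return float(raw.strip().replace(",", "."))


def display(value, decimals):
    """Format a number with a fixed number of decimals."""
    return f"{value:.{decimals}f}"


with_comma = parse_double("3,14159")
with_point = parse_double("3.14159")
shown = display(with_comma, 2)
```

Both spellings parse to the same number, and enforcing two decimals only changes how it is shown, not the stored value.
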

DateTime

Field values are a combination of a date and a time. The ideal format is the ISO 8601 format, which is YYYY-mm-ddTHH:MM:ss+00:00, YYYY-mm-ddTHH:MM:ssZ, or YYYYmmddTHHMMssZ. Other formats are also understood by the platform, such as YYYY-mm-dd-HH:MM:ss or YYYY-mm-dd HH:MM:ss.

The platform will try to guess as accurately as possible the input datetime format. However, in case of bad detection or ambiguity, use the Normalize Date processor to define the parsing format of the datetime field.

By default, time records are in the UTC timezone. To change the timezone, use the Set Timezone processor.
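
If you want to check your source values before importing them, a sketch like the following tries each of the formats listed above in turn. It uses Python's standard `datetime` module; the `parse_datetime` helper and the format list are assumptions for illustration, and the platform's own detection may differ. Values without an offset are treated as UTC, in line with the default described above.

```python
from datetime import datetime, timezone

# Formats mentioned above, expressed as strptime patterns.
# %z accepts both "+00:00" and "Z" (Python 3.7+).
FORMATS = [
    "%Y-%m-%dT%H:%M:%S%z",   # 2024-03-01T12:30:00+00:00 or ...Z
    "%Y%m%dT%H%M%S%z",       # 20240301T123000Z
    "%Y-%m-%d-%H:%M:%S",     # 2024-03-01-12:30:00
    "%Y-%m-%d %H:%M:%S",     # 2024-03-01 12:30:00
]


def parse_datetime(raw):
    """Try each known format and return a UTC datetime."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            # No offset in the value: assume UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    raise ValueError(f"Unrecognized datetime format: {raw!r}")


samples = [
    "2024-03-01T12:30:00+00:00",
    "2024-03-01T12:30:00Z",
    "20240301T123000Z",
    "2024-03-01-12:30:00",
    "2024-03-01 12:30:00",
]
parsed_samples = [parse_datetime(raw) for raw in samples]
```

All five sample strings describe the same instant, so they parse to the same value.
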

You may define the precision from the list (hour, minute).

Whichever precision you choose, the full datetime (hour and minutes) is displayed in the dataset. The precision only makes a difference in the Analyze view and in the Chart Builder, where it determines the degree of precision available when configuring a chart.

Date

Field values are dates. The ideal format is the ISO 8601 format, which is YYYY-mm-dd. Other formats are also understood by the platform, such as YYYY/mm/dd or dd/mm/YYYY.

The platform will guess the appropriate input date format as accurately as possible. However, in cases where it's ambiguous or incorrectly detected, use the Normalize Date processor to define how the date field should be parsed.
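
A short Python example shows why some dates cannot be detected reliably and need an explicit parsing format. The sample value is made up; the same string is a valid date under both a day-first and a month-first reading.

```python
from datetime import datetime

raw = "03/04/2024"

# Same text, two valid interpretations.
day_first = datetime.strptime(raw, "%d/%m/%Y").date()    # 3 April 2024
month_first = datetime.strptime(raw, "%m/%d/%Y").date()  # 4 March 2024
```

Nothing in the value itself says which reading is correct, which is why the parsing format has to be stated explicitly in such cases.
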

You may define the precision from the list:

  • Year: Only the year of the date is displayed in the dataset

  • Month: Only the month and year of the date are displayed in the dataset

  • Day: The full date (day, month, and year) is displayed in the dataset
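
The effect of the precision setting can be sketched as follows. The `display_date` helper and the exact output formats are assumptions for illustration; the platform may render dates differently depending on the portal's language settings.

```python
from datetime import date

# Hypothetical display patterns for each precision level.
PRECISION_FORMATS = {
    "year": "%Y",
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
}


def display_date(value, precision="day"):
    """Render a date showing only the parts allowed by the precision."""
    return value.strftime(PRECISION_FORMATS[precision])


sample = date(2024, 3, 1)
as_year = display_date(sample, "year")
as_month = display_date(sample, "month")
as_day = display_date(sample, "day")
```
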

Geopoint

Field values are a single geographical location expressed in the format <LAT>,<LON>, for example, 45.8,2.5.

If your dataset contains two fields, latitude and longitude, use the Create GeoPoint processor to create a valid geo point field.
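
If you prefer to prepare the geo point value in your source file, the <LAT>,<LON> format can be built with a few lines of Python. The `make_geopoint` helper is hypothetical, and the range checks are a general sanity check on coordinates rather than a documented platform rule.

```python
def make_geopoint(latitude, longitude):
    """Combine separate latitude and longitude values into '<LAT>,<LON>'."""
    lat, lon = float(latitude), float(longitude)
    if not -90 <= lat <= 90:
        raise ValueError(f"Latitude out of range: {lat}")
    if not -180 <= lon <= 180:
        raise ValueError(f"Longitude out of range: {lon}")
    return f"{lat},{lon}"


point = make_geopoint(45.8, 2.5)
from_text = make_geopoint("45.8", "2.5")
```

Note the order: latitude comes first. Swapping the two values is a common cause of points appearing in the wrong place on a map.
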

Geoshape

Field values are geographical shapes expressed in GeoJSON. For example:

{"type": "LineString",
 "coordinates": [ [100.0, 0.0], [101.0, 1.0] ]}

Feature collections are not supported.
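
Before importing, you may want to check that each value is valid JSON and is not a feature collection. The sketch below does only that; the `check_geoshape` helper is hypothetical and does not validate the coordinates themselves.

```python
import json


def check_geoshape(raw):
    """Parse a GeoJSON string and reject feature collections."""
    shape = json.loads(raw)
    kind = shape.get("type")
    if kind is None:
        raise ValueError("GeoJSON object has no 'type' member")
    if kind == "FeatureCollection":
        raise ValueError("Feature collections are not supported")
    return kind


line = '{"type": "LineString", "coordinates": [[100.0, 0.0], [101.0, 1.0]]}'
kind = check_geoshape(line)
```
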

IP address

Field values are IP addresses in the usual IPv4 format, such as 192.0.2.22 (four one- to three-digit numbers separated by periods: 0.0.0.0 to 255.255.255.255).

A toggle allows you to "Anonymize IP address," which converts the last number in the address to a 0. Our example address above would therefore become 192.0.2.0.

Note that this does not technically anonymize the address, but only renders it less specific. You remain responsible for the availability of personally identifiable data you publish.

Note that while you could feasibly use a text field to store IP addresses, doing so would not allow you to "anonymize" the addresses in the way indicated above, and actions that process IP addresses via the API (for example, aggregation by IP address and distinct counts) are faster when the IP address type is used.
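
The effect of the "Anonymize IP address" toggle can be reproduced with Python's standard `ipaddress` module, which is handy if you want to preview the result on your own data. The `anonymize_ipv4` helper is hypothetical and is only a sketch of the behavior described above.

```python
import ipaddress


def anonymize_ipv4(raw):
    """Replace the last number of an IPv4 address with 0."""
    address = ipaddress.IPv4Address(raw)  # raises ValueError if not valid IPv4
    network = ipaddress.ip_network(f"{address}/24", strict=False)
    return str(network.network_address)


masked = anonymize_ipv4("192.0.2.22")
```

Every address in the same block of 256 ends up with the same masked value, which is why the result is less specific but not truly anonymous.
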

File

This field type is only available when field values are files sourced with one of the available methods for creating a dataset with media files (with the File processor, through an archive file, or with a specific extractor). Each of these methods creates a field whose default type is file.