Filtering data within a dataset

Edited

When searching in datasets with text, it's possible to make more advanced searches using query language. Below are the different terms and methods you can use.

Examples of common searches

The table below gives you examples for the most commonly used kinds of searches.

The records you're searching for

Syntax

Example

What the search returns

Those containing your search term

term

apricot

Returns every record containing the string "apricot"

Those containing a specific search term in a specific field

id_field:term

fruit:apple

Returns every record containing the string "apple" in the column named "fruit"

Those containing several values at once

value1 AND value2

apple AND orange

Returns every record containing both "apple" and "orange"

Those containing at least one of mutiple values

value1 OR value2

apple OR orange

Returns every record containing either "apple" or "orange"

Records that don't contain your search term term

NOT term

NOT banana

Returns every record which does not contain the string "banana"

Records containing an exact term in a specific field

#exact(id_field, "term")

#exact(plant,"fruit tree")

Returns every record containing the exact string "fruit tree" in the column named "plant."

Records where the fields are empty

#null(id_field)

#null(plant)

Returns every record which has no value in the column named "plant"

Records containing a value that starts with a specified string, in a specified field

#startswith(id_field, "string")

#startswith(fruit,"ap")

Returns every record containing a value that starts with "ap" (apple or apricot, say) in the column named "fruit"

Records from before a certain date

id_date_field <= YYYY/MM/DD

timestamp<=2022/09

Returns every record for which the value in the "timestamp" field is equal to or prior to September 2022

Records from a specific number of days before a specified date

id_date_field > #now(days=-value)

timestamp>#now(days=-7)

Returns every record for which the value in the "timestamp" field is equal to the current day minus 7 days

See below for more detail about the "now" condition.

Records located inside a specific area

For a circular area:

#distance("latitude,longitude", distance in meters)

For a polygon or square area:

#polygon(geo_field, "(lat1,lon1), (lat2,lon2), (lat3,lon3)")

#distance("48.866667,2.333333",1000)

#polygon(coordinates,"(40.17887331434696,-7.3828125),(52.05249047600099,-7.3828125),(52.05249047600099,16.171875),(40.17887331434696,16.171875),(40.17887331434696,-7.3828125)")

Returns every record located 1 km from the center of Paris

Returns every record located inside the defined polygon

Records containing values that resemble the specified term

#search(id_field, "term")

#search(title, "bok of secret")

Returns records using a fuzzy search. Here, for example, values such as "book of secrets" will be returned, not just "bok of secret"

Query language in more detail

The Opendatasoft query language makes it possible to express complex boolean conditions as a filtering context.

Full-text search

The query language accepts full-text queries.

If a given word or compound is surrounded with double quotes, only exact matches are returned (the search ignores accents and capital letters).

  • film returns results that contain film, films, filmography...

  • "film" only returns the ones containing exactly film.

Boolean expressions

The query language supports the following Boolean operators ANDOR, and NOT.

Parenthesis can be used to group together expressions and alter the default priority model:

  • NOT

  • AND

  • OR

Samples

  • film OR trees

  • (film OR trees) AND paris

Field queries

One of the major features of the query language is to allow per-field filtering. You can use field names as a prefix to your queries to filter the results based on a specific field's value.

The list of available fields corresponds to the available metadata for that dataset. The domain administrator might define a richer metadata template, thus giving access to a richer set of filtering fields. But by default, the following fields are available:

Field name

Description

publisher

The dataset publisher

title

The dataset title

description

The dataset description

license

The dataset license

records_count

The number of records in the dataset

modified

The last modification date of the dataset

language

The language of the dataset (iso code)

theme

The theme of the dataset

keyword

A keyword describing the dataset

references

The references for the dataset

territory

The territory of the dataset

Remember: When searching within a dataset for specific records, the list of available fields depends on the schema of the dataset. To fetch the list of available fields for a given dataset, you may use the search dataset or lookup dataset APIs.

Multiple operators can be used between the field name and the query:

  • :===: Return results whose field exactly matches the given value (granted the fields are of text or numeric type)

  • ><>=<=: Return results whose field values are larger, smaller, larger or equal, smaller or equal to the given value (granted the field is of date or numeric type).

  • [start_date TO end_date]: Queries Records whose date is between start_date and end_date.

Date formats can be specified in different formats: simple (YYYY[[/mm]/dd]) or ISO 8601 (YYYY-mm-DDTHH:MM:SS)

Examples:

  • film_date >= 2002

  • film_date >= 2013/02/11

  • film_date: [1950 TO 2000]

  • film_box_office > 10000 AND film_date < 1965

Query language functions

Advanced functions can be used in the query language. Function names need to be prefixed with a pound sign, or hash symbol: #.

Function name

Description

now

Returns the current date. This function may be called as a query value for a field. When called without an argument, it will evaluate to the current datetime: birthdate >= #now() returns all Records containing a birth date greater or equal to the current datetime. This function can also accept parameters. See below for the #now function available parameters.

null

This function may be called specifying a field name as a parameter. It returns the hits for which no value is defined for the specified field. For example #null(birthdate)

exact

This function makes it possible to search for records with a field exactly matching a given value. For example, #exact(firstname, "Marie") will return records with a field firstname containing exactly "Marie" and nothing else.

attr

This function makes it possible to search for records with a field matching a value in the querying user's SAML attributes. For example, #attr(firstname, user_first_name), performed by a user who has a SAML attribute user_first_name with a value of "Marie", will return records with a field firstname containing "Marie".

Available parameters for the #now function:

  • Years, months, weeks, days, hours, minutes, seconds, microseconds: These parameters add time to the current date.
    For example, #now(years=-1, hours=-1) returns the current date minus a year and an hour.

  • Year, month, day, hour, minute, second, microsecond: can also be used to specify an absolute date.
    For example, #now(year=2001) returns the current time, day, and month for the year 2001.

  • Weekday: Specifies a day of the week. This parameter accepts either an integer between 0 and 6 (where 0 is Monday and 6 is Sunday) or the first two letters of the day (in English) followed by the cardinal of the first week on which to start the query.
    #now(weeks=-2, weekday=1) returns the Tuesday before last.
    #now(weekday=MO(2)) returns Monday after next.

Geo-filtering

Records search accept geo-filter parameters to filter in records that are located in a specific geographical area.

The following parameters may be used.

Parameter name

Description

geofilter.distance

Limits the result set to a geographical area defined by a circle (coordinates of the center of the circle expressed in WGS84 and distance expressed in meters): latitude,longitude,distance: geofilter.distance=48.8520930694,2.34738897685,1000

This query parameter is equivalent to the #distance function described at the beginning of this documentation.

geofilter.polygon

Limits the result set to a geographical area defined by a polygon (coordinates of the points expressed in WGS84 as in (lat1,lon1),(lat2,lon2),(lat3,lon3): geofilter.polygon=(48.883086,2.379072),(48.879022,2.379930),(48.883651,2.386968)