Counting unique values via the Explore API


When analyzing data, you might want to know how many unique values exist for a particular field—for example, how many different species are recorded in a dataset about animals in a given country.

You can do this using the count(distinct) function in the Explore API.

This article explains how to use count(distinct ...), provides examples, and, importantly, explains why the returned result is an approximation.

What is the count(distinct) function?

The count(distinct ...) function lets you calculate the number of unique (non-null) values for a specific field in your dataset. It can be used in both the select and order_by clauses of your query.

Syntax:

count(distinct <field>)

Or, if you want to include null values under a custom label:

count(distinct ifnull(<field>, <alternative_value>))

Examples

Here are some examples, based on a dataset about animals in a given country:

| Use case | Query | Description |
| --- | --- | --- |
| Count how many unique species exist | count(distinct species) | Returns the number of distinct non-null values in the species field |
| Count unique species, treating null as "unknown" | count(distinct ifnull(species, "unknown")) | Counts null values as "unknown" and includes them in the result |

You can also use it with group by:

select=count(distinct species)&group_by=region

This returns the number of unique species per region.
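As a sketch of how such a query could be sent over HTTP, the snippet below builds a request URL for the Explore API v2.1 records endpoint. The domain and dataset id (`animal-species`) are hypothetical placeholders; substitute your own portal and dataset.

```python
from urllib.parse import urlencode

# Hypothetical portal domain and dataset id -- replace with your own.
BASE = "https://my-portal.opendatasoft.com/api/explore/v2.1"
DATASET = "animal-species"  # assumed dataset id, for illustration only

# Count distinct species per region, as in the group_by example above.
params = {
    "select": "count(distinct species)",
    "group_by": "region",
}
url = f"{BASE}/catalog/datasets/{DATASET}/records?{urlencode(params)}"
print(url)
```

The resulting URL can then be fetched with any HTTP client; the response contains one record per region with the distinct count for that region.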

Why is the result only an approximation?

When you use count(distinct ...), the system estimates how many different values exist in the field. But instead of checking every single row—which could be billions of records—it uses a clever shortcut to save time and resources.

How it works

Our platform uses Elasticsearch under the hood. For counting unique values, it uses a fast estimation method called "cardinality aggregation," powered by an algorithm known as HyperLogLog++.

This algorithm is like a statistical shortcut:

Imagine you're trying to count how many different types of birds visit a huge forest.

Instead of watching every bird one by one, you observe a few areas and estimate based on what you see.

It’s fast, and usually very close to the true number—but it’s not perfect.

The results may not be perfectly accurate

  • When the dataset contains a very large number of unique values

  • When values in the field are very random or rare

So why use this method?

  • It works very quickly, even with large datasets

  • It saves memory and computing power

  • It gives a result that is almost always close enough for analysis

    • Exact if the number of distinct values in the field is lower than 200

    • Otherwise, the error is always less than 6%

The estimate produced by the function may therefore differ slightly from the true count, but for most practical uses this difference is not a problem.
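To make the "statistical shortcut" concrete, here is a minimal, simplified HyperLogLog-style estimator in Python. It is an illustration of the general technique only, not the implementation Elasticsearch uses (Elasticsearch's HyperLogLog++ adds further refinements), and the register count and hash function are arbitrary choices for the demo.

```python
import hashlib
import math

def hll_estimate(values, b=10):
    """Rough HyperLogLog sketch: estimate distinct count with 2**b registers."""
    m = 1 << b                 # number of registers (1024 for b=10)
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value (sha1 truncated, for the demo)
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h & (m - 1)      # low b bits select a register
        w = h >> b             # remaining 64-b bits
        # rank = number of leading zero bits in w, plus one
        rank = (64 - b) - w.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    # small-range correction: fall back to linear counting when
    # the estimate is low and some registers are still empty
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        raw = m * math.log(m / zeros)
    return raw
```

With a small number of distinct values the estimate is near-exact (mirroring the "exact below a threshold" behavior described above), while for larger cardinalities it stays within a few percent of the truth while storing only a fixed-size array of registers instead of every value seen.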

In conclusion

  • count(distinct ...) is useful for counting unique values in a dataset

  • It works in select and order_by clauses

  • You can use ifnull(...) to include null values

  • Results are an approximation

  • In most use cases, the result is fast, efficient, and accurate enough