Counting unique values via the Explore API
When analyzing data, you might want to know how many unique values exist for a particular field—for example, how many different species are recorded in a dataset about animals in a given country.
You can do this using the `count(distinct ...)` function in the Explore API. This article explains how to use `count(distinct ...)`, provides examples, and explains why the returned result is an approximation.
What is the `count(distinct ...)` function?
The `count(distinct ...)` function lets you calculate the number of unique (non-null) values for a specific field in your dataset. It can be used in both the `select` and `order_by` clauses of your query.
Syntax:
count(distinct <field>)
Or, if you want to include null values under a custom label:
count(distinct ifnull(<field>, <alternative_value>))
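Since the function is also valid in `order_by`, a combined query might look like the following sketch (the `region` and `species` field names are illustrative, not part of any real dataset):

```
select=region, count(distinct species)&group_by=region&order_by=count(distinct species) desc
```

This would list regions ordered from the most species-diverse to the least.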
Examples
For example, based on a dataset about animals in a given country:
| Use case | Query | Description |
| --- | --- | --- |
| Count how many unique species exist | `select=count(distinct species)` | Returns the number of distinct non-null values in the `species` field |
| Count unique species, treating null as "unknown" | `select=count(distinct ifnull(species, "unknown"))` | Counts null values as `"unknown"` |
You can also use it with `group_by`:
select=count(distinct species)&group_by=region
This returns the number of unique species per region.
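As a practical sketch, such a query can be assembled into a full request URL. The portal domain and dataset id below are placeholders, and the endpoint path is assumed from the Explore API v2.1 conventions:

```python
from urllib.parse import urlencode

# Placeholder domain and dataset id, for illustration only.
base = ("https://example.opendatasoft.com"
        "/api/explore/v2.1/catalog/datasets/animals/records")

params = {
    "select": "count(distinct species) as species_count",
    "group_by": "region",
}

# urlencode percent-encodes the ODSQL expression so it is URL-safe.
url = f"{base}?{urlencode(params)}"
print(url)
```

Letting a library such as `urllib.parse` handle the encoding avoids subtle bugs with characters like parentheses and spaces in ODSQL expressions.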
Why is the result only an approximation?
When you use `count(distinct ...)`, the system estimates how many different values exist in the field. Instead of checking every single row, which could mean billions of records, it uses a statistical shortcut to save time and resources.
How it works
Our platform uses Elasticsearch under the hood. For counting unique values, it uses a fast estimation method called "cardinality aggregation," powered by an algorithm known as HyperLogLog++.
This algorithm is like a statistical shortcut:
Imagine you're trying to count how many different types of birds visit a huge forest.
Instead of watching every bird one by one, you observe a few areas and estimate based on what you see.
It’s fast, and usually very close to the true number—but it’s not perfect.
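To build intuition for this kind of estimation, here is a simplified HyperLogLog-style sketch in Python. It is an illustrative toy, not the actual Elasticsearch implementation: it hashes each value, keeps only a small array of "registers," and derives an estimate from them:

```python
import hashlib
import math

def hll_estimate(values, p=8):
    """Approximate the number of distinct values with m = 2**p registers."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        # Derive a deterministic 64-bit hash of the value.
        h = int.from_bytes(hashlib.md5(str(v).encode()).digest()[:8], "big")
        j = h >> (64 - p)                # first p bits select a register
        w = h & ((1 << (64 - p)) - 1)    # remaining bits
        # Rank = 1-based position of the leftmost 1-bit in w.
        rank = (64 - p) - w.bit_length() + 1
        registers[j] = max(registers[j], rank)
    # Standard bias-correction constant for m >= 128.
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    # Small-range correction: linear counting when many registers are empty.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw
```

Because each register only keeps a maximum, feeding the same value twice changes nothing, so duplicates are ignored for free. The trade-off is memory versus accuracy: with 256 registers the typical error is around 6–7%, and it shrinks as the register count grows.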
The results may not be perfectly accurate:

- When the dataset contains a very large number of unique values
- When values in the field are very random or rare
So why use this method?

- It works very quickly, even with large datasets
- It saves memory and computing power
- It gives a result that is almost always close enough for analysis

In practice, the estimate is:

- Exact if the number of distinct values in the field is lower than 200
- Otherwise, off by less than 6%
So it's possible that the estimate produced by the function may differ slightly from the true count, but for most practical uses this difference is not a problem.
In conclusion
- `count(distinct ...)` is useful for counting unique values in a dataset
- It works in `select` and `order_by` clauses
- You can use `ifnull(...)` to include null values
- Results are an approximation
- In most use cases, the result is fast, efficient, and accurate enough