How to count unique values in Cassandra databases over a given time frame?

Do you want to get a distinct count of records in Apache Cassandra over a time range? Well tough luck, unlike traditional SQL databases, Cassandra does not support COUNT(DISTINCT ...) queries efficiently, especially across time ranges.


What you want to do is use a command such as follows from within your Cassandra database server and export the output to a file.

cqlsh -u <username> -p '<password>' -e "SELECT id_column FROM <database_name>.<table_name> where timestamp_column > timestamp AND timestamp_column < timestamp ALLOW FILTERING" > id_date.txt


Then you can use a linux command such as the following to sort out the unique id columns and remove duplicates

grep -Ev -- '---|id_column|^$' id_date.txt | awk -F'|' '{print $1}' | tr -d ' ' | sort -u > unique_ids.txt

  • grep -Ev → removes headers, separators, and empty lines
  • awk → extracts the first column
  • tr -d ' ' → removes unwanted spaces
  • sort -u → sorts and removes duplicates (your “distinct count” replacement)
  • You can then count those unique columns with a command like this:

    cat unique_ids.txt | wc -l

    And there you have it... A DISTINCT count of the id column over your given time frame!

    0 Comments