Do you want to get a distinct count of records in Apache Cassandra over a time range? Well tough luck, unlike traditional SQL databases, Cassandra does not support COUNT(DISTINCT ...) queries efficiently, especially across time ranges.
What you want to do is use a command such as follows from
within your Cassandra database server and export the output to a file.
cqlsh -u <username> -p '<password>' -e "SELECT id_column FROM <database_name>.<table_name> where timestamp_column > timestamp AND timestamp_column < timestamp ALLOW FILTERING" > id_date.txt
Then you can use a linux command such as the following to sort out the unique id columns and remove duplicates
grep -Ev -- '---|id_column|^$' id_date.txt | awk -F'|' '{print $1}' | tr -d ' ' | sort -u > unique_ids.txt
grep -Ev→ removes headers, separators, and empty linesawk→ extracts the first columntr -d ' '→ removes unwanted spacessort -u→ sorts and removes duplicates (your “distinct count” replacement)
You can then count those unique columns with a command like this:
cat unique_ids.txt | wc -l
And there you have it... A DISTINCT count of the id column over your given time frame!
0 Comments