Basic time-series operations
ClickHouse provides several methods for working with time series data, allowing you to aggregate, group, and analyze data points across different time periods. This section covers the fundamental operations commonly used when working with time-based data.
Common operations include grouping data by time intervals, handling gaps in time series data, and calculating changes between time periods. These operations can be performed using standard SQL syntax combined with ClickHouse's built-in time functions.
We’re going to explore ClickHouse time-series querying capabilities with the Wikistat (Wikipedia pageviews data) dataset:
Let’s populate this table with 1 billion records:
Aggregating by time bucket
The most popular requirement is to aggregate data based on periods, e.g. get the total amount of hits for each day:
We’ve used the toDate()
function here, which converts the specified time to a date type. Alternatively, we can batch by an hour and filter on the specific date:
The toStartOfHour()
function used here converts the given time to the start of the hour.
You can also group by year, quarter, month, or day.
Custom grouping intervals
We can even arbitrary intervals, e.g., 5 minutes using the toStartOfInterval()
function.
Let’s say we want to group by 4-hour intervals.
We can specify the grouping interval using the INTERVAL
clause:
Or we can use the toIntervalHour()
function
Either way, we get the following results:
Filling empty groups
In a lot of cases we deal with sparse data with some absent intervals. This results in empty buckets. Let’s take the following example where we group data by 1-hour intervals. This will out the following stats with some hours missing values:
ClickHouse provides the WITH FILL
modifier to address this. This will fill out all the empty hours with zeros, so we can better understand the distribution over time:
Rolling time windows
Sometimes, we don’t want to deal with the start of intervals (like the start of the day or an hour) but window intervals. Let’s say we want to understand the total hits for a window, not based on days but on a 24-hour period offset from 6 pm.
We can use the date_diff()
function to calculate the difference between a reference time and each record’s time.
In this case, the day
column will represent the difference in days (e.g., 1 day ago, 2 days ago, etc.):