Often times in my work I come across queries like this,
SELECT SUM(count) FROM data WHERE foo = 'bar' AND time >= '2018-08-13 00:00:00+00' AND time < '2018-09-13 00:00:00+00'
which is getting a total count for a month with a filter, and they
take a long time to run because there’s a significant number
of rows in the
data table but only a few have
foo = 'bar'.
We want these queries to execute faster, which is the same as reducing their latency. Remember that latency comes from doing work or waiting. This post is about modifying the work aspect because analytics queries don’t really have problems with waiting (at least in my experience, but there are exceptions). There are two ways to improve how work gets done to speed up your queries:
This post is about #2. These approaches aren’t mutually exclusive so you can always try to apply #1 after the techniques I describe below. I’m also putting this post in a relational database context but these techniques apply generally. There are many other techniques I can talk about but these are the ones that stand out the most from what I learned in the past few years.
Required code changes: None
The first approach to making this query faster is familiar to those used to
SQL databases. Adding an index on
foo and using it for your query
lets you do less work by skipping to only the data you’re interested in.
If you’re using a relational database, you probably won’t need any code changes. You can just add a new index to a table and immediately get benefits. Keep in mind that indexing has downsides like increased space usage and more work required for writes because you’re making a significant change to how data is stored.
Required code changes: Minimal
This approach is less generalized but has been super useful in my work.
You can try to store extra metadata about your data that you can reference
later to reduce the amount of data you need to consider. For example, if you
can store the first time
foo = 'bar', you can use that information to truncate
time ranges in your queries to begin at that time without affecting the result.
I say this has minimal code change requirements since I find that we’re often storing this information anyway, and we just have to start looking it up for other queries.
Required code changes: Significant
The last approach is precomputation. Looking at my query, notice how my time range is a month. If the older data isn’t being updated often, I can just precompute those counts in, for example, 1 day aggregates. That way I can do the work ahead of time and use the results in the future.
I say this has significant code change requirements because it can be tricky to create rollup processes and implement the logic to handle multiple granularities.