What if I told you I could save you $40,000 a day? Would you think I'm crazy?
$40,000 may be an astronomical number, but with what some companies are spending on cloud data warehouse egress costs, the order of magnitude perhaps isn't so far out of the realm of reality.
Regardless of the dollar amount you assign to the problem, the point is this: if you don't take care to craft your data pipelines, you're going to overspend. $40K a day is extreme, but bad data management has a cost. A chorus of data engineers and data practitioners, including people like Chad Sanderson and Lauren Balik, reminds us of this almost daily.
How you build analytical queries - and specifically how you manage the amount of data that they process to get the answers you want - can have a huge impact on your cloud bill. It's critically important that you have robust observability measures to alert you to requests consuming excess data - and the "tax" you're paying to run them.
At Tinybird we always say to work backwards. Build a functional endpoint, then work hard and fast to optimize it so it processes as little data as possible when it's queried in production.
So here is a real - albeit extreme - example of putting that to work, and how it impacts costs.
If video is more your thing, you can watch my talk on this subject right below:
The data
Consider an example data source: An audit log containing 258M rows and nearly 14 GB of product usage data, sorted and indexed by a company_id. This is a very realistic data set: If my product had 10,000 daily active users generating 100 events per day, I'd easily hit that size in less than a year.
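The exact schema isn't shown here, but to make the rest of this concrete, here's a sketch of what the ClickHouse-equivalent DDL behind a data source like this might look like. The column names are assumptions based on the ones referenced later in this post.

```sql
-- Illustrative sketch only: a plausible shape for the audit log data source,
-- expressed as ClickHouse DDL. Every text column is a plain String, and
-- only company_id is in the sorting key.
CREATE TABLE audit_log
(
    company_id     String,
    author_id      String,
    event          String,
    device_OS      String,
    device_browser String,
    datetime       DateTime
)
ENGINE = MergeTree
ORDER BY company_id;
```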
I want to know how many events of a particular type occurred during a particular time period, grouped by country. Seems pretty simple, right?
Here's a diagram of how I might go about that:
I start with the events data, join it to a table containing country data on an author_id, aggregate, then filter to the timestamps and event types I want. Easy peasy, and here's how that looks in a Tinybird Pipe (a chain of nodes containing iterative SQL queries):
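The original Pipe isn't reproduced here, but a minimal sketch of the node chain might look something like the following. The authors table, its columns, the node names, and the filter values are all assumptions for illustration; in a Pipe, each node's SQL can reference the nodes before it by name.

```sql
-- Node 1 (join_all): join every one of the 258M events to its author's country
SELECT e.author_id, e.event, e.datetime, a.country
FROM audit_log AS e
JOIN authors AS a ON a.author_id = e.author_id

-- Node 2 (aggregated): aggregate the entire joined result by country, event, and day
SELECT country, event, toDate(datetime) AS day, count() AS total
FROM join_all
GROUP BY country, event, day

-- Node 3 (endpoint): only now filter to the event type and date range we want
SELECT country, sum(total) AS total
FROM aggregated
WHERE event = 'login'
  AND day BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY country
```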
But hold on a minute. If you took the time to read through those queries, or if you're a savvy data engineer, you probably already spotted the problems. This approach is really inefficient. In fact, this query scans 6.67 GB every time it's called and takes 2.41 s to complete. Yikes.
If I built an application to visualize this result and refreshed it every second, I would spend $40,340.16/day on Tinybird's Pro plan. 😳
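For context on how that number adds up: refreshing once per second means 86,400 requests per day, and assuming a Pro plan processed-data price of roughly $0.07 per GB, 86,400 × 6.67 GB × $0.07 comes out to about $40,340.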
How do we fix it?
Whether you're a data engineer or not, there are some fundamental rules you can follow to not only spend significantly less but also improve the performance of your queries. We've described them in detail in one of our guides, and I'll list them here:
- Rule 1: The best data is the data you don't write.
- Rule 2: The second best data is the data you don't read.
- Rule 3: Sequential reads are 100x faster.
- Rule 4: The less data you process after reading, the better.
- Rule 5: Do complex operations later in the processing pipeline.
My original query breaks all five of these rules. Whoops. So here's how I fixed it:
Filter first, join later
The first and simplest fix is to apply the filters first and do the join later. By doing so, I speed up my query by following rules 1 and 5: instead of writing a massive join to memory, I'm filtering first, so that subsequent results are smaller and the expensive join happens later in the pipeline.
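Sketched in the same style as before (node names and filter values remain illustrative assumptions), the reordered Pipe looks something like this:

```sql
-- Node 1 (filtered_events): apply the cheap filters first,
-- so only the rows we care about flow to the next node
SELECT author_id, event, datetime
FROM audit_log
WHERE event = 'login'
  AND datetime >= '2023-01-01 00:00:00'
  AND datetime <  '2023-02-01 00:00:00'

-- Node 2 (endpoint): join and aggregate the much smaller filtered result
SELECT a.country, count() AS total
FROM filtered_events AS e
JOIN authors AS a ON a.author_id = e.author_id
GROUP BY a.country
```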
While that speed boost should deliver a better frontend UX, I still have a problem: The amount of data I read to complete the query hasn't changed. I'm still spending roughly the equivalent of a 2021 Tesla Model Y every day.
What to do? Let's take a look at the schema.
Indices and data types
If you peek at the schema, you'll see a few problems.
First, I'm using expensive String data types where a LowCardinality(String) is more appropriate, because columns like event, device_OS, and device_browser have only a limited number of options. We can grab back some memory just by assigning appropriate data types.
Second, my sorting keys are incomplete given the filters I'm trying to apply.
Because I'm filtering on the event and datetime columns, I should add them to the sorting keys.
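Expressed again as illustrative ClickHouse-style DDL (the column list and the exact key order are assumptions; the order shown is just one reasonable choice), the revised data source might look like this:

```sql
-- Revised schema sketch: low-cardinality text columns become
-- LowCardinality(String), and the columns the endpoint filters on
-- (event, datetime) join the sorting key so filters hit indexed,
-- sequentially stored data.
CREATE TABLE audit_log
(
    company_id     String,
    author_id      String,
    event          LowCardinality(String),
    device_OS      LowCardinality(String),
    device_browser LowCardinality(String),
    datetime       DateTime
)
ENGINE = MergeTree
ORDER BY (event, datetime, company_id);
```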
By making these changes to the schema of my data source, I'm following rules 2 and 3: reading less data - and sequential data - to filter, since now my filters are applied to indexed columns.
Just making these simple changes has massive cost implications, because now, instead of reading every row in the table, the query can target just the granules that fall within my event and datetime filters.
The resulting query now reads only 76 MB in less than 400 ms. I've cut costs by 100x to just $400 a day, and I've given the frontend UX a nice boost with a much lower response latency.
Materialized views: the ace up our sleeve
So I've just saved $39,600 and it'd be easy to call it a day. But there's one more thing I can do to squeeze as much as I can out of this query.
In Tinybird, materialized views can easily be created using Pipes. Instead of terminating a Pipe in an API endpoint, you can materialize the results, and reap the benefits later. Materialized views update on ingestion in Tinybird, so they're always up to date.
The best part? There's no processing cost to incrementally update the materialized views, and the storage cost is negligible (in this case less than one-thousandth of a cent).
To improve query performance even more, I'll materialize the join and aggregation first. Instead of calling the result below through an API endpoint, I materialize it into a new data source called join_data_mv.
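As a sketch of what that materializing node could look like (the use of aggregate states, the AggregatingMergeTree target, and the exact columns are assumptions consistent with the earlier examples):

```sql
-- Materializing node: pre-join and pre-aggregate at ingestion time.
-- countState() stores an aggregate state so that join_data_mv
-- (an AggregatingMergeTree data source) can be updated incrementally
-- as new events arrive.
SELECT
    a.country AS country,
    e.event AS event,
    toDate(e.datetime) AS day,
    countState() AS total
FROM audit_log AS e
JOIN authors AS a ON a.author_id = e.author_id
GROUP BY country, event, day
```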
Now, I can query that materialized view instead of the original data source:
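Again as a sketch consistent with the assumptions above, where countMerge() folds the stored aggregate states back into plain counts:

```sql
-- Endpoint node: read the small, pre-aggregated materialized view
SELECT country, countMerge(total) AS total
FROM join_data_mv
WHERE event = 'login'
  AND day BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY country
```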
The final result scans only 333 kB in about 1 ms.
I got the same result as my first query, but with a blazing fast frontend and a bill of just $18/day.
Obviously, compared to the earlier reduction from $40K to $400, this might seem like only an incremental gain. But this isn't just about what I saved on this query. It's about what I'll save on this query and every query I build after it. By implementing simple best practices and using materialized views, I can not only justify my value as a data practitioner but also have a real shot at generating a meaningful ROI from this analysis.
Do this enough, and maybe your company will just go ahead and buy you a Tesla to say thanks.