When we speak about digital sustainability, we often talk in terms of decreasing resource utilisation. Among these resources we mainly count memory and CPU as the drivers of electricity usage. However, one aspect that often gets neglected is data and storage usage.
Writing to and reading from storage also consumes a certain amount of electricity, which flies under the radar, hidden behind the all-time favourite memory and the nowadays hot-topic CPU/GPU resources. One reason is that the amount of electricity used is not as massive as with CPUs and GPUs. Even though it's not massive, it's still there, and it's important!
Having this in mind led me to a realisation - OMG, we did this storage optimisation on the project I'm currently working on, by introducing compression on our database tables! And it was quite a big decrease, which I'll describe later on.
So, why don't I write down what I've learned on the matter? Yeah, why not. It will also be a good opportunity to revisit my knowledge of the basics of TimescaleDB.
This is how we did it - the compressed version. Pun intended.
Setting the stage
As the database solution on our project, we use the tried and tested PostgreSQL. Now, without going into too much detail about the data stored there, running plain PostgreSQL for the kind of data we have (mostly metrics) is not optimal. Without wanting to do too much rework, data restructuring, and schema refinement, or to move to a more appropriate solution, we decided - let's try TimescaleDB on top of PostgreSQL.
This is where simplicity threw its arms in the air and went out of the room.
All jokes aside, we did reduce the performance hit quite a bit by using TimescaleDB, even though it's another layer of complexity on top of the existing one(s). But more on that later.
What is TimescaleDB?
As usual, let's start from the beginning.
In short - TimescaleDB is an open-source PostgreSQL extension that adds time-series functionality to a PostgreSQL database. It is built on PostgreSQL and has full SQL support.
If you want your tables in PostgreSQL to be automatically partitioned, cleaned up, compressed, or aggregated - TimescaleDB is the way to go!
On our project, the first reason we started investigating TimescaleDB was the complexity of our data retention scripts on PostgreSQL; then we realised it's much more than that!
What are hypertables?
The main feature of TimescaleDB is hypertables. These are PostgreSQL tables that automatically partition time-series data by time and, optionally, by other dimensions. When running a query against a hypertable, TimescaleDB identifies the correct partition (in TimescaleDB terminology - a chunk) and runs the query against it, instead of against the entire table.
Hypertables therefore improve performance and enable better data management - no need for long and error-prone PostgreSQL scripts!
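As a quick illustration, here is a minimal sketch of how a hypertable might be set up. The table and column names here are my own invention, not from our project:

```sql
-- Create a regular PostgreSQL table first.
CREATE TABLE metrics (
    time      TIMESTAMPTZ NOT NULL,
    device_id INTEGER     NOT NULL,
    value     REAL
);

-- Turn it into a hypertable, partitioned by the time column.
SELECT create_hypertable('metrics', 'time');

-- Optionally, let TimescaleDB drop chunks older than 90 days -
-- this replaces hand-written retention scripts.
SELECT add_retention_policy('metrics', INTERVAL '90 days');
```

From here on, you query `metrics` like any other PostgreSQL table; the chunking happens behind the scenes.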
The real beauty, besides the performance improvements, lies in the compression of these tables. When properly done, it can bring down the table size quite significantly, as well as improve the query speed.
If you want to learn more about hypertables follow this link.
Simple things about compression
Enabling compression is quite simple and consists of two steps:

- Decide on the compression segmenting for the table:

```sql
ALTER TABLE example SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id'
);
```

- Add a compression policy:

```sql
SELECT add_compression_policy('example', INTERVAL '7 days');
```

Not so simple things about the compression
What is not as simple as running the two SQL queries against the DB is deciding on the ordering and segmenting of the data. That is - if you're not that familiar with your data/table structure.
Here is what happens when you enable compression:
When you enable compression, the data in your hypertable is compressed chunk by chunk. When the chunk is compressed, multiple records are grouped into a single row. The columns of this row hold an array-like structure that stores all the data. This means that instead of using lots of rows to store the data, it stores the same data in a single row. Because a single row takes up less disk space than many rows, it decreases the amount of disk space required, and can also speed up your queries.
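Conceptually, the rewrite looks something like this (a simplified illustration of the idea, not the actual on-disk format):

```
Before compression (one row per reading):

  device_id | time             | value
  ----------+------------------+------
  42        | 2024-01-01 00:00 | 1.5
  42        | 2024-01-01 00:01 | 1.7
  42        | 2024-01-01 00:02 | 1.6

After compression (one row per segment, array-like columns):

  device_id | time                  | value
  ----------+-----------------------+----------------
  42        | [00:00, 00:01, 00:02] | [1.5, 1.7, 1.6]
```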
The ordering and segmenting here are important because they have a great impact on the compression ratio and on the performance of queries against the hypertables.
Segmenting the compressed data should be based on the way you access the data. Basically, you want to segment your data in such a way that your queries can fetch the right data at the right time. In other words, your queries should dictate how you segment the data, so that they can be optimised and yield even better performance.
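For example, if most queries filter on a device identifier and then scan by time, the segmenting and ordering could be set up like this (a sketch with hypothetical table and column names):

```sql
ALTER TABLE metrics SET (
    timescaledb.compress,
    -- Queries filter on device_id, so segment by it:
    -- all rows for one device land in the same compressed batches,
    -- and queries only decompress the batches they need.
    timescaledb.compress_segmentby = 'device_id',
    -- Within a segment, order by time descending, matching the
    -- typical "latest values first" access pattern.
    timescaledb.compress_orderby = 'time DESC'
);
```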
A great document that explains this in detail, and which I have referenced a lot here, can be found here
Our experience
We started looking into compression when the DBAs reached out to us and said - these tables are too big for the current DB instance, let's explore and use compression on them.
Our first thought was - wouldn't that add additional performance toll on the DB host? Maybe. Let's test it out and see.
Our second thought - okay, let's see which tables we need to compress. We selected 4 of the biggest ones and started our analysis.
For reference purposes, I'll use an example table, which is a hypertable with 90 days of retention and more than 1TB in size. I'll call this table, creatively enough, table-0.
This table had the following columns:
- id (int4),
- sub_id (int4),
- value (float4),
- name (varchar(255)),
- sub_name (varchar(255)),
- status (int2) and
- type (int2).
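Sketched as DDL, the table would look roughly like this. Note that the time column is my assumption - a hypertable always needs a time dimension, even though it isn't listed above:

```sql
-- A sketch of table-0's schema; the "time" column is assumed.
CREATE TABLE "table-0" (
    time     TIMESTAMPTZ  NOT NULL,
    id       INT4,
    sub_id   INT4,
    value    FLOAT4,
    name     VARCHAR(255),
    sub_name VARCHAR(255),
    status   INT2,
    type     INT2
);

SELECT create_hypertable('table-0', 'time');
SELECT add_retention_policy('table-0', INTERVAL '90 days');
```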
Setting the proper segmentation
In the beginning we struggled to set the proper segmentation. We noticed a significant difference in the size-to-performance ratio depending on whether we segmented on the more granular but rarely queried data, or on the less granular but frequently accessed data.
The first option gave more compression, but performance degraded. The second option gave less compression, but better performance.
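To compare candidate segmenting choices, the achieved ratio can be checked per chunk with TimescaleDB's built-in stats function (a sketch, assuming a hypertable named metrics):

```sql
-- Compare the before/after size of each compressed chunk.
SELECT
    chunk_name,
    pg_size_pretty(before_compression_total_bytes) AS before,
    pg_size_pretty(after_compression_total_bytes)  AS after
FROM chunk_compression_stats('metrics')
WHERE compression_status = 'Compressed';
```

Running this after each candidate configuration, alongside timing the hot queries, makes the size-to-performance trade-off concrete.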
Wrong column type
Another problem we experienced was that certain queries were failing with the error below:

```
SQL Error [XX000]: ERROR: a variable with non-vectorizable type character varying is marked as vectorized
Detail: Assertion 'is_vector_type(var->vartype)' failed.
```

Although not explicitly stated in the documentation, going through the GitHub issues we discovered that segmenting by a varchar column is not recommended, and that we should change the type to text instead, because of the way TimescaleDB implements compression.
> In some places it's bad practice for Postgres too. You should use text instead.
We found info about this here, here, and here.
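The type change itself is straightforward in plain PostgreSQL, since varchar and text share the same on-disk representation (a sketch with hypothetical names; note that on an already-compressed hypertable you may need to decompress or disable compression on the affected chunks first, depending on your TimescaleDB version):

```sql
-- varchar and text are binary-compatible in PostgreSQL,
-- so this does not rewrite the table.
ALTER TABLE metrics ALTER COLUMN name TYPE TEXT;
```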
What are the results we achieved?
The biggest improvement we made was that the underlying hypertable shrank from more than 1TB to ~200GB! We were amazed!
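If you want to verify a reduction like this yourself, the hypertable-level stats function gives the overall numbers (again a sketch, assuming a hypertable named metrics):

```sql
SELECT
    pg_size_pretty(before_compression_total_bytes) AS before,
    pg_size_pretty(after_compression_total_bytes)  AS after
FROM hypertable_compression_stats('metrics');
```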
The second improvement was a performance boost we didn't think was possible. But it was, because of the way TimescaleDB implements compression. We noticed this improvement in query speed as well as in resource consumption, specifically CPU.
In future I might prepare some case study, where I can show you actual results, or close to actual results of implementing compression on big hypertables. At this point, I'm not able to share more.
How does all this tie into sustainability?
It was in the title, no? Jokes aside, making systems more efficient, with fewer resources and less storage, is the goal of any sustainable improvement and design. With this change, we freed up quite a big amount of storage and more than a couple of CPU cores. I would count that as a win!
In case you're interested in finding out more about compression, check out the links I've shared above, or this one, which covers ways of designing tables for compression.
See you in the next article!