Last updated: October 3 2024
A data warehouse is a database that has been optimized for long running analytical queries. This is in contrast to transactional databases, which are optimized for fast, transactional queries.
There are two main types of relational databases, OLTP and OLAP. OLTP stands for “online transaction processing”, and OLAP stands for “online analytical processing”. Data warehouses are OLAP databases.
The basic difference between these two kinds of systems is that OLTP databases write their data row-wise and OLAP databases write their data column wise. Databases that write their data in a row-wise fashion can update small amounts of data very quickly, whereas columnar databases are optimized for long running queries on huge amounts of data.
A relatively new trend is databases being able to write tables in a row oriented or column oriented fashion. This is sometimes called called Hybrid transactional and analytical processing, or HTAP. I predict that HTAP will be the dominant way to store relational data sometime in the near future.
I split data warehouses into two tiers:
AWS, Azure, and GCP all offer a modern data warehouse service (Redshift, Synapse, and Bigquery, respectively). Snowflake is available on all 3 of those cloud services, as well.
All 4 of the services mentioned above are good options. You can’t go wrong with any of them, so you should choose one that is available in the cloud you already use.
Snowflake is the most popular data warehouse in the world for good reason. It’s fast, multi cloud, has great UX/UI, awesome docs, and is backed up by a top notch sales and support team.
Anyone who needs a data warehouse should consider it.
Bigquery offers similar performance and features to Snowflake, but isn’t multi cloud, and has a fundamentally different pricing model.
I would consider using Bigquery over Snowflake if I was already using GCP as my main cloud vendor and an analysis of my use case predicts significant cost savings.
Synapse, like Bigquery, offers similar performance and features to Snowflake, but isn’t multi cloud. It does, however, have my favorite pricing model out of the top data warehouse services.
I would consider using Synapse over Snowflake if I was already using Azure as my main cloud vendor and an analysis of my use case predicts significant cost savings.
There are two versions of Redshift: Serverless and provisioned. No one should adopt the provisioned version as it is not a modern design pattern and may limit you in the future.
The serverless version is a solid piece of tech and has fantastic integration with other AWS services, but costs over $2,000 a month up front. That’s before you even start paying for compute and storage resources!
Because of this, only large orgs who are already using AWS as their main cloud vendor should consider Redshift.
The rest of these are not modern data warehouses that separate storage from compute. Rather, they are just databases that happen to be able to read and write data in a columnar fashion.
You should only use these if you know that your data warehousing needs will not grow very large and your team already has some working knowledge of the system in question.
I would avoid MariaDB in particular, because the other 3 all have some compatibility with at least one of the modern data warehouses above. For example, SQL Server is compatible with Synapse, and AlloyDB and Citus are (somewhat) compatible with Redshift. This means you could migrate to those data warehouses with relatively less effort.