Last updated: October 3, 2024
There is no consensus definition of a data lakehouse, but my interpretation of the concept is a program that reads and writes columnar, relational data to an object storage system, such as S3.
In contrast to a traditional data warehouse, which enforces strict controls over data consistency, user access, and many other safety features, data lakehouse tools read and write data that other systems can access directly. This makes them powerful, but less safe and reliable. For example, early in my career, I implemented an entire data lakehouse proof of concept with Presto only to find that it had no built-in way of creating users with limited access to certain tables. Spark has the same limitation.
If you’re working with data that is already relational by nature (from a transactional, relational database, for example), I would recommend using a traditional data warehouse. However, if you have vast stores of unstructured or semi-structured data, such as web logs, scientific data, or anything else in a format Spark or Trino can read, then building a data lakehouse may be the best way to get some value out of that data.
In my opinion, there are only two data lakehouse technologies worth considering at this time: Spark and Trino (or Presto).
Spark is the most popular and feature-rich data lakehouse technology. There are many vendors offering managed versions of Spark, including Databricks and all major public cloud platforms.
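To make the lakehouse idea concrete, here is a minimal PySpark sketch of the read-and-write loop I described above: raw, semi-structured logs come in from object storage and columnar Parquet goes back out. The bucket and paths are hypothetical placeholders, and I’m assuming the cluster is already configured with the S3 (s3a) connector.

```python
# A minimal sketch, assuming a Spark cluster that already has the
# hadoop-aws / s3a connector configured. Bucket and paths are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

# Read semi-structured JSON web logs straight from object storage.
logs = spark.read.json("s3a://my-bucket/raw/web-logs/")

# Apply a simple relational transformation.
status_counts = logs.groupBy("status").count()

# Write the result back to S3 as columnar Parquet.
status_counts.write.mode("overwrite").parquet("s3a://my-bucket/curated/status-counts/")
```

Nothing here is Spark-specific in spirit; it’s just the “read and write columnar, relational data to object storage” loop from the definition above, which is why Spark fits the lakehouse label so naturally.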
Trino is the next most popular data lakehouse technology and has an interesting take on the concept. It does not manage data directly, but uses connectors to instruct other data systems to read and write data on its behalf.
This “federated query” concept has some interesting implications. Even though Spark is more popular, you should still evaluate Trino to see whether federation is valuable enough to sway your decision. Also, while Spark does just fine as an ad-hoc query engine, Trino is geared more specifically toward that use case; a sketch of what a federated query looks like follows below.
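Here is a rough sketch of what federation looks like in practice, using Trino’s Python client (the trino package). The coordinator address, catalogs, schemas, and table names are all hypothetical; the point is that one SQL statement can join data that lives behind two different connectors.

```python
# A minimal sketch, assuming a running Trino coordinator with a Hive
# catalog (tables on S3) and a PostgreSQL catalog configured. All
# names below are hypothetical.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # hypothetical coordinator address
    port=8080,
    user="analyst",
    catalog="hive",
    schema="web",
)

cur = conn.cursor()
# Join a Hive table (backed by S3) against a PostgreSQL table in a
# single query; Trino pushes the work to each connector on our behalf.
cur.execute("""
    SELECT u.country, count(*) AS page_views
    FROM hive.web.page_views v
    JOIN postgresql.public.users u ON v.user_id = u.id
    GROUP BY u.country
    ORDER BY page_views DESC
""")
for row in cur.fetchall():
    print(row)
```

If that kind of cross-system join shows up regularly in your workloads, federation alone may be enough reason to pick Trino over Spark.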