Spark Review

Last updated: October 3, 2024

Spark is a popular and versatile tool for data movement and processing. It even does a pretty good job of acting as a data warehouse, using a pattern called a data lakehouse.

Spark itself is open source, and many organizations self-manage the open-source version. However, the people who originally authored the software started a company called Databricks, which has become almost synonymous with the project.

For me, the primary use case for Spark is extracting semi-structured or unstructured data, processing it, and writing it to object storage (such as S3) to create a data lakehouse. This is also a key task in creating an AI training dataset.
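
To make that concrete, here is a rough PySpark sketch of the kind of job I mean. Everything specific in it is an assumption for illustration: the bucket layout, the helper names lake_path and load_events, and the JSON fields (user.id, event_type, ts) are hypothetical, and a real job would need S3 credentials and the S3A connector configured.

```python
def lake_path(bucket: str, table: str) -> str:
    """Build an s3a:// URI for a table directory (hypothetical layout)."""
    return f"s3a://{bucket.strip('/')}/{table.strip('/')}/"


def load_events(spark, source_uri: str, target_uri: str) -> None:
    """Read raw JSON, flatten a few fields, and write partitioned Parquet.

    `spark` is an existing SparkSession. The field names below are
    assumptions about the input, not a real schema.
    """
    # Imported lazily so this module can be imported without Spark installed.
    from pyspark.sql import functions as F

    # Extract: Spark infers a schema from the semi-structured JSON.
    raw = spark.read.json(source_uri)

    # Process: pull nested fields up into flat, typed columns.
    events = raw.select(
        F.col("user.id").alias("user_id"),
        F.col("event_type"),
        F.to_timestamp("ts").alias("event_ts"),
    ).withColumn("event_date", F.to_date("event_ts"))

    # Load: columnar files in object storage, partitioned by day.
    events.write.mode("overwrite").partitionBy("event_date").parquet(target_uri)
```

You would call it with something like load_events(spark, lake_path("my-bucket", "raw/events"), lake_path("my-bucket", "clean/events")). Writing Parquet partitioned by date is one common choice for this kind of layout, not the only one.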

If your data is already in a relational format (for example, if it came from a database), I’d recommend looking at a traditional data warehouse instead.

Spark, and by extension Databricks, is frequently compared to Snowflake. However, I think the two products complement each other and aren’t really direct competitors.

Pros

Cons