My Boring Yet Modern Data Stack
We have a data stack that we have been using for years now. We have used it with medium to large customers, and it has worked very well. The goal has always been the same: simple, stable, composable tools that can run on a developer's machine and scale to handle massive data in production. You can self-host them, run them in the cloud, or use managed services, depending on your needs.
It's very similar to my web stack. It's called "Boring" not because it's dull but because it holds minimal unwanted surprises. So my current data stack looks like this. This stack is both "Modern" and "Boring."
Mostly, our OLTP systems. We also ingest some third-party data from APIs and CSV exports. Otherwise, it's pretty standard: databases like MariaDB, PostgreSQL, MongoDB, and some SQLite are the primary sources of truth.
We mainly use ClickHouse here. Data gets ingested into ClickHouse in various ways. We use Apache Parquet as our data lake format; it compresses well and can be queried quickly. We usually store the files on some form of S3-compatible file system. I have shown the S3-compatible file system as part of the Infra group in the diagram, but it is used in many places.
DuckDB is a relatively new addition to the stack. It's similar to SQLite in the sense that it is lightweight, serverless, and stable, but in the OLAP world. We also use it extensively to explore CSVs and Parquet files.
We write most of our scripts, ETLs, etc., in Python. Python is the primary language, but you might also see some Bash scripts. Workflows and pipelines (DAGs) are built, run, scheduled, and monitored using Digdag (think of it as Apache Airflow, but simpler). Each step of a DAG runs inside an isolated Docker environment. We can run it locally or on ECS. All the DAGs are maintained as code inside the Git repo. Digdag can run any command or script; it's neutral about it. We mostly write steps in Python, especially when there is no built-in or plugin-supported operator for the job.
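To make that concrete, here is a minimal sketch of what a Digdag workflow file can look like. The task names, scripts, and image are hypothetical; the shape (tasks prefixed with `+`, a `sh>:` operator, and a per-task `docker:` option) follows Digdag's documented syntax:

```yaml
# daily_etl.dig — hypothetical workflow for illustration
timezone: UTC

schedule:
  daily>: 02:00:00

+extract:
  docker:
    image: our-etl:latest   # each step runs in an isolated container
  sh>: python scripts/extract.py

+load:
  docker:
    image: our-etl:latest
  sh>: python scripts/load.py
```

Because this is just a text file in the Git repo, reviewing a pipeline change is the same as reviewing any other code change.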
We use Embulk with Digdag when the job is about loading data or converting between different data formats or systems. It also supports some reformatting, renaming, filtering, etc. You would be surprised how much of data work is just this. It has plugin support for various inputs, outputs, encodings, filters, etc., so most things are covered.
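A minimal Embulk config gives a feel for how declarative this work is. The paths, columns, and connection details below are illustrative, and the rename filter and PostgreSQL output come from separately installed plugins (embulk-filter-rename, embulk-output-postgresql):

```yaml
in:
  type: file
  path_prefix: data/sales_        # picks up data/sales_*.csv
  parser:
    type: csv
    skip_header_lines: 1
    columns:
      - {name: id, type: long}
      - {name: sold_at, type: timestamp, format: '%Y-%m-%d'}
      - {name: amount, type: double}
filters:
  - type: rename
    columns:
      amount: total_amount
out:
  type: postgresql
  host: localhost
  database: analytics
  table: sales
  mode: insert
```

Swapping the source or destination is usually just a matter of changing the `in:` or `out:` block to a different plugin.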
For the business front end, we use Metabase. It's been very stable and very business-user-friendly. Engineers use DBeaver, which works with almost all databases (both OLTP and OLAP). We also use the Jinja2 templating system to generate SQL, and sometimes reports. You would be surprised what good-looking reports a few Jinja2 templates and some CSS can produce. LibreOffice Calc is always there for business users and engineers to explore the data.
DAGs, Dockerfiles, code, templates, SQL: everything we generate, including documentation, is maintained in Git. Everyone on the data team uses a Linux machine. I have not mentioned the many GNU/Linux tools they use daily; I consider them part of the infra at this point and bucket them under Linux. Almost everything is standardized to work with Docker (or Podman); for example, every step of a DAG runs inside a Docker image. This makes testing and deployment easy. We do use a standard file system, but more often than not, it's some form of S3-compatible bucket. Sources for imports/exports, storage, short-term storage, sharing, etc. are all handled over S3.
Currently, our documentation is basic and manual. I want a solid metadata management and cataloging/discovery system like OpenDataDiscovery or OpenMetadata, but everything seems complex. Essentially, I want a Metabase equivalent for metadata management and data discovery. I am still evaluating and will update you if I find a good candidate. Please send me your suggestions.