This article is not a guide or tutorial, but an honest story about how we built a data platform on open-source and what came out of it. With all the pitfalls, compromises, and real numbers.
When I joined my current company, all of our analytics ran on Google Sheets and Looker Studio. Managers collected data manually, copied numbers from one spreadsheet to another, and once every two weeks someone produced a "report": essentially a Looker Studio screenshot for a presentation. Nobody understood where the numbers came from, nobody trusted them 100%, and every "why is this number different?" turned into an hour of back-and-forth.
Sound familiar? I think 80% of mid-sized companies handle analytics exactly like this. And it works, up to a point. By the way, some reports are still produced this way, just not at the same scale as before. Our breaking point came when the number of data sources grew and decisions needed to be made faster.
This article is the story of our journey from chaos to a Data Lakehouse, built entirely on open-source tools. No cloud providers, no $50k/year licenses. With all the problems, compromises, and lessons learned.
The Beginning: Google Sheets + Looker Studio
Google Sheets is the perfect tool for getting started. Free, intuitive for everyone, real-time collaboration. Looker Studio (formerly Data Studio) is a free BI system that connects to Google Sheets and shows charts.
Problems started when:
- Data volume grew. Google Sheets has a limit of 10 million cells. When you have 15+ data sources, you hit that limit faster than you'd think.
- Looker Studio was slow. Dashboards pulling data from 5 different sheets took minutes to load. And if someone was editing the source sheet at the same time — everything fell apart.
- Weak versioning. Yes, Google Sheets has change history — but who actually uses it? Someone deletes a row from a sheet, and you notice a week later. Searching through 500 revisions to find who did it and when — that's a special kind of fun.
- Spaghetti formulas. IMPORTRANGE on top of IMPORTRANGE on top of VLOOKUP from another sheet. One change — and a cascade of errors.
It became clear: we needed proper data infrastructure.
First Database: MSSQL
Why MSSQL and not PostgreSQL right away? Simply put: experience. At my previous job I worked with SQL Server, knew its strengths and weaknesses, and could set it up quickly.
Before having a proper database, getting data exports could take weeks. Literally: "can you pull this data?" — "create a task, I'll pick it up." Then revisions, then more revisions, and the carousel kept spinning. When MSSQL appeared, the time from request to answer shrank to hours, then to minutes.
That's also when Airbyte appeared — for transferring data from other databases into MSSQL. It worked well enough until a certain point, when volumes and the number of sources outgrew it.
MSSQL vs PostgreSQL:
| Criterion | MSSQL | PostgreSQL |
|---|---|---|
| License | Express free (10 GB limit), full — paid | Fully free |
| Ecosystem | Windows-centric, SSMS | Cross-platform, pgAdmin, DBeaver |
| JSON support | Basic | Native, JSONB |
| Replication | AlwaysOn (Enterprise) | Built-in, streaming |
| Community | Smaller, corporate | Huge, active |
MSSQL worked well at the start, but over time it became clear that PostgreSQL was a better fit — free, open, with better JSON support and a wider ecosystem.
PostgreSQL: The Workhorse
Then we set up PostgreSQL. Now Airbyte writes there, and whenever we need classic relational data — we write there too. PG became our default database for structured data.
PostgreSQL vs MySQL:
| Criterion | PostgreSQL | MySQL |
|---|---|---|
| SQL standards | Strict compliance | Loose interpretation |
| Complex queries | CTEs, window functions, LATERAL | Basic CTE support |
| Extensibility | Huge (PostGIS, pgvector, ...) | Limited |
| Replication | Logical + physical | Row-based, Group Replication |
| For analytics | Good | Average |
MySQL (more precisely MariaDB) is also in our stack — it powers the company's production services. But for new analytical tasks we always choose PG.
MongoDB: For Specific Use Cases
MongoDB appeared after we already had Dremio running, for specific cases: parsing the Facebook Graph API and other social media data, where the response structure can change with every request. When an API returns nested JSON of arbitrary depth, a relational database starts to suffer. MongoDB feels right at home here.
Since Dremio was already in place, the overhead of yet another source was minimal — we connected MongoDB as a source, and all databases (MSSQL, PostgreSQL, MongoDB) get the same SQL queries through Dremio. One language, one interface.
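Setting Dremio's engine aside, the "one SQL dialect over several sources" idea can be sketched with plain SQLite, which lets a single connection query two attached databases as if they were one. The two in-memory databases below stand in for two sources (say, PostgreSQL and MongoDB), and all table names are made up for the illustration.

```python
import sqlite3

# One connection, two "sources": the main schema and an attached database.
# In Dremio the sources would be PostgreSQL, MongoDB, etc.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 10, 99.5), (2, 11, 20.0)])
conn.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                 [(10, "Acme"), (11, "Globex")])

# One SQL statement spanning both "sources"
rows = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM orders o JOIN crm.customers c ON c.id = o.customer_id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme', 99.5), ('Globex', 20.0)]
```

The join crosses database boundaries without the caller caring where each table lives, which is exactly the convenience a federated engine gives you at a much larger scale.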
MongoDB vs PostgreSQL (JSONB):
| Criterion | MongoDB | PostgreSQL JSONB |
|---|---|---|
| Schema flexibility | Full, schemaless | JSON within relational framework |
| Nested documents | Native | Requires JSONB operators |
| Aggregations | Aggregation Pipeline | SQL |
| Scaling | Sharding out of the box | Citus, partitioning |
| For API parsing | Ideal | Acceptable |
Data Lakehouse: Why and What
So we already had several databases: MSSQL, PostgreSQL, MongoDB, plus data in Google Sheets, files, APIs. Classic chaos. Each source — its own format, its own access, its own problems.
Data Lakehouse is an architecture that combines the advantages of a Data Lake (storing raw data cheaply and at scale) with a Data Warehouse (SQL queries, ACID transactions, schemas). And we decided to build it entirely on open-source.
Our stack:
- MinIO — object storage, S3 API compatible. Our Data Lake
- Apache Iceberg — table format with ACID support, time travel, schema evolution
- Nessie — data catalog with Git-like versioning
- Dremio — SQL engine for querying all of this
MinIO: Your Own S3
MinIO is essentially self-hosted Amazon S3. Object storage with an S3-compatible API. Everything that works with S3 — works with MinIO.
MinIO vs Amazon S3 vs Azure Blob:
| Criterion | MinIO | Amazon S3 | Azure Blob |
|---|---|---|---|
| Cost | Free (self-hosted) | Pay-per-use | Pay-per-use |
| Control | Full | Limited | Limited |
| S3 API | Compatible | Original | Own API + S3 adaptor |
| Scaling | Manual | Automatic | Automatic |
| For us | Ideal | Too expensive for our volumes | Not needed |
For our case, MinIO is the ideal choice. We control the hardware, no traffic bills, and everything works exactly as if we were on AWS.
Apache Iceberg: Tables That Don't Break
Iceberg is a table format on top of object storage. Essentially, it lets you work with files in MinIO as if they were database tables. With ACID transactions, versioning, and schema evolution.
Iceberg vs Delta Lake vs Apache Hudi:
| Criterion | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Founded by | Netflix → Apache | Databricks | Uber → Apache |
| Vendor lock-in | None | Tight with Databricks | Less |
| Engine support | Spark, Flink, Dremio, Trino | Spark, Databricks | Spark, Flink |
| Time Travel | Yes | Yes | Yes |
| Schema Evolution | Excellent | Good | Good |
| Partitioning | Hidden partitioning | Explicit | Explicit |
| Community | Growing fast | Largest | Medium |
Iceberg won because it's the most vendor-neutral. We don't want to be tied to Databricks or anyone else.
Nessie: Git for Data
Nessie is a data catalog that works like Git. You can create branches, make commits, merge changes. For data. It sounds like science fiction, but it actually works.
Nessie vs Hive Metastore vs AWS Glue vs LakeFS:
| Criterion | Nessie | Hive Metastore | AWS Glue | LakeFS |
|---|---|---|---|---|
| Versioning | Git-like (branches, commits) | No | No | Git-like (branches, commits) |
| Self-hosted | Yes | Yes | No (AWS only) | Yes |
| Iceberg support | Native | Via plugin | Yes | Via S3 gateway |
| Complexity | Low | High (Hadoop) | Medium | Low |
| Cost | Free | Free | Pay-per-use | Free (OSS) |
| Dremio integration | Built-in, native | No | No | No |
Why Nessie and not LakeFS? LakeFS is a great tool that versions data at the object storage level (like Git for files in S3). But Nessie works at the catalog level — it versions table metadata, not the files themselves. For Iceberg tables this is more efficient. Plus, Nessie is the default catalog in all official Dremio tutorials. So integration works out of the box, no fuss.
Nessie is what holds our infrastructure together. If the data in Nessie is intact — everything else can be restored.
Dremio: The SQL Engine That Unites Everything
Dremio is the heart of our Lakehouse. It's a SQL engine that connects to all our sources: MinIO via Iceberg/Nessie, PostgreSQL, MSSQL, MongoDB — and lets you run SQL queries against all of it as if it were one database.
Dremio vs Trino vs Apache Spark SQL vs Databricks:
| Criterion | Dremio | Trino (Starburst) | Spark SQL | Databricks |
|---|---|---|---|---|
| UI | Convenient web interface | Minimal | No | Full-featured |
| Iceberg support | Native, excellent | Native (excellent) | Via connector | Native (Delta) |
| Nessie integration | Built-in | Plugin | Plugin | No |
| Self-hosted | Community Edition | Yes | Yes | No (SaaS) |
| Optimization | Reflections, Columnar Cloud Cache | Caching | RDD cache | Photon |
| Simplicity | High | Medium | Low | High |
| Access control | Basic (all admins in CE) | Full | Via Ranger | Full |
90% of our tables in Dremio are views. We deliberately minimize data copying. Instead of duplicating data into separate tables, we create views on top of sources, and for acceleration — we add Reflections (materialized aggregates that Dremio creates and refreshes automatically). This lets us keep physical copies to a minimum: aggregated data is stored as Reflections, non-aggregated data — as views with different representations. Queries against views execute fast thanks to Reflections, and data is always fresh because views read directly from the source.
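The view-versus-copy tradeoff is easy to demonstrate with any SQL engine. Here is a minimal SQLite sketch (standing in for Dremio, with invented table names): the copied table freezes at copy time, while the view always reads the source. What this sketch omits is Reflections, the materialized layer Dremio maintains automatically to keep such views fast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, amount REAL)")
conn.execute("INSERT INTO events VALUES ('2024-01-01', 10.0)")

# Physical copy: a snapshot frozen at creation time
conn.execute("CREATE TABLE events_copy AS SELECT * FROM events")
# View: just a stored query over the source, no data duplicated
conn.execute("CREATE VIEW events_view AS SELECT SUM(amount) AS total FROM events")

# New data arrives in the source
conn.execute("INSERT INTO events VALUES ('2024-01-02', 5.0)")

copy_total = conn.execute("SELECT SUM(amount) FROM events_copy").fetchone()[0]
view_total = conn.execute("SELECT total FROM events_view").fetchone()[0]
print(copy_total, view_total)  # 10.0 15.0 (the copy is stale, the view is fresh)
```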
The main problem with Dremio Community Edition — and this applies to many OSS products — is the lack of proper role management. In Community Edition all users are admins. You can't restrict access to specific datasets or grant read-only permissions. For a small team it's tolerable, but as you grow — it hurts.
For now, access to Dremio in our setup is managed on trust — the team is small, everyone understands where they can and can't go. We've partially solved this problem through our MCP gateway, which I'll describe below — it adds an ACL layer on top of Dremio and controls who has access to what. But for direct connections to Dremio the problem remains — this is one of those OSS compromises: the functionality is there, but access control — build it yourself.
This is, by the way, a typical OSS problem: the free version gives you 80% of functionality, but that critical 20% (security, RBAC, audit log) — is in the paid version. And you either pay, or live with compromises — or build workarounds.
Case: The Dremio Upgrade That Went Wrong
One day we decided to upgrade Dremio to a new version. Dremio runs in Docker Compose for us, so the process was supposed to be simple: change the image tag, run docker compose up -d, done. But something went wrong: after starting, Dremio couldn't see half the datasets. Internal metadata got corrupted during the migration. The logs were a wall of errors with no clear description of the problem; what exactly broke was unclear. We spent time debugging, but eventually decided to wipe everything and spin up a clean installation.
Moment of panic? Yes, for about thirty seconds. Then — "wait, we have Nessie." All our Iceberg tables, their schemas, change history — everything is stored in Nessie, not in Dremio. Dremio is just the query engine. We spun up a clean Dremio, reconnected the Nessie catalog, and all data came back. And since all our views were defined in dbt — one dbt run command, and all views were redeployed. Within an hour everything was working as before.
That was the moment when the architectural decision to separate compute (Dremio) and catalog (Nessie) paid for itself 200%. If we had relied on Dremio's internal catalog — recovery would have taken days, not an hour.
Case: DELETE Without WHERE, or How Nessie Saved Us in 2 Seconds
There's a classic joke about DELETE FROM table without a WHERE clause. Well, for us it wasn't a joke.
One time I ran a DELETE query on an Iceberg table and forgot to add a condition. Just DELETE FROM ... — and that's it. The table was wiped clean instantly. My heart stopped for a second, hands went cold — that table had weeks of collected data.
If this were a regular PostgreSQL or MSSQL — it would have been game over. Backup? Maybe yesterday's. But today's data — gone. Best case — hours of restoring from backup, checking integrity, manually backfilling fresh data.
But we have Nessie. And Nessie is Git for data. Every change is a commit. Every commit has a hash. I simply rolled back the branch to the previous commit — and within 2 seconds all the data was back. As if nothing had happened.
Two. Seconds.
By the way, the exact same thing happened with MSSQL — DELETE without WHERE on a production table. Only there was no Nessie. No fresh backup either. Had to restore everything from scratch: bring up an old backup, compare it with production, manually backfill the data that had accumulated since. That took not seconds, but hours. And enough stress for a month ahead.
That's the moment you understand why time travel and data versioning aren't marketing buzzwords but real insurance against human error. We all make mistakes. The only question is how long recovery takes. With Nessie the answer is — seconds. With MSSQL — hours and grey hair.
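This is not Nessie's actual API, but the commit-pointer mechanics behind that two-second rollback fit in a few lines of Python: every commit is an immutable snapshot with a hash, a branch is just a pointer into the commit log, and rollback merely moves the pointer. Nothing is rewritten, which is why it is nearly instant.

```python
import hashlib

class ToyBranch:
    """A branch is a pointer into an append-only list of table snapshots."""
    def __init__(self):
        self.snapshots = []   # immutable table states
        self.hashes = []      # commit hashes
        self.head = -1

    def commit(self, rows):
        state = tuple(rows)   # freeze the table state
        h = hashlib.sha1(repr(state).encode()).hexdigest()[:8]
        # committing after a rollback discards the later snapshots,
        # like resetting a branch in Git
        self.snapshots = self.snapshots[: self.head + 1] + [state]
        self.hashes = self.hashes[: self.head + 1] + [h]
        self.head = len(self.snapshots) - 1
        return h

    def table(self):
        return list(self.snapshots[self.head]) if self.head >= 0 else []

    def rollback(self, n=1):
        self.head -= n        # just move the pointer; the data is still there

branch = ToyBranch()
branch.commit([("2024-01-01", 10.0), ("2024-01-02", 5.0)])
branch.commit([])             # the accidental DELETE without WHERE
branch.rollback()             # move HEAD back one commit
print(branch.table())         # [('2024-01-01', 10.0), ('2024-01-02', 5.0)]
```

Real Nessie tracks metadata for whole Iceberg tables rather than rows, but the recovery story is the same: the "deleted" data was never gone, only unreferenced.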
Airbyte: Data Ingestion
Airbyte is our tool for pulling data from various sources. Currently it pulls tables from Google Sheets and writes to PostgreSQL. Sometimes we use the no-code connector builder — when you can't be bothered writing an asset in Dagster and need to quickly connect a new source.
Airbyte OSS vs Fivetran vs Meltano vs dlt:
| Criterion | Airbyte OSS | Fivetran | Meltano | dlt |
|---|---|---|---|---|
| License | Open-source | SaaS, paid | Open-source | Open-source |
| Connectors | 300+ | 400+ | 300+ (Singer) | Library |
| UI | Web interface | Web interface | CLI | Code |
| No-code builder | Yes | No | No | No |
| Self-hosted | Yes | No | Yes | Yes (it's a library) |
Airbyte OSS problems we've encountered:
- Resource-hungry. Right now Airbyte eats 12 GB of RAM and does essentially nothing. We're planning to drop it. It struggles with large data volumes — a sync can run for days without throwing a single error. Just quietly chugging along.
- Unstable connectors. Some community connectors break after updates. Google Sheets sync drops from time to time for no apparent reason and needs to be restarted.
- Difficult debugging. When something fails — logs are scattered across a dozen Docker containers. Finding what exactly went wrong sometimes takes longer than the fix itself.
- Updates. Every Airbyte update is a lottery. It might go smoothly, or it might break connectors that worked for months.
But Airbyte is still the best option for us, because Fivetran costs so much that it's cheaper to hire another data engineer.
Dagster: Orchestration
Dagster is our orchestrator. It defines what runs, when, and in what order. Every pipeline is a set of assets that Dagster tracks, runs on schedule or by trigger.
Dagster vs Apache Airflow vs Prefect vs Mage:
| Criterion | Dagster | Airflow | Prefect | Mage |
|---|---|---|---|---|
| Concept | Software-defined assets | DAGs of tasks | Flows & tasks | Blocks & pipelines |
| UI | Modern, convenient | Functional, dated | Cloud-based | Notebook-style |
| Testing | Built-in | Difficult | Medium | Basic |
| Typing | Strong (IO managers, types) | Weak | Medium | Weak |
| dbt integration | Native (dagster-dbt) | Via operators | Via blocks | Basic |
| Complexity | Medium | High | Low | Low |
Why Dagster and not Airflow? Because Airflow is 2015. DAGs, operators, XCom — it all works, but feels like legacy. Dagster is built around the concept of assets — you describe not "what to run" but "what data to produce." It fundamentally changes the approach.
I want to specifically highlight Dagster's dependency injection mechanism — it's really well done and very intuitive. It somehow reminds me of university days and Dagger 2 for Android, where everything was maximally confusing. Here — everything is immediately clear, everything makes sense.
Another Dagster plus — native support for running Airbyte, Airflow, and other tools. So if you need to migrate from another orchestrator or integrate an external tool — it's dead simple, no hacks needed.
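The real Dagster API uses the @asset decorator, but the core idea, declaring what data exists and deriving the execution order from dependencies, fits in a short pure-Python sketch. Everything here is a toy (the asset names are invented, and this decorator is not Dagster's): dependencies are read from parameter names, then materialized in topological order.

```python
import inspect
from graphlib import TopologicalSorter

# Registry of "assets": each function's parameters name the assets it
# depends on, in the spirit of Dagster's @asset (this is NOT the Dagster API).
ASSETS = {}

def asset(fn):
    ASSETS[fn.__name__] = fn
    return fn

@asset
def raw_orders():
    return [("Acme", 99.5), ("Globex", 20.0)]

@asset
def order_totals(raw_orders):
    return sum(amount for _, amount in raw_orders)

def materialize():
    deps = {name: list(inspect.signature(fn).parameters)
            for name, fn in ASSETS.items()}
    results = {}
    # Resolve dependency order automatically and run each asset once
    for name in TopologicalSorter(deps).static_order():
        results[name] = ASSETS[name](*(results[d] for d in deps[name]))
    return results

print(materialize()["order_totals"])  # 119.5
```

Note that nothing above says "run order_totals after raw_orders"; the order falls out of the declared data dependencies, which is the mental shift Dagster asks for.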
dbt: Transformation
dbt handles transformations. SQL models that turn raw data into analytical tables. dbt works with Dremio through an adapter, letting us build models on top of Iceberg tables.
dbt vs Dataform vs SQLMesh:
| Criterion | dbt | Dataform | SQLMesh |
|---|---|---|---|
| Community | Huge | Smaller (Google) | Growing |
| Adapters | 30+ | BigQuery-centric | 10+ |
| Versioning | Git | Git | Git + automatic |
| Testing | Built-in | Built-in | Extended |
| Incremental | Standard | Standard | Advanced |
| Price | Core is free | Free (GCP) | Free |
Important nuance: right now Dagster and dbt are not connected. dbt runs on push via GitLab CI/CD pipeline — you update a SQL model, push, and GitLab runs dbt run against Dremio. It works, and there's no urgent need to integrate dbt into Dagster. But we're planning to — it's the logical next step toward having a single orchestration point.
As of today, our dbt project consists of 445 models, 848 tests, 103 sources, and 652 macros. These aren't just SQL files — it's a full-fledged codebase with versioning, code review, and CI/CD.
A separate bonus: dbt generates documentation that we push to the Dremio wiki with a link to dbt docs.
MCP Gateway: AI Client for Dremio
One of the most interesting things in our stack is an MCP server (Model Context Protocol) connected to Dremio. Essentially, it's a gateway through which AI assistants (Claude, Cursor, any MCP client) can access our Lakehouse — run SQL queries, read table metadata, browse dbt documentation.
But the main feature isn't that. The main feature is access control. As I mentioned, Dremio Community Edition lacks proper role management — all users are essentially admins. The MCP gateway solves this problem: it has its own ACL layer that controls who has access to which datasets. Every request from an AI client goes through a permissions check — and if you don't have access to a certain space or table, you simply won't see it.
In other words, what Dremio itself lacks, we built at the gateway level. This allows us to safely grant data access to different people and AI agents — everyone sees only what they're permitted to.
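Our gateway's actual implementation differs, but the gist of a gateway-level ACL on top of Dremio can be sketched in a few lines. The spaces, principals, and prefix rules below are all hypothetical: each principal gets a list of dataset-path prefixes, and anything that doesn't match is invisible.

```python
# A minimal sketch of a gateway-level ACL, assuming space-prefixed dataset
# paths like "finance.payroll"; all names and rules here are hypothetical.
ACL = {
    "analyst":  ["marketing.", "sales."],   # a prefix grants a whole space
    "ai-agent": ["sales.public_"],          # or just matching datasets
}

def allowed(principal: str, dataset: str) -> bool:
    """Check whether a user or AI agent may see a dataset."""
    return any(dataset.startswith(p) for p in ACL.get(principal, []))

def visible(principal, datasets):
    # The gateway filters listings too: what you can't query, you don't see.
    return [d for d in datasets if allowed(principal, d)]

datasets = ["marketing.spend", "sales.public_deals", "finance.payroll"]
print(visible("analyst", datasets))   # ['marketing.spend', 'sales.public_deals']
print(visible("ai-agent", datasets))  # ['sales.public_deals']
print(visible("unknown", datasets))   # []
```

Because every MCP request passes through this one check, the (missing) Dremio-side permissions never get a chance to matter for AI clients; direct SQL connections to Dremio still bypass it, as noted above.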
Apache Superset: BI for Everyone
We replaced Looker Studio with Apache Superset. It's a BI platform originally created at Airbnb and donated to the Apache Software Foundation. It connects to any SQL database — including Dremio.
Superset vs Metabase vs Looker Studio vs Grafana:
| Criterion | Superset | Metabase | Looker Studio | Grafana |
|---|---|---|---|---|
| License | Apache 2.0 | AGPL / Commercial | Free (Google) | AGPL |
| SQL Lab | Full-featured | Yes, simplified | No | Explore |
| Visualizations | 50+ types | 15+ | 30+ | For metrics |
| Self-hosted | Yes | Yes | No | Yes |
| Customization | High | Medium | Low | High |
| Access control | Full (RBAC) | Basic | Google accounts | Org/Team |
| For analytics | Ideal | Good | Basic | For monitoring |
Superset is tedious to set up — the initial setup is a quest of its own, the UI glitches sometimes, some charts require acrobatics, and the documentation... let's say it exists. But once everything is configured — it's great. For a free product — it's the best BI option we've found.
The OSS Problem: Where "Free" Ends
The biggest myth about open-source — "it's free." Yes, licenses are free. But:
- Setup time. Every tool needs to be deployed, configured, integrated. That's weeks of work.
- No support. When something breaks — you're on your own. GitHub Issues, Stack Overflow, Discord — and pray someone answers.
- Role management. As I mentioned, in Dremio CE everyone's an admin. Airbyte OSS isn't great with permissions either. Superset has proper RBAC, but configuring it is a quest in itself.
- Updates. Every update of every component is a potential breaking change. And nobody guarantees that a new major Airbyte version will be compatible with your config — we felt this firsthand when upgrading to 2.0.
But at the same time — we fully control our infrastructure. No vendor lock-in. No surprise AWS bills. And most importantly — we understand every component of the system, because we set it up ourselves.
Conclusions
Our journey from Google Sheets to Data Lakehouse took about two years. Not because it's hard to do all at once — but because each tool was added as problems and needs arose. It was an organic process with mistakes, rollbacks, and "let's try this instead." But now we have:
- A single point of access to data — Dremio, where all sources are unified
- Data versioning — Nessie, where every change is tracked
- Reliable storage — MinIO + Iceberg, where data is stored in an open format
- Automation — Dagster for orchestration, dbt for transformations via GitLab CI/CD
- BI for everyone — Superset, where managers build dashboards themselves
Is this system perfect? No. There are issues with roles, bugs in OSS products, moments when you just want to pay for SaaS and not think about it. But for our scale and budget — it's the best option we could build.
And the main takeaway: data is not about tools, it's about processes. The coolest stack won't help if people don't understand why it's needed. We started simple — showed how to get answers to questions faster. Everything else followed naturally.