This article is not a guide or tutorial, but an honest story about how we built a data platform on open-source and what came out of it. With all the pitfalls, compromises, and real numbers.
When I joined my current company, all of our analytics ran on Google Sheets and Looker Studio. Managers collected data manually, copied numbers from one spreadsheet to another, and once every two weeks someone produced a "report": essentially a Looker Studio screenshot for a presentation. Nobody understood where the numbers came from, nobody trusted them 100%, and every "why is this number different?" turned into an hour of back-and-forth.
Sound familiar? I think 80% of mid-sized companies handle analytics exactly like this. And it works, up to a point. By the way, some reports are still produced this way, just not at the same scale as before. Our breaking point came when the number of data sources grew and decisions needed to be made faster.
This article is the story of our journey from chaos to a Data Lakehouse, built entirely on open-source tools. No cloud providers, no $50k/year licenses. With all the problems, compromises, and lessons learned.
The Beginning: Google Sheets + Looker Studio
Google Sheets is the perfect tool for getting started. Free, intuitive for everyone, real-time collaboration. Looker Studio (formerly Data Studio) is a free BI system that connects to Google Sheets and shows charts.
Problems started when:
- Data volume grew. Google Sheets has a limit of 10 million cells. When you have 15+ data sources, you hit that limit faster than you'd think.
- Looker Studio was slow. Dashboards pulling data from 5 different sheets took minutes to load. And if someone was editing the source sheet at the same time — everything fell apart.
- Weak versioning. Yes, Google Sheets has change history — but who actually uses it? Someone deletes a row from a sheet, and you notice a week later. Searching through 500 revisions to find who did it and when — that's a special kind of fun.
- Spaghetti formulas. IMPORTRANGE on top of IMPORTRANGE on top of VLOOKUP from another sheet. One change — and a cascade of errors.
It became clear: we needed proper data infrastructure.
First Database: MSSQL
Why MSSQL and not PostgreSQL right away? Simply put: experience. At my previous job I worked with SQL Server, knew its strengths and weaknesses, and could set it up quickly.
Before having a proper database, getting data exports could take weeks. Literally: "can you pull this data?" — "create a task, I'll pick it up." Then revisions, then more revisions, and the carousel kept spinning. When MSSQL appeared, the time from request to answer shrank to hours, then to minutes.
That's also when Airbyte appeared — for transferring data from other databases into MSSQL. It worked well enough until a certain point, when volumes and the number of sources outgrew it.
MSSQL vs PostgreSQL:
| Criterion | MSSQL | PostgreSQL |
|---|---|---|
| License | Express free (10 GB limit), full — paid | Fully free |
| Ecosystem | Windows-centric, SSMS | Cross-platform, pgAdmin, DBeaver |
| JSON support | Basic | Native, JSONB |
| Replication | AlwaysOn (Enterprise) | Built-in, streaming |
| Community | Smaller, corporate | Huge, active |
MSSQL worked well at the start, but over time it became clear that PostgreSQL was a better fit — free, open, with better JSON support and a wider ecosystem.
PostgreSQL: The Workhorse
Then we set up PostgreSQL. Now Airbyte writes there, and whenever we need classic relational data — we write there too. PG became our default database for structured data.
PostgreSQL vs MySQL:
| Criterion | PostgreSQL | MySQL |
|---|---|---|
| SQL standards | Strict compliance | Loose interpretation |
| Complex queries | CTEs, window functions, LATERAL | Basic CTE support |
| Extensibility | Huge (PostGIS, pgvector, ...) | Limited |
| Replication | Logical + physical | Row-based, Group Replication |
| For analytics | Good | Average |
MySQL (more precisely MariaDB) is also in our stack — it powers the company's production services. But for new analytical tasks we always choose PG.
MongoDB: For Specific Use Cases
MongoDB appeared after we already had Dremio running, for specific cases: parsing the Facebook Graph API and other social media data, where the response structure can change with every request. When an API returns nested JSON of arbitrary depth, a relational database starts to suffer. MongoDB feels right at home here.
Since Dremio was already in place, the overhead of yet another source was minimal — we connected MongoDB as a source, and all databases (MSSQL, PostgreSQL, MongoDB) get the same SQL queries through Dremio. One language, one interface.
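Setting Dremio's engine aside, the "one SQL dialect over several sources" idea can be sketched with plain SQLite, which lets a single connection query two attached databases as if they were one. The two in-memory databases below stand in for two sources (say, PostgreSQL and MongoDB), and all table names are made up for the illustration.

```python
import sqlite3

# One connection, two "sources": the main schema and an attached database.
# In Dremio the sources would be PostgreSQL, MongoDB, etc.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 10, 99.5), (2, 11, 20.0)])
conn.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                 [(10, "Acme"), (11, "Globex")])

# One SQL statement spanning both "sources"
rows = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM orders o JOIN crm.customers c ON c.id = o.customer_id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme', 99.5), ('Globex', 20.0)]
```

The join crosses database boundaries without the caller caring where each table lives, which is exactly the convenience a federated engine gives you at a much larger scale.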
MongoDB vs PostgreSQL (JSONB):
| Criterion | MongoDB | PostgreSQL JSONB |
|---|---|---|
| Schema flexibility | Full, schemaless | JSON within relational framework |
| Nested documents | Native | Requires JSONB operators |
| Aggregations | Aggregation Pipeline | SQL |
| Scaling | Sharding out of the box | Citus, partitioning |
| For API parsing | Ideal | Acceptable |
Data Lakehouse: Why and What
So we already had several databases: MSSQL, PostgreSQL, MongoDB, plus data in Google Sheets, files, APIs. Classic chaos. Each source — its own format, its own access, its own problems.
Data Lakehouse is an architecture that combines the advantages of a Data Lake (storing raw data cheaply and at scale) with a Data Warehouse (SQL queries, ACID transactions, schemas). And we decided to build it entirely on open-source.
Our stack:
- MinIO — object storage, S3 API compatible. Our Data Lake
- Apache Iceberg — table format with ACID support, time travel, schema evolution
- Nessie — data catalog with Git-like versioning
- Dremio — SQL engine for querying all of this
MinIO: Your Own S3
MinIO is essentially self-hosted Amazon S3. Object storage with an S3-compatible API. Everything that works with S3 — works with MinIO.
MinIO vs Amazon S3 vs Azure Blob:
| Criterion | MinIO | Amazon S3 | Azure Blob |
|---|---|---|---|
| Cost | Free (self-hosted) | Pay-per-use | Pay-per-use |
| Control | Full | Limited | Limited |
| S3 API | Compatible | Original | Own API + S3 adaptor |
| Scaling | Manual | Automatic | Automatic |
| For us | Ideal | Too expensive for our volumes | Not needed |
For our case, MinIO is the ideal choice. We control the hardware, no traffic bills, and everything works exactly as if we were on AWS.
Apache Iceberg: Tables That Don't Break
Iceberg is a table format on top of object storage. Essentially, it lets you work with files in MinIO as if they were database tables. With ACID transactions, versioning, and schema evolution.
Iceberg vs Delta Lake vs Apache Hudi:
| Criterion | Iceberg | Delta Lake | Hudi |
|---|---|---|---|
| Founded by | Netflix → Apache | Databricks | Uber → Apache |
| Vendor lock-in | None | Tight with Databricks | Less |
| Engine support | Spark, Flink, Dremio, Trino | Spark, Databricks | Spark, Flink |
| Time Travel | Yes | Yes | Yes |
| Schema Evolution | Excellent | Good | Good |
| Partitioning | Hidden partitioning | Explicit | Explicit |
| Community | Growing fast | Largest | Medium |
Iceberg won because it's the most vendor-neutral. We don't want to be tied to Databricks or anyone else.
Nessie: Git for Data
Nessie is a data catalog that works like Git. You can create branches, make commits, merge changes. For data. It sounds like science fiction, but it actually works.
Nessie vs Hive Metastore vs AWS Glue vs LakeFS:
| Criterion | Nessie | Hive Metastore | AWS Glue | LakeFS |
|---|---|---|---|---|
| Versioning | Git-like (branches, commits) | No | No | Git-like (branches, commits) |
| Self-hosted | Yes | Yes | No (AWS only) | Yes |
| Iceberg support | Native | Via plugin | Yes | Via S3 gateway |
| Complexity | Low | High (Hadoop) | Medium | Low |
| Cost | Free | Free | Pay-per-use | Free (OSS) |
| Dremio integration | Built-in, native | No | No | No |
Why Nessie and not LakeFS? LakeFS is a great tool that versions data at the object storage level (like Git for files in S3). But Nessie works at the catalog level — it versions table metadata, not the files themselves. For Iceberg tables this is more efficient. Plus, Nessie is the default catalog in all official Dremio tutorials. So integration works out of the box, no fuss.
Nessie is what holds our infrastructure together. If the data in Nessie is intact — everything else can be restored.
Dremio: The SQL Engine That Unites Everything
Dremio is the heart of our Lakehouse. It's a SQL engine that connects to all our sources: MinIO via Iceberg/Nessie, PostgreSQL, MSSQL, MongoDB — and lets you run SQL queries against all of it as if it were one database.
Dremio vs Trino vs Apache Spark SQL vs Databricks:
| Criterion | Dremio | Trino (Starburst) | Spark SQL | Databricks |
|---|---|---|---|---|
| UI | Convenient web interface | Minimal | No | Full-featured |
| Iceberg support | Native, excellent | Native (excellent) | Via connector | Native (Delta) |
| Nessie integration | Built-in | Plugin | Plugin | No |
| Self-hosted | Community Edition | Yes | Yes | No (SaaS) |
| Optimization | Reflections, Columnar Cloud Cache | Caching | RDD cache | Photon |
| Simplicity | High | Medium | Low | High |
| Access control | Basic (all admins in CE) | Full | Via Ranger | Full |
90% of our tables in Dremio are views. We deliberately minimize data copying. Instead of duplicating data into separate tables, we create views on top of sources, and for acceleration — we add Reflections (materialized aggregates that Dremio creates and refreshes automatically). This lets us keep physical copies to a minimum: aggregated data is stored as Reflections, non-aggregated data — as views with different representations. Queries against views execute fast thanks to Reflections, and data is always fresh because views read directly from the source.
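The view-versus-copy tradeoff is easy to demonstrate with any SQL engine. Here is a minimal SQLite sketch (standing in for Dremio, with invented table names): the copied table freezes at copy time, while the view always reads the source. What this sketch omits is Reflections, the materialized layer Dremio maintains automatically to keep such views fast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, amount REAL)")
conn.execute("INSERT INTO events VALUES ('2024-01-01', 10.0)")

# Physical copy: a snapshot frozen at creation time
conn.execute("CREATE TABLE events_copy AS SELECT * FROM events")
# View: just a stored query over the source, no data duplicated
conn.execute("CREATE VIEW events_view AS SELECT SUM(amount) AS total FROM events")

# New data arrives in the source
conn.execute("INSERT INTO events VALUES ('2024-01-02', 5.0)")

copy_total = conn.execute("SELECT SUM(amount) FROM events_copy").fetchone()[0]
view_total = conn.execute("SELECT total FROM events_view").fetchone()[0]
print(copy_total, view_total)  # 10.0 15.0 (the copy is stale, the view is fresh)
```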
The main problem with Dremio Community Edition — and this applies to many OSS products — is the lack of proper role management. In Community Edition all users are admins. You can't restrict access to specific datasets or grant read-only permissions. For a small team it's tolerable, but as you grow — it hurts.
For now, access to Dremio in our setup is managed on trust — the team is small, everyone understands where they can and can't go. We've partially solved this problem through our MCP gateway, which I'll describe below — it adds an ACL layer on top of Dremio and controls who has access to what. But for direct connections to Dremio the problem remains — this is one of those OSS compromises: the functionality is there, but access control — build it yourself.
This is, by the way, a typical OSS problem: the free version gives you 80% of functionality, but that critical 20% (security, RBAC, audit log) — is in the paid version. And you either pay, or live with compromises — or build workarounds.
Case: The Dremio Upgrade That Went Wrong
One day we decided to upgrade Dremio to a new version. Dremio runs in Docker Compose for us, so the process was supposed to be simple: change the image tag, run docker compose up -d, done. But something went wrong: after starting, Dremio couldn't see half the datasets. Internal metadata got corrupted during the migration. The logs were a wall of errors with no clear description of the problem; what exactly broke was unclear. We spent time debugging, but eventually decided to wipe everything and spin up a clean installation.
Moment of panic? Yes, for about thirty seconds. Then — "wait, we have Nessie." All our Iceberg tables, their schemas, change history — everything is stored in Nessie, not in Dremio. Dremio is just the query engine. We spun up a clean Dremio, reconnected the Nessie catalog, and all data came back. And since all our views were defined in dbt — one dbt run command, and all views were redeployed. Within an hour everything was working as before.
That was the moment when the architectural decision to separate compute (Dremio) and catalog (Nessie) paid for itself 200%. If we had relied on Dremio's internal catalog — recovery would have taken days, not an hour.
Case: DELETE Without WHERE, or How Nessie Saved Us in 2 Seconds
There's a classic joke about DELETE FROM table without a WHERE clause. Well, for us it wasn't a joke.
One time I ran a DELETE query on an Iceberg table and forgot to add a condition. Just DELETE FROM ... — and that's it. The table was wiped clean instantly. My heart stopped for a second, hands went cold — that table had weeks of collected data.
If this were a regular PostgreSQL or MSSQL — it would have been game over. Backup? Maybe yesterday's. But today's data — gone. Best case — hours of restoring from backup, checking integrity, manually backfilling fresh data.
But we have Nessie. And Nessie is Git for data. Every change is a commit. Every commit has a hash. I simply rolled back the branch to the previous commit — and within 2 seconds all the data was back. As if nothing had happened.
Two. Seconds.
By the way, the exact same thing happened with MSSQL — DELETE without WHERE on a production table. Only there was no Nessie. No fresh backup either. Had to restore everything from scratch: bring up an old backup, compare it with production, manually backfill the data that had accumulated since. That took not seconds, but hours. And enough stress for a month ahead.
That's the moment you understand why time travel and data versioning aren't marketing buzzwords but real insurance against human error. We all make mistakes. The only question is how long recovery takes. With Nessie the answer is — seconds. With MSSQL — hours and grey hair.
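This is not Nessie's actual API, but the commit-pointer mechanics behind that two-second rollback fit in a few lines of Python: every commit is an immutable snapshot with a hash, a branch is just a pointer into the commit log, and rollback merely moves the pointer. Nothing is rewritten, which is why it is nearly instant.

```python
import hashlib

class ToyBranch:
    """A branch is a pointer into an append-only list of table snapshots."""
    def __init__(self):
        self.snapshots = []   # immutable table states
        self.hashes = []      # commit hashes
        self.head = -1

    def commit(self, rows):
        state = tuple(rows)   # freeze the table state
        h = hashlib.sha1(repr(state).encode()).hexdigest()[:8]
        # committing after a rollback discards the later snapshots,
        # like resetting a branch in Git
        self.snapshots = self.snapshots[: self.head + 1] + [state]
        self.hashes = self.hashes[: self.head + 1] + [h]
        self.head = len(self.snapshots) - 1
        return h

    def table(self):
        return list(self.snapshots[self.head]) if self.head >= 0 else []

    def rollback(self, n=1):
        self.head -= n        # just move the pointer; the data is still there

branch = ToyBranch()
branch.commit([("2024-01-01", 10.0), ("2024-01-02", 5.0)])
branch.commit([])             # the accidental DELETE without WHERE
branch.rollback()             # move HEAD back one commit
print(branch.table())         # [('2024-01-01', 10.0), ('2024-01-02', 5.0)]
```

Real Nessie tracks metadata for whole Iceberg tables rather than rows, but the recovery story is the same: the "deleted" data was never gone, only unreferenced.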
Airbyte: Data Ingestion
Airbyte is our tool for pulling data from various sources. Currently it pulls tables from Google Sheets and writes to PostgreSQL. Sometimes we use the no-code connector builder — when you can't be bothered writing an asset in Dagster and need to quickly connect a new source.
Airbyte OSS vs Fivetran vs Meltano vs dlt:
| Criterion | Airbyte OSS | Fivetran | Meltano | dlt |
|---|---|---|---|---|
| License | Open-source | SaaS, paid | Open-source | Open-source |
| Connectors | 300+ | 400+ | 300+ (Singer) | Library |
| UI | Web interface | Web interface | CLI | Code |
| No-code builder | Yes | No | No | No |
| Self-hosted | Yes | No | Yes | Yes (it's a library) |
Airbyte OSS problems we've encountered:
- Resource-hungry. Right now Airbyte eats 12 GB of RAM and does essentially nothing. We're planning to drop it. It struggles with large data volumes — a sync can run for days without throwing a single error. Just quietly chugging along.
- Unstable connectors. Some community connectors break after updates. Google Sheets sync drops from time to time for no apparent reason and needs to be restarted.
- Difficult debugging. When something fails — logs are scattered across a dozen Docker containers. Finding what exactly went wrong sometimes takes longer than the fix itself.
- Updates. Every Airbyte update is a lottery. It might go smoothly, or it might break connectors that worked for months.
But Airbyte is still the best option for us, because Fivetran costs so much that it's cheaper to hire another data engineer.
Dagster: Orchestration
Dagster is our orchestrator. It defines what runs, when, and in what order. Every pipeline is a set of assets that Dagster tracks, runs on schedule or by trigger.
Dagster vs Apache Airflow vs Prefect vs Mage:
| Criterion | Dagster | Airflow | Prefect | Mage |
|---|---|---|---|---|
| Concept | Software-defined assets | DAGs of tasks | Flows & tasks | Blocks & pipelines |
| UI | Modern, convenient | Functional, dated | Cloud-based | Notebook-style |
| Testing | Built-in | Difficult | Medium | Basic |
| Typing | Strong (IO managers, types) | Weak | Medium | Weak |
| dbt integration | Native (dagster-dbt) | Via operators | Via blocks | Basic |
| Complexity | Medium | High | Low | Low |
Why Dagster and not Airflow? Because Airflow is 2015. DAGs, operators, XCom — it all works, but feels like legacy. Dagster is built around the concept of assets — you describe not "what to run" but "what data to produce." It fundamentally changes the approach.
I want to specifically highlight Dagster's dependency injection mechanism — it's really well done and very intuitive. It somehow reminds me of university days and Dagger 2 for Android, where everything was maximally confusing. Here — everything is immediately clear, everything makes sense.
Another Dagster plus — native support for running Airbyte, Airflow, and other tools. So if you need to migrate from another orchestrator or integrate an external tool — it's dead simple, no hacks needed.
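The real Dagster API uses the @asset decorator, but the core idea, declaring what data exists and deriving the execution order from dependencies, fits in a short pure-Python sketch. Everything here is a toy (the asset names are invented, and this decorator is not Dagster's): dependencies are read from parameter names, then materialized in topological order.

```python
import inspect
from graphlib import TopologicalSorter

# Registry of "assets": each function's parameters name the assets it
# depends on, in the spirit of Dagster's @asset (this is NOT the Dagster API).
ASSETS = {}

def asset(fn):
    ASSETS[fn.__name__] = fn
    return fn

@asset
def raw_orders():
    return [("Acme", 99.5), ("Globex", 20.0)]

@asset
def order_totals(raw_orders):
    return sum(amount for _, amount in raw_orders)

def materialize():
    deps = {name: list(inspect.signature(fn).parameters)
            for name, fn in ASSETS.items()}
    results = {}
    # Resolve dependency order automatically and run each asset once
    for name in TopologicalSorter(deps).static_order():
        results[name] = ASSETS[name](*(results[d] for d in deps[name]))
    return results

print(materialize()["order_totals"])  # 119.5
```

Note that nothing above says "run order_totals after raw_orders"; the order falls out of the declared data dependencies, which is the mental shift Dagster asks for.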
dbt: Transformation
dbt handles transformations. SQL models that turn raw data into analytical tables. dbt works with Dremio through an adapter, letting us build models on top of Iceberg tables.
dbt vs Dataform vs SQLMesh:
| Criterion | dbt | Dataform | SQLMesh |
|---|---|---|---|
| Community | Huge | Smaller (Google) | Growing |
| Adapters | 30+ | BigQuery-centric | 10+ |
| Versioning | Git | Git | Git + automatic |
| Testing | Built-in | Built-in | Extended |
| Incremental | Standard | Standard | Advanced |
| Price | Core is free | Free (GCP) | Free |
Important nuance: right now Dagster and dbt are not connected. dbt runs on push via GitLab CI/CD pipeline — you update a SQL model, push, and GitLab runs dbt run against Dremio. It works, and there's no urgent need to integrate dbt into Dagster. But we're planning to — it's the logical next step toward having a single orchestration point.
As of today, our dbt project consists of 445 models, 848 tests, 103 sources, and 652 macros. These aren't just SQL files — it's a full-fledged codebase with versioning, code review, and CI/CD.
A separate bonus: dbt generates documentation that we push to the Dremio wiki with a link to dbt docs.
MCP Gateway: AI Client for Dremio
One of the most interesting things in our stack is an MCP server (Model Context Protocol) connected to Dremio. Essentially, it's a gateway through which AI assistants (Claude, Cursor, any MCP client) can access our Lakehouse — run SQL queries, read table metadata, browse dbt documentation.
But the main feature isn't that. The main feature is access control. As I mentioned, Dremio Community Edition lacks proper role management — all users are essentially admins. The MCP gateway solves this problem: it has its own ACL layer that controls who has access to which datasets. Every request from an AI client goes through a permissions check — and if you don't have access to a certain space or table, you simply won't see it.
In other words, what Dremio itself lacks, we built at the gateway level. This allows us to safely grant data access to different people and AI agents — everyone sees only what they're permitted to.
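Our gateway's actual implementation differs, but the gist of a gateway-level ACL on top of Dremio can be sketched in a few lines. The spaces, principals, and prefix rules below are all hypothetical: each principal gets a list of dataset-path prefixes, and anything that doesn't match is invisible.

```python
# A minimal sketch of a gateway-level ACL, assuming space-prefixed dataset
# paths like "finance.payroll"; all names and rules here are hypothetical.
ACL = {
    "analyst":  ["marketing.", "sales."],   # a prefix grants a whole space
    "ai-agent": ["sales.public_"],          # or just matching datasets
}

def allowed(principal: str, dataset: str) -> bool:
    """Check whether a user or AI agent may see a dataset."""
    return any(dataset.startswith(p) for p in ACL.get(principal, []))

def visible(principal, datasets):
    # The gateway filters listings too: what you can't query, you don't see.
    return [d for d in datasets if allowed(principal, d)]

datasets = ["marketing.spend", "sales.public_deals", "finance.payroll"]
print(visible("analyst", datasets))   # ['marketing.spend', 'sales.public_deals']
print(visible("ai-agent", datasets))  # ['sales.public_deals']
print(visible("unknown", datasets))   # []
```

Because every MCP request passes through this one check, the (missing) Dremio-side permissions never get a chance to matter for AI clients; direct SQL connections to Dremio still bypass it, as noted above.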
Apache Superset: BI for Everyone
We replaced Looker Studio with Apache Superset. It's a BI platform originally created at Airbnb and donated to the Apache Software Foundation. It connects to any SQL database — including Dremio.
Superset vs Metabase vs Looker Studio vs Grafana:
| Criterion | Superset | Metabase | Looker Studio | Grafana |
|---|---|---|---|---|
| License | Apache 2.0 | AGPL / Commercial | Free (Google) | AGPL |
| SQL Lab | Full-featured | Yes, simplified | No | Explore |
| Visualizations | 50+ types | 15+ | 30+ | For metrics |
| Self-hosted | Yes | Yes | No | Yes |
| Customization | High | Medium | Low | High |
| Access control | Full (RBAC) | Basic | Google accounts | Org/Team |
| For analytics | Ideal | Good | Basic | For monitoring |
Superset is tedious to set up — the initial setup is a quest of its own, the UI glitches sometimes, some charts require acrobatics, and the documentation... let's say it exists. But once everything is configured — it's great. For a free product — it's the best BI option we've found.
The OSS Problem: Where "Free" Ends
The biggest myth about open-source — "it's free." Yes, licenses are free. But:
- Setup time. Every tool needs to be deployed, configured, integrated. That's weeks of work.
- No support. When something breaks — you're on your own. GitHub Issues, Stack Overflow, Discord — and pray someone answers.
- Role management. As I mentioned, in Dremio CE everyone's an admin. Airbyte OSS isn't great with permissions either. Superset has proper RBAC, but configuring it is a quest in itself.
- Updates. Every update of every component is a potential breaking change. And nobody guarantees that a new major Airbyte version will be compatible with your config — we felt this firsthand when upgrading to 2.0.
But at the same time — we fully control our infrastructure. No vendor lock-in. No surprise AWS bills. And most importantly — we understand every component of the system, because we set it up ourselves.
Conclusions
Our journey from Google Sheets to Data Lakehouse took about two years. Not because it's hard to do all at once — but because each tool was added as problems and needs arose. It was an organic process with mistakes, rollbacks, and "let's try this instead." But now we have:
- A single point of access to data — Dremio, where all sources are unified
- Data versioning — Nessie, where every change is tracked
- Reliable storage — MinIO + Iceberg, where data is stored in an open format
- Automation — Dagster for orchestration, dbt for transformations via GitLab CI/CD
- BI for everyone — Superset, where managers build dashboards themselves
Is this system perfect? No. There are issues with roles, bugs in OSS products, moments when you just want to pay for SaaS and not think about it. But for our scale and budget — it's the best option we could build.
And the main takeaway: data is not about tools, it's about processes. The coolest stack won't help if people don't understand why it's needed. We started simple — showed how to get answers to questions faster. Everything else followed naturally.