Why We Publish Open Data

At AUTOMOTO, we live off data. We aggregate vehicle listings from dozens of sources, normalize them, enrich them, and serve them to users. Every day our pipelines parse thousands of records where the same Toyota can appear as TOYOTA, Toyota, ТОЙОТА, or even ТОУОТА. And that's just the beginning.
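To make the Toyota example concrete, here is a minimal sketch of one way such spellings could be collapsed. It is not our production code. It assumes that a spelling like ТОУОТА is made of Cyrillic letters that merely look Latin, and the `HOMOGLYPHS` table, the `ALIASES` table, and the `normalize_brand` function are all hypothetical names introduced for illustration.

```python
# Hypothetical table: Cyrillic capitals that are visually identical to Latin ones.
# Written as escapes because the two alphabets cannot be told apart on screen.
HOMOGLYPHS = str.maketrans({
    "\u0410": "A", "\u0412": "B", "\u0415": "E", "\u0406": "I",
    "\u041a": "K", "\u041c": "M", "\u041d": "H", "\u041e": "O",
    "\u0420": "P", "\u0421": "C", "\u0422": "T", "\u0423": "Y",
    "\u0425": "X",
})

# Hypothetical alias table for spellings that folding cannot resolve,
# such as a genuine Cyrillic transliteration.
ALIASES = {
    "\u0422\u041e\u0419\u041e\u0422\u0410": "TOYOTA",  # the transliterated spelling
}


def normalize_brand(raw: str) -> str:
    """Return a canonical brand name for a raw registry value."""
    # Collapse whitespace and case first, so "Toyota " and "TOYOTA" agree.
    key = " ".join(raw.split()).upper()
    if key in ALIASES:
        return ALIASES[key]
    # Fold lookalike letters, but accept the result only if it is fully Latin;
    # otherwise a real Cyrillic name would be turned into a mixed-script mess.
    folded = key.translate(HOMOGLYPHS)
    if folded.isascii():
        return ALIASES.get(folded, folded)
    return key
```

Misspellings such as MERSEDES-BENZ would need their own entries in the alias table; folding only helps with lookalike letters.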

What's Wrong with Government Data

Ukraine's Ministry of Internal Affairs publishes a vehicle registry on data.gov.ua — 146 CSV files spanning 2013–2026, roughly 46.8 GB of raw data. Sounds great. In practice, it's a catalog of pain:

  • Encoding — files mix UTF-8 and Windows-1251. Ukrainian characters є, і, ї turn into mojibake.
  • Duplicates — data is published as cumulative snapshots. In 2023, out of ~32.3M raw rows only ~3.7M are unique — 88% duplication.
  • Column names — change between years. MAKE_YEAR, VYP, rik_vypusku are all the same field.
  • Delimiters — mostly ;, but some files use ,. File extensions sometimes use Cyrillic с instead of Latin c.
  • KOATUU codes — leading zeros stripped (classic Excel artifact), 100+ codes not found in any public registry.
  • Brands — MERCEDES-BENZ, MERSEDES-BENZ, МЕРСЕДЕС БЕНЦ, МЕРСЕДЕС-БЕНЗ are all the same manufacturer.
  • Placeholders — instead of null: "невизначено", a space, a dash, zero, the string "NULL".
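Several of the problems above can be handled at read time. The sketch below is illustrative rather than our actual pipeline. It assumes each file uses a single encoding throughout, guesses the delimiter from the header line, maps the three header spellings mentioned above to one name, and turns placeholder strings into `None`. The names `COLUMN_ALIASES`, `PLACEHOLDERS`, and `read_rows` are made up for this example, and zero is deliberately left out of the placeholder set because it can be a legitimate value in some columns.

```python
import csv
import io
from typing import Optional

# Hypothetical alias map: known header spellings -> one canonical column name.
COLUMN_ALIASES = {"MAKE_YEAR": "make_year", "VYP": "make_year", "rik_vypusku": "make_year"}

# Placeholder values, compared after strip() and casefold().
PLACEHOLDERS = {"", "-", "null", "невизначено"}


def decode_bytes(raw: bytes) -> str:
    """Try UTF-8 first (dropping a BOM if present), then fall back to Windows-1251."""
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return raw.decode("cp1251")


def detect_delimiter(header_line: str) -> str:
    """Pick whichever of ';' and ',' occurs more often in the header."""
    return ";" if header_line.count(";") >= header_line.count(",") else ","


def clean_value(value: Optional[str]) -> Optional[str]:
    """Strip whitespace and map placeholder strings to None."""
    if value is None:  # DictReader fills short rows with None
        return None
    stripped = value.strip()
    return None if stripped.casefold() in PLACEHOLDERS else stripped


def read_rows(raw: bytes) -> list:
    """Decode one CSV file and return its rows as dicts with canonical column names."""
    text = decode_bytes(raw)
    header_line = text.split("\n", 1)[0]
    reader = csv.DictReader(io.StringIO(text, newline=""),
                            delimiter=detect_delimiter(header_line))
    rows = []
    for record in reader:
        rows.append({
            COLUMN_ALIASES.get(name.strip(), name.strip()): clean_value(value)
            for name, value in record.items()
            if name is not None  # surplus fields in overlong rows are dropped
        })
    return rows
```

Trying UTF-8 first works because Cyrillic text saved as Windows-1251 is almost never valid UTF-8, so the failed decode is a cheap and reliable signal.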

That's only 7 of 12 documented problems. Every month we spent resources fighting the same issues.

Why We Published This

Not altruism. Pragmatism.

We had already cleaned this data for ourselves. We built a pipeline: encoding detection, column mapping, SHA-256 deduplication, brand normalization. The result is ~24M unique records, compressed from 46.8 GB of CSV down to ~907 MB of Parquet.
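As a rough illustration of the deduplication step, here is a minimal sketch of hash-based deduplication. It assumes the fingerprint covers every field of an already normalized row; which fields the real pipeline hashes is not specified here, and `row_fingerprint` and `deduplicate` are hypothetical names.

```python
import hashlib
import json


def row_fingerprint(row: dict) -> str:
    """SHA-256 of a canonical JSON form of the row (keys sorted, so column order is irrelevant)."""
    canonical = json.dumps(row, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(rows):
    """Yield each distinct row once, keeping the first occurrence."""
    seen = set()
    for row in rows:
        digest = row_fingerprint(row)
        if digest not in seen:
            seen.add(digest)
            yield row
```

Because the snapshots are cumulative, the same row shows up in many files, and keeping only a set of digests means the full rows never have to be held in memory at once. Serializing through JSON also keeps a missing value (`null`) distinct from an empty string.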

Same story with administrative codes. KOATUU (legacy classifier, ~48,700 records) and KATOTTG (new system, ~31,993 records) — needed for any geolocation work in Ukraine. We compiled both, including inactive codes not found in any other public source.

Keeping it internal means that everyone else who wants to work with this data goes through the same pain. Researchers, journalists, other companies — all stepping on the same rakes. And when someone finds a bug, we want to be the first to know.

The better the ecosystem-level data quality, the cheaper it is for us.

And there's another thing. Every pull request from someone outside is a free quality boost. Someone finds a brand mapping error, someone spots a missing KOATUU code, someone suggests better weight parsing logic — we get improvements without spending research time. Open-sourcing turns data consumers into contributors.

What We Published

UA Vehicle Registry — Data Quality Edition — cleaned MIA vehicle registry. ~24M unique records, 27 public columns (out of 80+ in the full pipeline), Apache Parquet + CSV. Monthly and yearly releases from 2013 to 2026. Each release includes a DQ report. License: CC BY 4.0.

UA Administrative Codes — KOATUU (~48,700 records) and KATOTTG (~31,993 records). Includes active and inactive codes. Apache Parquet + CSV. License: CC BY 4.0.

The two datasets are linked: the vehicle registry uses KOATUU codes for geolocation, and the admin codes repo is the only source where those 100+ "missing" codes can be found.
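Linking the two datasets could look roughly like the sketch below. It assumes KOATUU codes are 10 digits long and that a stripped code has lost exactly one leading zero, so only 9- and 10-digit values are accepted. The column names `koatuu` and `place`, the sample code, and the functions `restore_koatuu` and `link_rows` are all hypothetical and do not reflect the published schema.

```python
KOATUU_LENGTH = 10  # assumption: a full KOATUU code has 10 digits


def restore_koatuu(raw):
    """Return a zero-padded 10-digit code, or None if the value does not look like one."""
    if raw is None:
        return None
    digits = str(raw).strip()
    if not (digits.isascii() and digits.isdigit()):
        return None
    # Assumption: at most one leading zero can have been stripped.
    if len(digits) not in (KOATUU_LENGTH - 1, KOATUU_LENGTH):
        return None
    return digits.zfill(KOATUU_LENGTH)


def link_rows(vehicle_rows, koatuu_names):
    """Split rows into those whose code is found in the lookup and those that are not."""
    matched, unmatched = [], []
    for row in vehicle_rows:
        code = restore_koatuu(row.get("koatuu"))
        if code is not None and code in koatuu_names:
            matched.append({**row, "koatuu": code, "place": koatuu_names[code]})
        else:
            unmatched.append(row)
    return matched, unmatched


# Usage with made-up sample data:
names = {"0510100000": "Example settlement"}
matched, unmatched = link_rows(
    [{"koatuu": "510100000"}, {"koatuu": "9999999999"}], names
)
```

Keeping the unmatched rows in a separate list rather than dropping them makes it easy to see which codes still need reconciliation.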

What's Next

Data updates monthly alongside new MIA publications. We continue improving brand/model normalization, KOATUU code reconciliation, and automated DQ checks.

If you work with Ukrainian data — try it, open issues, suggest fixes. The more eyes on this data, the better it gets for everyone.
