Project Nessie
Project Nessie is an open-source, transactional Data Catalog designed primarily to manage tables in an open Data Lakehouse (supporting formats like Apache Iceberg and Delta Lake).
Git-Like Data Management
Nessie introduces version control concepts to data engineering pipelines, allowing users to interact with their data lakehouse in a manner similar to Git:
- Branches: Users can create branches of their catalog (e.g., a dev branch) to test new ingestions or transformations in isolation. Changes are metadata-only and do not duplicate underlying files.
- Commits: Catalog operations are bundled as atomic commits, ensuring that concurrent readers always see consistent, un-corrupted states of the tables.
- Merges: Once isolation tests succeed, changes can be merged back into the main branch (main) atomically, preventing half-written states from being exposed.
- Tags & Time-Travel: Users can tag specific catalog states (e.g., q4_finance_close) and query exactly what the data looked like at that commit or tag.
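The branch, commit, merge, and tag concepts can be illustrated with a small in-memory model. This is a toy sketch for intuition only, not Nessie's API or implementation: the ToyCatalog class, its method names, and the metadata file paths are all hypothetical, and the merge shown is a simple fast-forward under the assumption that main has not changed since the branch was created.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    """Immutable catalog state: table name -> pointer to a metadata file."""
    tables: Dict[str, str]
    parent: Optional["Commit"] = None


class ToyCatalog:
    """Hypothetical toy model of a Git-like catalog (not Nessie's API)."""

    def __init__(self) -> None:
        self.branches: Dict[str, Commit] = {"main": Commit(tables={})}
        self.tags: Dict[str, Commit] = {}

    def _resolve(self, ref: str) -> Commit:
        # A reference is either a branch or a tag.
        return self.branches[ref] if ref in self.branches else self.tags[ref]

    def create_branch(self, name: str, source: str = "main") -> None:
        # A branch is a new named pointer to an existing commit; nothing is copied.
        self.branches[name] = self._resolve(source)

    def commit(self, branch: str, changes: Dict[str, str]) -> None:
        # All table changes land in one new commit, so readers of the branch
        # see either every change or none of them.
        head = self.branches[branch]
        self.branches[branch] = Commit({**head.tables, **changes}, parent=head)

    def merge(self, source: str, target: str) -> None:
        # Fast-forward only: the target head must be an ancestor of the source.
        node: Optional[Commit] = self.branches[source]
        while node is not None and node is not self.branches[target]:
            node = node.parent
        if node is None:
            raise ValueError("target has diverged; this sketch cannot merge it")
        self.branches[target] = self.branches[source]  # single pointer swap

    def tag(self, name: str, ref: str) -> None:
        # A tag is a fixed pointer to one catalog state.
        self.tags[name] = self._resolve(ref)

    def read(self, ref: str, table: str) -> Optional[str]:
        return self._resolve(ref).tables.get(table)


catalog = ToyCatalog()
catalog.commit("main", {"sales": "sales/v1.metadata.json"})
catalog.tag("q4_finance_close", "main")

# Work in isolation on a dev branch, changing two tables in one commit.
catalog.create_branch("dev")
catalog.commit("dev", {
    "sales": "sales/v2.metadata.json",
    "returns": "returns/v1.metadata.json",
})
before_merge = catalog.read("main", "sales")  # main is unaffected so far

# Publish both table changes to main at once.
catalog.merge("dev", "main")
after_merge = catalog.read("main", "sales")
at_tag = catalog.read("q4_finance_close", "sales")  # time travel via the tag
```

In this model every operation only moves or adds pointers to metadata, which mirrors why branching and tagging are described as metadata-only. A real catalog also has to handle concurrent writers and merges where both branches have changed, which this sketch deliberately leaves out.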
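Because a single commit can span several tables, Nessie provides multi-table transaction guarantees across entire namespaces. And because branches, merges, and tags operate on metadata rather than copying data files, these operations are zero-copy.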
Part of the Data & AI Terms glossary.