Data Version Control

2022/03/14


Data Versioning

Version control is an important component of any Machine Learning project. It increases the speed of development and reduces mistakes.

  • Increase the speed of development
    • Reproducibility
    • Data Sharing
    • Change Comparison
  • Reduce errors
    • Data Provenance - How the data was derived
    • Reverting to the correct version when you accidentally change something
    • Debugging

Tools

There are several ways to implement data versioning, ranging from full duplication, to space-efficient approaches, to more advanced tools (see Best 7 Data Version Control Tools - Neptune Blog, Comparing Data Version Control Tools - DagsHub, and Top 14 Data Versioning Tools - StartupStash).
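
To make the difference between full duplication and a space-efficient approach concrete, here is a minimal sketch of a content-addressed store: each version is saved under the hash of its bytes, so identical content is stored only once. The function names (store, load, count_objects) and the use of SHA-256 are assumptions made for illustration. Tools such as DVC follow a similar idea, but this is not their implementation.

```python
import hashlib
import tempfile
from pathlib import Path


def store(cache_dir: Path, data: bytes) -> str:
    """Save data under its content hash and return the hash (hypothetical helper)."""
    digest = hashlib.sha256(data).hexdigest()
    # Use the first two hex characters as a directory, the rest as the file name.
    target = cache_dir / digest[:2] / digest[2:]
    if not target.exists():  # identical content is never written twice
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
    return digest


def load(cache_dir: Path, digest: str) -> bytes:
    """Read back the bytes that were stored under the given hash."""
    return (cache_dir / digest[:2] / digest[2:]).read_bytes()


def count_objects(cache_dir: Path) -> int:
    """Count how many distinct objects the store currently holds."""
    return sum(1 for p in cache_dir.rglob("*") if p.is_file())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp)
        v1 = store(cache, b"id,label\n1,cat\n")
        v2 = store(cache, b"id,label\n1,cat\n2,dog\n")
        v3 = store(cache, b"id,label\n1,cat\n")  # same bytes as v1
        # Three "versions" were saved, but only two objects exist on disk.
        print(v1 == v3, count_objects(cache))
```

With full duplication, the three saves above would take three copies. Here the third save reuses the first object, and a version is identified by a short hash that can be committed to Git in place of the data itself.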

Example data tools comparison from DagsHub

One of these tools is DVC, which is open-source, lightweight, storage-agnostic, and Git-compatible.


DVC

What is DVC?

As mentioned on its website, DVC is an open-source version control system for Machine Learning projects. It is Git-compatible, storage-agnostic, and lightweight, among other features.

When to use DVC?

Refer to DVC Use Cases

How?

Refer to the DVC Documentation, which is very clear and easy to follow.

Personally, I split the How section into 3 main parts.

  • Data Version Control - The concepts involved in dealing with data, versioning data with DVC, sharing data across people or projects, and Data Registries
  • Machine Learning Pipeline - Connect ML steps with dvc stage, reproduce the pipeline with dvc repro, compare versions with dvc params or dvc metrics, and visualize (plot) the time-series data with dvc plots
  • Experimentation - With DVC 2.0, we can easily track, compare, check out, apply, and share experiments using dvc exp
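
The idea behind reproducing a pipeline, from the second part above, can be sketched in a few lines: a stage is re-run only when the hashes of its dependencies or its parameters differ from what was recorded on the last run. Everything below (run_stage, file_hash, the JSON lock file name) is hypothetical and written only for illustration. DVC keeps comparable information in its own dvc.lock file, but this is not DVC's code.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the SHA-256 hash of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_stage(name, deps, params, action, lock_path: Path) -> bool:
    """Run action only if deps or params changed; return True if it ran."""
    lock = json.loads(lock_path.read_text()) if lock_path.exists() else {}
    state = {
        "deps": {str(d): file_hash(d) for d in deps},
        "params": params,
    }
    if lock.get(name) == state:
        return False  # nothing changed, so the stage is skipped
    action()
    lock[name] = state  # record what this run was based on
    lock_path.write_text(json.dumps(lock, indent=2))
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "data.csv"
        data.write_text("x,y\n1,2\n")
        lock_file = root / "pipeline.lock.json"

        def train():
            print("training...")

        print(run_stage("train", [data], {"lr": 0.1}, train, lock_file))   # first run
        print(run_stage("train", [data], {"lr": 0.1}, train, lock_file))   # unchanged
        print(run_stage("train", [data], {"lr": 0.01}, train, lock_file))  # new param
```

The first and third calls run the stage, while the second is skipped because neither the data nor the parameters changed. Recording parameters next to data hashes is also what makes it possible to compare one run with another later on.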

1. Data Version Control

Commands

2. Machine Learning Pipeline

Commands

3. Experimentation

Commands

Interesting Resources