Data Version Control

2022/03/14

Data Versioning

Version control is an important component of working with Machine Learning. It increases the speed of development and reduces mistakes.

Increase the speed of development
- Reproducibility
- Data sharing
- Change comparison

Reduce errors
- Data provenance: how the data was derived
- Reverting to the correct version when you accidentally change something
- Debugging

Tools

There are several ways to implement data versioning, ranging from full duplication, to space-efficient approaches, to more advanced tools (see Best 7 Data Version Control Tools from the Neptune Blog, Comparing Data Version Control Tools from DagsHub, and Top 14 Data Versioning Tools from StartupStash).
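To make the difference between full duplication and a space-efficient approach concrete, here is a minimal sketch of content-addressed storage: each version is stored under the hash of its content, so saving identical data twice costs no extra space. The function names `store_version` and `load_version` and the sample CSV bytes are my own illustration, not the implementation of any tool listed above.

```python
import hashlib
import tempfile
from pathlib import Path


def store_version(data: bytes, cache_dir: Path) -> str:
    """Store data under its SHA-256 digest and return the digest as a version id."""
    digest = hashlib.sha256(data).hexdigest()
    target = cache_dir / digest
    if not target.exists():  # identical content is written only once
        target.write_bytes(data)
    return digest


def load_version(digest: str, cache_dir: Path) -> bytes:
    """Read back the exact bytes recorded for a version id."""
    return (cache_dir / digest).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    v1 = store_version(b"id,label\n1,cat\n", cache)
    v2 = store_version(b"id,label\n1,cat\n2,dog\n", cache)
    v3 = store_version(b"id,label\n1,cat\n", cache)  # same content as v1

    stored_files = len(list(cache.iterdir()))
    restored = load_version(v1, cache)

print(v1 == v3, stored_files)
```

With full duplication the three saves would produce three copies. Here only two files exist, and a small text pointer (the digest) is all that needs to live in Git.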

Example data tools comparison from DagsHub

One of them is DVC, which is open-source, lightweight, storage-agnostic, and Git-compatible.

DVC

What is DVC?

As mentioned on its website, DVC is an open-source version control system for machine learning projects that is Git-compatible, storage-agnostic, and lightweight, among other features.

When to use DVC?

Refer to DVC Use Cases

How?

Refer to the DVC documentation, which is very clear and easy to follow.

Personally, I'll split the How section into 3 main parts.

- Data Version Control: the concepts involved when dealing with data, versioning data with DVC, sharing data across people or projects, and data registries
- Machine Learning Pipeline: connect ML steps with dvc stage, reproduce the pipeline with dvc repro, compare versions with dvc params or dvc metrics, and visualize (plot) time-series data with dvc plots
- Experimentation: with DVC 2.0, we can easily track, compare, check out, apply, and share experiments using dvc exp

1. Data Version Control

Commands
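As a rough sketch of a typical data versioning session, the snippet below lists the commands in order and prints them as shell lines. It only prints, it does not run anything. All paths and names (`data/raw.csv`, the remote called `storage`, `/tmp/dvc-storage`, and the Git tag `v1.0`) are hypothetical placeholders, so adapt them to your own project.

```python
import shlex

# All paths and names below are hypothetical placeholders.
DATA_VERSIONING_STEPS = [
    ["dvc", "init"],  # run once inside an existing Git repository
    ["dvc", "add", "data/raw.csv"],  # writes data/raw.csv.dvc and data/.gitignore
    ["git", "add", "data/raw.csv.dvc", "data/.gitignore"],
    ["git", "commit", "-m", "Track raw data with DVC"],
    ["dvc", "remote", "add", "-d", "storage", "/tmp/dvc-storage"],
    ["dvc", "push"],  # upload cached data to the default remote
    ["dvc", "pull"],  # download the data referenced by the .dvc files
    ["git", "checkout", "v1.0", "data/raw.csv.dvc"],  # assumes a tag v1.0 exists
    ["dvc", "checkout"],  # sync the workspace data with that .dvc file
]


def render(steps):
    """Turn argument lists into shell-safe command lines."""
    return [shlex.join(step) for step in steps]


rendered = render(DATA_VERSIONING_STEPS)
print("\n".join(rendered))
```

The last two steps are the "revert" case: Git restores the small pointer file, and `dvc checkout` restores the matching data.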

2. Machine Learning Pipeline

Commands
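The pipeline commands can be sketched the same way, grouped by purpose. This assumes a hypothetical project with `src/prepare.py`, `src/train.py`, `data/raw.csv`, a `params.yaml` containing a `train.epochs` key, and a training script that writes `model.pkl`, `metrics.json`, and `logs.csv`. None of these names come from the DVC docs, and the snippet only prints the commands.

```python
import shlex

# Hypothetical project layout; every file name and stage name is a placeholder.
PIPELINE_STEPS = {
    "define stages": [
        ["dvc", "stage", "add", "-n", "prepare",
         "-d", "src/prepare.py", "-d", "data/raw.csv",
         "-o", "data/prepared",
         "python", "src/prepare.py"],
        ["dvc", "stage", "add", "-n", "train",
         "-p", "train.epochs",
         "-d", "src/train.py", "-d", "data/prepared",
         "-o", "model.pkl", "-M", "metrics.json",
         "--plots-no-cache", "logs.csv",
         "python", "src/train.py"],
    ],
    "reproduce": [["dvc", "repro"]],
    "compare": [
        ["dvc", "params", "diff"],
        ["dvc", "metrics", "show"],
        ["dvc", "metrics", "diff"],
    ],
    "plot": [["dvc", "plots", "show", "logs.csv"]],
}

pipeline_lines = []
for purpose, commands in PIPELINE_STEPS.items():
    pipeline_lines.append(f"# {purpose}")  # comment line naming the group
    pipeline_lines.extend(shlex.join(cmd) for cmd in commands)

print("\n".join(pipeline_lines))
```

Each `dvc stage add` call records a stage in dvc.yaml. Because the stages declare their dependencies and outputs, `dvc repro` can rerun only the stages affected by a change.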

3. Experimentation

Commands

dvc exp
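A possible experiment workflow with `dvc exp`, again printed rather than executed. The experiment name `more-epochs`, the branch `more-epochs-branch`, the Git remote `origin`, and the parameter `train.epochs` are assumptions for illustration only.

```python
import shlex

# Experiment, branch, remote, and parameter names are hypothetical placeholders.
EXPERIMENT_STEPS = [
    ["dvc", "exp", "run", "-n", "more-epochs", "-S", "train.epochs=20"],  # run with an overridden param
    ["dvc", "exp", "show"],  # table of experiments with params and metrics
    ["dvc", "exp", "diff"],  # compare params and metrics
    ["dvc", "exp", "apply", "more-epochs"],  # bring the results into the workspace
    ["dvc", "exp", "branch", "more-epochs", "more-epochs-branch"],  # promote to a Git branch
    ["dvc", "exp", "push", "origin", "more-epochs"],  # share through a Git remote
    ["dvc", "exp", "pull", "origin", "more-epochs"],  # fetch a shared experiment
]

experiment_lines = [shlex.join(step) for step in EXPERIMENT_STEPS]
print("\n".join(experiment_lines))
```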

Interesting Resources

Data Versioning and Reproducible ML with DVC and MLflow shows how to use both tools together, so we can track data with DVC and track experiments with MLflow.
Reproducible Machine Learning & Experiment Tracking Pipeline with Python and DVC gives a step-by-step walkthrough of a DVC pipeline using a Kaggle dataset (note: it starts around 28:20).
Experience report: Data Version Control (DVC) for Machine Learning Projects shares the pros and cons of using DVC in their workflow. Interestingly, they also created their own tools (e.g., paired with pre-commit) and use DVC to version entire Jupyter notebooks.

lukkiddd. 2022, powered by Jekyll Garden

