Data Versioning
Version control is an important component of working with Machine Learning. Versioning the data also increases the speed of development and reduces mistakes, among other benefits:
- Increase the speed of development
- Reproducibility
- Data Sharing
- Change Comparison
- Reduce errors
- Data Provenance - how the data was derived
- Revert to the correct version when you accidentally change something
- Debugging
Tools
There are several ways to implement data versioning, ranging from full duplication, to space-efficient approaches, to more advanced tools (see Best 7 Data Version Control Tools - Neptune Blog, Comparing Data Version Control Tools - DagsHub, and Top 14 Data Versioning Tools - StartupStash).
Example data tools comparison from DagsHub
One of these tools is DVC, which is open-source, lightweight, storage-agnostic, and Git-compatible.
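To make the difference between full duplication and a space-efficient approach more concrete, here is a minimal sketch in Python. It stores each version of a file under the hash of its content, so an unchanged file is never copied twice. This is only an illustration of the general idea, not how DVC or any other tool is implemented. The function names, the cache layout, the file name data.csv, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(data_file: Path, cache_dir: Path) -> str:
    """Store the current content of data_file in cache_dir, keyed by its hash.

    Identical content always maps to the same key, so it is stored only once.
    """
    key = file_hash(data_file)
    target = cache_dir / key
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data_file.read_bytes())
    return key


def restore(key: str, cache_dir: Path, data_file: Path) -> None:
    """Overwrite data_file with the version identified by key."""
    data_file.write_bytes((cache_dir / key).read_bytes())


def demo() -> dict:
    """Version a small hypothetical CSV file twice, then revert to the first version."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data, cache = root / "data.csv", root / "cache"

        data.write_text("id,label\n1,cat\n")
        v1 = snapshot(data, cache)
        v1_again = snapshot(data, cache)  # unchanged content: no new copy

        data.write_text("id,label\n1,cat\n2,dog\n")
        v2 = snapshot(data, cache)

        restore(v1, cache, data)  # revert after an unwanted change
        return {
            "v1": v1,
            "v1_again": v1_again,
            "v2": v2,
            "stored_copies": len(list(cache.iterdir())),
            "restored": data.read_text(),
        }
```

Calling demo() takes three snapshots but keeps only two copies in the cache. With full duplication, every snapshot would have produced its own copy. The short hash keys are also small enough to be committed to Git in place of the data itself.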
DVC
What is DVC?
As mentioned on its website, DVC is an open-source version control system for Machine Learning projects that is Git-compatible, storage-agnostic, and lightweight, among other features.
When to use DVC?
Refer to DVC Use Cases
How?
Refer to the DVC Documentation, which is very clear and easy to follow.
Personally, I'll split the How section into 3 main parts.
- Data Version Control - the concepts involved when dealing with data, versioning data with DVC, sharing data across people or projects, and Data Registries
- Machine Learning Pipeline - connect ML steps with dvc stage, reproduce the pipeline with dvc repro, compare versions with dvc params or dvc metrics, and visualize (plot) time-series data with dvc plots
- Experimentation - with DVC 2.0, we can easily track, compare, check out, apply, and share experiments using dvc exp
1. Data Version Control
Commands
dvc init
dvc add
dvc remove
dvc commit
dvc checkout
dvc diff
dvc remote
dvc fetch
dvc push
dvc pull
dvc status
dvc list
dvc get
dvc get-url
dvc import
dvc import-url
dvc update
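As a rough illustration of how some of these commands fit together, the sketch below lists a typical first round of data versioning as a small Python wrapper around the CLI. It assumes an existing Git repository. The data path data/data.csv, the remote name storage, the remote URL, and the commit messages are hypothetical, and the helper functions are not part of DVC. By default the commands are only printed, not executed.

```python
import shlex
import shutil
import subprocess
from pathlib import PurePosixPath
from typing import List


def track_and_push(data_path: str, remote_url: str) -> List[List[str]]:
    """Commands to start tracking a file with DVC and upload it to a default remote."""
    gitignore = str(PurePosixPath(data_path).parent / ".gitignore")
    return [
        ["dvc", "init"],
        ["git", "commit", "-m", "Initialize DVC"],
        # dvc add writes <data_path>.dvc and adds the data file to a .gitignore
        ["dvc", "add", data_path],
        ["git", "add", f"{data_path}.dvc", gitignore],
        ["git", "commit", "-m", "Track data with DVC"],
        ["dvc", "remote", "add", "-d", "storage", remote_url],
        ["git", "commit", ".dvc/config", "-m", "Configure remote storage"],
        ["dvc", "push"],
    ]


def switch_version(data_path: str, revision: str) -> List[List[str]]:
    """Commands to bring the workspace data back to an earlier Git revision."""
    return [
        ["git", "checkout", revision, f"{data_path}.dvc"],
        ["dvc", "checkout"],
    ]


def run(commands: List[List[str]], execute: bool = False) -> None:
    """Print each command; with execute=True, also run it (fails if the tool is missing)."""
    for argv in commands:
        print("$ " + shlex.join(argv))
        if execute:
            if shutil.which(argv[0]) is None:
                raise RuntimeError(f"{argv[0]} is not installed")
            subprocess.run(argv, check=True)
```

For example, run(track_and_push("data/data.csv", "/tmp/dvc-storage")) prints the sequence as a dry run, and run(switch_version("data/data.csv", "HEAD~1")) prints the two steps for going back one version. Git only stores the small .dvc file, while the data itself lives in the DVC cache and the remote.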
Reference or Related
- Get Started with Data and Model Versioning
- Get Started with Data and Model Access
- Set up a Google Drive DVC Remote
- External Dependencies
- .dvc Files
- .dvcignore Files
2. Machine Learning Pipeline
Commands
dvc stage
dvc repro
dvc params
dvc metrics
dvc plots
Reference or Related
- Get Started with Data Pipelines
- Get Started with Metrics, Parameters, and Plots
- Pipeline Files (dvc.yaml)
- How to Add Dependencies or Outputs
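The idea behind reproducing a pipeline can be sketched in a few lines of Python: each stage declares its dependencies and outputs, and a stage only needs to run again when the content of its dependencies has changed since the last recorded run. The sketch below is a conceptual illustration only, not DVC's implementation. The PIPELINE dictionary loosely mirrors the stages, cmd, deps, and outs keys of a dvc.yaml file, while the stage names, scripts, file names, and helper functions are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List, Optional

# Loosely mirrors the shape of a dvc.yaml file: stage name -> cmd, deps, outs.
# Stage names, scripts, and file names are hypothetical.
PIPELINE = {
    "prepare": {"cmd": "python prepare.py", "deps": ["raw.csv"], "outs": ["clean.csv"]},
    "train": {"cmd": "python train.py", "deps": ["clean.csv"], "outs": ["model.pkl"]},
}


def fingerprint(root: Path, deps: List[str]) -> Dict[str, Optional[str]]:
    """Hash each dependency file; a missing file is recorded as None."""
    result: Dict[str, Optional[str]] = {}
    for dep in deps:
        path = root / dep
        result[dep] = hashlib.sha256(path.read_bytes()).hexdigest() if path.exists() else None
    return result


def stale_stages(root: Path, lock: dict) -> List[str]:
    """Return the stages whose dependency hashes differ from the recorded ones."""
    return [
        name
        for name, stage in PIPELINE.items()
        if fingerprint(root, stage["deps"]) != lock.get(name)
    ]


def demo() -> tuple:
    """Check which stages are stale before a run, after a run, and after a data change."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "raw.csv").write_text("a\n1\n")
        (root / "clean.csv").write_text("a\n1\n")

        lock: dict = {}
        before_first_run = stale_stages(root, lock)  # nothing recorded yet

        # Pretend every stage ran, and record the hashes of its dependencies.
        for name, stage in PIPELINE.items():
            lock[name] = fingerprint(root, stage["deps"])
        after_run = stale_stages(root, lock)  # everything is up to date

        (root / "raw.csv").write_text("a\n1\n2\n")
        after_change = stale_stages(root, lock)  # only "prepare" reads raw.csv
        return before_first_run, after_run, after_change
```

In this simplified version, "train" becomes stale only once "prepare" has actually rerun and produced a different clean.csv. A real tool walks the stages in dependency order and re-checks each one as it goes.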
3. Experimentation
Commands
dvc exp
Reference or Related
Interesting Resources
- Data Versioning and Reproducible ML with DVC and MLflow shares how to use both tools together, so we can track data with DVC and track experiments with MLflow
- Reproducible Machine Learning & Experiment Tracking Pipeline with Python and DVC shares a step-by-step walkthrough of a DVC pipeline using a Kaggle dataset (note: it starts around 28:20)
- Experience report: Data Version Control (DVC) for Machine Learning Projects shares the pros and cons of using DVC in their workflow. Interestingly, they also created their own tools (e.g., paired with pre-commit) and use DVC to version entire Jupyter Notebooks.