DVX - Minimal Data Version Control

DVX is a lightweight wrapper around DVC that provides core data versioning with several enhancements:

  • Parallel pipeline execution with per-file provenance tracking
  • Decentralized workflow definitions - each .dvc file contains its computation, deps, and outputs
  • Enhanced diff with preprocessing pipelines and directory support
  • Cache introspection commands for examining cached data
  • Performance optimizations for large repos (batched git lookups, mtime caching)

Why DVX?

Decentralized Pipelines

DVC stores all pipeline stages in a single dvc.yaml file. DVX takes a different approach: each .dvc file contains its own computation metadata:

# output.parquet.dvc
outs:
  - md5: abc123...
    size: 1048576
    hash: md5
    path: output.parquet
meta:
  computation:
    cmd: python process.py input.parquet output.parquet
    code_ref: main:a1b2c3d
    deps:
      - path: input.parquet
        md5: def456...

This means:

  • Parallel execution: Independent artifacts run concurrently
  • Self-contained provenance: Each output knows exactly how it was created
  • Git-friendly: Changes to one artifact don't touch other files
  • No lock contention: Multiple processes can add artifacts simultaneously
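
For instance, two outputs that share a dependency but not each other can be tracked independently and rebuilt concurrently. The file and script names below are illustrative; the flags are the documented dvx add/run options:

# Record two outputs that share a dep but not each other
dvx add a.parquet --dep raw.csv --cmd "python make_a.py raw.csv a.parquet"
dvx add b.parquet --dep raw.csv --cmd "python make_b.py raw.csv b.parquet"

# If raw.csv changes, both become stale; since neither depends on the
# other, dvx run can rebuild them in parallel
dvx run -j 2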

Enhanced Diff

Diff DVC-tracked files between commits, optionally piping through preprocessing commands first. This is especially useful for binary formats like Parquet, gzipped files, or any format that benefits from transformation before diffing.

# Content diff of a file (HEAD vs worktree)
dvx diff data.csv

# Diff between commits
dvx diff -r HEAD^..HEAD data.csv

# Diff specific commit vs its parent
dvx diff -R abc123 data.csv

# Summary mode: show changed files with hashes
dvx diff -s
dvx diff -s -r HEAD~5..HEAD

Preprocessing Pipelines

The real power is piping files through commands before diffing:

# Compare line counts
dvx diff wc -l data.csv

# Compare Parquet schema (using parquet2json)
dvx diff parquet2json {} schema data.parquet

# Compare first row as pretty JSON
dvx diff 'parquet2json {} cat -l 1 | jq .' data.parquet

# Decompress and compare headers of gzipped CSVs
dvx diff 'gunzip -c {} | head -n1' data.csv.gz

# Compare sorted, deduplicated content
dvx diff 'sort {} | uniq' data.txt

Directory Diffs

When diffing DVC-tracked directories, DVX shows which files changed with their hashes:

$ dvx diff -R abc123 data/
test.parquet: c07bba3f... -> f46dd86f...
test.txt: e20b902b... -> 9306ec07...

Cache Introspection

Examine cached data without checkout:

# Get cache path for a tracked file
dvx cache path data.parquet

# Get MD5 hash
dvx cache md5 data.parquet

# View cached file contents directly
dvx cat data.csv

# Works with files inside DVC-tracked directories
dvx cat data_dir/subset.parquet

Installation

pip install dvx

# With S3 support
pip install dvx[s3]

# With all remote backends
pip install dvx[all]

Usage

Running Pipelines

# Run all .dvc computations (parallel by default)
dvx run

# Run specific target
dvx run output.dvc

# Use 4 parallel workers
dvx run -j 4

# Dry-run to see execution plan
dvx run --dry-run

# Force re-run (ignore freshness)
dvx run --force

Tracking Data

# Initialize
dvx init

# Track files (parallel-safe, lock-free)
dvx add data/
dvx add model.pkl

# Track with dependencies (for provenance)
dvx add output.parquet --dep input.parquet --cmd "python process.py"

# Auto-add stale deps first (recursive)
dvx add -r output.parquet

Status and Diff

# Check freshness (data vs deps)
dvx status
dvx status -v          # also show fresh files
dvx status --yaml      # detailed YAML output with hashes
dvx status -j4 data/   # parallel checking

# Content diff
dvx diff data.parquet
dvx diff -r HEAD^..HEAD results/
dvx diff -s            # summary mode (files + hashes)

Push/Pull

# Configure remote
dvx remote add -d myremote s3://mybucket/dvc

# Push to remote
dvx push
dvx push --dry-run     # see what would be pushed

# Pull from remote
dvx pull
dvx pull --dry-run     # see what would be pulled

# Ref-specific operations
dvx pull -r HEAD~3     # pull data as of 3 commits ago

Python API

from dvx import Repo

# Initialize
repo = Repo.init()

# Or open existing
with Repo() as repo:
    repo.add("data/")
    repo.push()

    status = repo.status()
    diff = repo.diff("HEAD~1")

Commands

Command            Description
run                Execute computations from .dvc files (parallel)
add                Track file(s) with optional provenance
status             Show freshness of tracked files (data & deps)
diff               Content diff with preprocessing support
cache              Inspect cache (path, md5, dir)
cat                View cached file contents
push               Upload data to remote storage
pull               Download data from remote storage
fetch              Download to cache (no checkout)
checkout           Restore data files from cache
gc                 Garbage collect unused cache
init               Initialize a DVX repository
remote             Manage remotes
config             Configure settings
remove             Stop tracking file(s)
move               Move tracked file(s)
import             Import from another DVC repo
import-url         Import from a URL
get                Download without tracking
get-url            Download URL without tracking
shell-integration  Output shell aliases

What's Different from DVC

Added in DVX

  • dvx run - Parallel pipeline execution with per-file provenance
  • dvx cache path/md5 - Cache introspection
  • dvx cat - View cached files directly
  • dvx diff preprocessing - Pipe through commands before diffing
  • dvx status --yaml - Detailed status with hashes
  • Lock-free parallel add operations
  • Git blob batching for faster status checks

Not Included

DVX intentionally excludes DVC's heavier features:

  • Central pipelines (dvc.yaml, dvc repro, dvc dag)
  • Experiments (dvc exp, experiment tracking)
  • Metrics/Params/Plots (dvc metrics, dvc params, dvc plots)

If you need these features, use DVC directly.

Freshness Model

DVX tracks two types of freshness for each artifact:

  1. Data freshness: Does the actual data match the hash in the .dvc file?
  2. Dep freshness: Do recorded dependency hashes match the deps' .dvc files?

For example:

$ dvx status s3/output/
✗ s3/output/result.parquet.dvc (data changed (abc123... vs def456...))
✗ s3/output/summary.json.dvc (dep changed: s3/input/data.parquet)
✓ s3/output/metadata.json.dvc (up-to-date)
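
As a rough illustration of the two checks, here is a minimal Python sketch against the .dvc layout shown above. This is not dvx's actual implementation; it ignores directories, multiple outputs, and the mtime cache:

import hashlib
import yaml

def md5_of(path):
    """Hash a file's contents the way .dvc files record them."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def is_fresh(dvc_path):
    with open(dvc_path) as f:
        meta = yaml.safe_load(f)
    out = meta["outs"][0]
    # 1. Data freshness: does the file on disk match the recorded hash?
    if md5_of(out["path"]) != out["md5"]:
        return False
    # 2. Dep freshness: does each recorded dep hash match that dep's own .dvc file?
    deps = meta.get("meta", {}).get("computation", {}).get("deps", [])
    for dep in deps:
        with open(dep["path"] + ".dvc") as f:
            dep_meta = yaml.safe_load(f)
        if dep["md5"] != dep_meta["outs"][0]["md5"]:
            return False
    return True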

Provenance Tracking

When adding outputs with dependencies:

  • Deps must be fresh: dvx add errors if any dep's hash differs from its .dvc hash
  • Recursive add: Use dvx add -r to auto-add stale deps first
  • Accurate recording: Recorded dep hashes always match what was actually used
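
A hypothetical session showing the fresh-deps check (paths are illustrative):

# Suppose input.parquet changed after input.parquet.dvc was written.
# Recording output.parquet against it would capture a stale dep hash,
# so this errors:
dvx add output.parquet --dep input.parquet --cmd "python process.py"

# Auto-add the stale dep first, then the output:
dvx add -r output.parquet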

Performance

DVX is optimized for large repos:

  • Mtime caching: SQLite-backed cache skips unchanged files
  • Batched git lookups: Single git ls-tree call for all blob SHAs
  • Lock-free adds: Parallel-safe via atomic file writes
  • Parallel status: Check many files concurrently with -j/--jobs
  • Parallel runs: Independent computations execute concurrently
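
As a sketch of the mtime-caching idea: hash a file only when its mtime or size has changed since the last hash. The cache path and table schema below are assumptions, not dvx's actual internals:

import os
import sqlite3

# Hypothetical cache location; dvx's real path and schema may differ
db = sqlite3.connect(".dvx-mtime-cache.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS hashes "
    "(path TEXT PRIMARY KEY, mtime REAL, size INTEGER, md5 TEXT)"
)

def cached_md5(path, compute_md5):
    st = os.stat(path)
    row = db.execute(
        "SELECT mtime, size, md5 FROM hashes WHERE path = ?", (path,)
    ).fetchone()
    # Unchanged mtime and size: reuse the stored hash, skip rereading the file
    if row and row[0] == st.st_mtime and row[1] == st.st_size:
        return row[2]
    md5 = compute_md5(path)  # new or modified file: hash it for real
    db.execute(
        "REPLACE INTO hashes VALUES (?, ?, ?, ?)",
        (path, st.st_mtime, st.st_size, md5),
    )
    db.commit()
    return md5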

Compatibility

  • DVX uses .dvc files - fully compatible with DVC
  • DVX repos are DVC repos - you can use dvc commands too
  • DVC plugins (dvc-s3, dvc-gs, etc.) work with DVX
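
Because both tools read and write the same .dvc files and cache layout, you can mix them in one repo (assuming a remote is configured):

dvx add data.csv   # track with dvx
dvc status         # plain DVC reads the same .dvc files
dvc push           # and pushes the same cache entries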

License

Apache 2.0
