DVX is a lightweight wrapper around DVC that provides core data versioning with several enhancements:
- Parallel pipeline execution with per-file provenance tracking
- Decentralized workflow definitions - each `.dvc` file contains its computation, deps, and outputs
- Enhanced diff with preprocessing pipelines and directory support
- Cache introspection commands for examining cached data
- Performance optimizations for large repos (batched git lookups, mtime caching)
DVC stores all pipeline stages in a single `dvc.yaml` file. DVX takes a different approach: each `.dvc` file contains its own computation metadata:
```yaml
# output.parquet.dvc
outs:
- md5: abc123...
  size: 1048576
  hash: md5
  path: output.parquet
  meta:
    computation:
      cmd: python process.py input.parquet output.parquet
      code_ref: main:a1b2c3d
      deps:
      - path: input.parquet
        md5: def456...
```

This means:
- Parallel execution: Independent artifacts run concurrently (see the sketch after this list)
- Self-contained provenance: Each output knows exactly how it was created
- Git-friendly: Changes to one artifact don't touch other files
- No lock contention: Multiple processes can add artifacts simultaneously
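To make the parallel-execution point concrete, here is a minimal sketch of level-by-level scheduling over per-file computation metadata. The `graph` shape, the `run_all` helper, and the `summarize.py` command are illustrative assumptions, not DVX's actual scheduler:

```python
# Hypothetical scheduler: run every computation whose deps are satisfied,
# wave by wave, using a thread pool. Not DVX internals.
import subprocess
from concurrent.futures import ThreadPoolExecutor

# {output_path: (cmd, [dep_paths])}, as could be parsed from .dvc files
graph = {
    "output.parquet": ("python process.py input.parquet output.parquet",
                       ["input.parquet"]),
    "summary.json": ("python summarize.py output.parquet summary.json",
                     ["output.parquet"]),
}

def run_all(graph, jobs=4):
    # Plain source files (deps that are not themselves outputs) count as done.
    done = {d for _, deps in graph.values() for d in deps if d not in graph}
    pending = dict(graph)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        while pending:
            # Everything whose deps are all satisfied can run concurrently.
            ready = [out for out, (_, deps) in pending.items()
                     if all(d in done for d in deps)]
            if not ready:
                raise RuntimeError("dependency cycle or missing input")
            futures = {out: pool.submit(subprocess.run, pending[out][0],
                                        shell=True, check=True)
                       for out in ready}
            for out, fut in futures.items():
                fut.result()          # propagate any failure
                done.add(out)
                del pending[out]

run_all(graph)
```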
Diff DVC-tracked files between commits, optionally piping through preprocessing commands first. This is especially useful for binary formats like Parquet, gzipped files, or any format that benefits from transformation before diffing.
```bash
# Content diff of a file (HEAD vs worktree)
dvx diff data.csv
# Diff between commits
dvx diff -r HEAD^..HEAD data.csv
# Diff specific commit vs its parent
dvx diff -R abc123 data.csv
# Summary mode: show changed files with hashes
dvx diff -s
dvx diff -s -r HEAD~5..HEAD
```

The real power is piping files through commands before diffing:

```bash
# Compare line counts
dvx diff wc -l data.csv
# Compare Parquet schema (using parquet2json)
dvx diff parquet2json {} schema data.parquet
# Compare first row as pretty JSON
dvx diff 'parquet2json {} cat -l 1 | jq .' data.parquet
# Decompress and compare headers of gzipped CSVs
dvx diff 'gunzip -c {} | head -n1' data.csv.gz
# Compare sorted, deduplicated content
dvx diff 'sort {} | uniq' data.txt
```

When diffing DVC-tracked directories, DVX shows which files changed with their hashes:

```
$ dvx diff -R abc123 data/
test.parquet: c07bba3f... -> f46dd86f...
test.txt: e20b902b... -> 9306ec07...
```

Examine cached data without checkout:

```bash
# Get cache path for a tracked file
dvx cache path data.parquet
# Get MD5 hash
dvx cache md5 data.parquet
# View cached file contents directly
dvx cat data.csv
# Works with files inside DVC-tracked directories
dvx cat data_dir/subset.parquet
```

Install with pip:

```bash
pip install dvx
# With S3 support
pip install dvx[s3]
# With all remote backends
pip install dvx[all]
```

```bash
# Run all .dvc computations (parallel by default)
dvx run
# Run specific target
dvx run output.dvc
# Use 4 parallel workers
dvx run -j 4
# Dry-run to see execution plan
dvx run --dry-run
# Force re-run (ignore freshness)
dvx run --force
```

```bash
# Initialize
dvx init
# Track files (parallel-safe, lock-free)
dvx add data/
dvx add model.pkl
# Track with dependencies (for provenance)
dvx add output.parquet --dep input.parquet --cmd "python process.py"
# Auto-add stale deps first (recursive)
dvx add -r output.parquet
```

```bash
# Check freshness (data vs deps)
dvx status
dvx status -v # also show fresh files
dvx status --yaml # detailed YAML output with hashes
dvx status -j4 data/ # parallel checking
# Content diff
dvx diff data.parquet
dvx diff -r HEAD^..HEAD results/
dvx diff -s # summary mode (files + hashes)
```

```bash
# Configure remote
dvx remote add -d myremote s3://mybucket/dvc
# Push to remote
dvx push
dvx push --dry-run # see what would be pushed
# Pull from remote
dvx pull
dvx pull --dry-run # see what would be pulled
# Ref-specific operations
dvx pull -r HEAD~3 # pull data as of 3 commits ago
```

The same operations are available from Python:

```python
from dvx import Repo
# Initialize
repo = Repo.init()
# Or open existing
with Repo() as repo:
    repo.add("data/")
    repo.push()
    status = repo.status()
    diff = repo.diff("HEAD~1")
```

| Command | Description |
|---|---|
| `run` | Execute computations from `.dvc` files (parallel) |
| `add` | Track file(s) with optional provenance |
| `status` | Show freshness of tracked files (data & deps) |
| `diff` | Content diff with preprocessing support |
| `cache` | Inspect cache (path, md5, dir) |
| `cat` | View cached file contents |
| `push` | Upload data to remote storage |
| `pull` | Download data from remote storage |
| `fetch` | Download to cache (no checkout) |
| `checkout` | Restore data files from cache |
| `gc` | Garbage collect unused cache |
| `init` | Initialize a DVX repository |
| `remote` | Manage remotes |
| `config` | Configure settings |
| `remove` | Stop tracking file(s) |
| `move` | Move tracked file(s) |
| `import` | Import from another DVC repo |
| `import-url` | Import from a URL |
| `get` | Download without tracking |
| `get-url` | Download URL without tracking |
| `shell-integration` | Output shell aliases |
On top of DVC, DVX adds:
- `dvx run` - Parallel pipeline execution with per-file provenance
- `dvx cache path`/`dvx cache md5` - Cache introspection
- `dvx cat` - View cached files directly
- `dvx diff` preprocessing - Pipe through commands before diffing
- `dvx status --yaml` - Detailed status with hashes
- Lock-free parallel `add` operations
- Git blob batching for faster status checks
DVX intentionally excludes DVC's heavier features:
- Central pipelines (`dvc.yaml`, `dvc repro`, `dvc dag`)
- Experiments (`dvc exp`, experiment tracking)
- Metrics/Params/Plots (`dvc metrics`, `dvc params`, `dvc plots`)
If you need these features, use DVC directly.
DVX tracks two types of freshness for each artifact:
- Data freshness: Does the actual data match the hash in the `.dvc` file?
- Dep freshness: Do recorded dependency hashes match the deps' `.dvc` files?
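A hypothetical sketch of these two checks against the `.dvc` layout shown earlier; the `load_dvcfile` and `is_fresh` helpers are illustrative, not DVX's code:

```python
# Illustrative freshness check mirroring the example .dvc file above.
import hashlib
from pathlib import Path

import yaml  # pip install pyyaml

def load_dvcfile(path):
    return yaml.safe_load(Path(path).read_text())

def file_md5(path):
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

def is_fresh(dvcfile):
    out = load_dvcfile(dvcfile)["outs"][0]

    # Data freshness: the artifact on disk still matches its recorded hash.
    if file_md5(out["path"]) != out["md5"]:
        return False, "data changed"

    # Dep freshness: each recorded dep hash matches that dep's own .dvc file.
    for dep in out.get("meta", {}).get("computation", {}).get("deps", []):
        if load_dvcfile(dep["path"] + ".dvc")["outs"][0]["md5"] != dep["md5"]:
            return False, f"dep changed: {dep['path']}"

    return True, "up-to-date"
```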
```
$ dvx status s3/output/
✗ s3/output/result.parquet.dvc (data changed (abc123... vs def456...))
✗ s3/output/summary.json.dvc (dep changed: s3/input/data.parquet)
✓ s3/output/metadata.json.dvc (up-to-date)
```

When adding outputs with dependencies:
- Deps must be fresh: `dvx add` errors if any dep's hash differs from its `.dvc` hash
- Recursive add: Use `dvx add -r` to auto-add stale deps first
- Accurate recording: Recorded dep hashes always match what was actually used
DVX is optimized for large repos:
- Mtime caching: SQLite-backed cache skips unchanged files (sketched below)
- Batched git lookups: Single `git ls-tree` call for all blob SHAs
- Lock-free adds: Parallel-safe via atomic file writes
- Parallel status: Check many files concurrently with `-j`/`--jobs`
- Parallel runs: Independent computations execute concurrently
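The mtime-caching idea can be pictured as follows; the database name, schema, and `cached_md5` helper are assumptions for illustration, not DVX's implementation:

```python
# Illustrative mtime cache: skip rehashing files whose size and mtime
# haven't changed since the last check. Schema and path are hypothetical.
import hashlib
import os
import sqlite3

db = sqlite3.connect("mtime-cache.db")
db.execute("""CREATE TABLE IF NOT EXISTS hashes
              (path TEXT PRIMARY KEY, mtime REAL, size INTEGER, md5 TEXT)""")

def cached_md5(path):
    st = os.stat(path)
    row = db.execute("SELECT mtime, size, md5 FROM hashes WHERE path = ?",
                     (path,)).fetchone()
    if row and row[0] == st.st_mtime and row[1] == st.st_size:
        return row[2]                 # unchanged: reuse the stored hash
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    db.execute("INSERT OR REPLACE INTO hashes VALUES (?, ?, ?, ?)",
               (path, st.st_mtime, st.st_size, md5))
    db.commit()
    return md5
```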
- DVX uses `.dvc` files - fully compatible with DVC (they are plain YAML; see the sketch below)
- DVX repos are DVC repos - you can use `dvc` commands too
- DVC plugins (dvc-s3, dvc-gs, etc.) work with DVX
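Since `.dvc` files are plain YAML, they are readable with ordinary tools as well; a minimal example using the sample file from above:

```python
import yaml  # pip install pyyaml

with open("output.parquet.dvc") as f:
    meta = yaml.safe_load(f)

# Standard DVC fields sit alongside DVX's provenance metadata.
print(meta["outs"][0]["md5"])
print(meta["outs"][0]["meta"]["computation"]["cmd"])
```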
Apache 2.0