Overview

XPK (Accelerated Processing Kit, pronounced x-p-k) is a command-line interface that simplifies cluster creation and workload execution on Google Kubernetes Engine (GKE). XPK creates preconfigured, training-optimized clusters and lets you schedule workloads without any Kubernetes expertise.

XPK is recommended for quickly creating GKE clusters for proofs of concept and testing.

XPK decouples provisioning capacity from running jobs. There are two structures: clusters (provisioned VMs) and workloads (training jobs). Clusters represent the physical resources you have available. Workloads represent training jobs: at any given time, some are completed, some are running, and some are queued, waiting for cluster resources to become available.
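The cluster/workload split can be pictured with a toy model. The sketch below is only an illustration of the three workload states described above, not how XPK or GKE actually schedules jobs; the `Cluster` and `Workload` classes, the chip counts, the job names, and the strict first-in, first-out policy are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Workload:
    name: str
    chips: int             # accelerator chips the job needs (illustrative unit)
    state: str = "queued"  # queued -> running -> completed


@dataclass
class Cluster:
    total_chips: int
    free_chips: int = field(init=False)
    queue: deque = field(default_factory=deque)
    running: list = field(default_factory=list)

    def __post_init__(self):
        # A freshly provisioned cluster has all of its capacity available.
        self.free_chips = self.total_chips

    def submit(self, workload):
        """Queue a workload, then start whatever fits."""
        self.queue.append(workload)
        self._schedule()

    def complete(self, workload):
        """Finish a running workload and return its chips to the pool."""
        self.running.remove(workload)
        workload.state = "completed"
        self.free_chips += workload.chips
        self._schedule()

    def _schedule(self):
        # Strict FIFO (an assumption): start jobs from the head of the
        # queue for as long as the head job fits in the free capacity.
        while self.queue and self.queue[0].chips <= self.free_chips:
            job = self.queue.popleft()
            job.state = "running"
            self.free_chips -= job.chips
            self.running.append(job)


# Provision once, then submit jobs without touching the cluster again.
cluster = Cluster(total_chips=16)
a, b, c = Workload("train-a", 8), Workload("train-b", 8), Workload("eval-c", 4)
for job in (a, b, c):
    cluster.submit(job)
print([(j.name, j.state) for j in (a, b, c)])  # train-a, train-b running; eval-c queued

cluster.complete(a)
print([(j.name, j.state) for j in (a, b, c)])  # train-a completed; eval-c now running
```

The cluster object is created once and never resized; only workloads move between states, which is the decoupling the paragraph above describes.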

The ideal workflow starts by provisioning clusters for all of the ML hardware you have reserved. Then, without re-provisioning, submit jobs as needed. Because there is no re-provisioning between jobs, and because jobs use Docker containers with pre-installed dependencies and cross ahead-of-time compilation, queued jobs start with minimal delay. Further, because workloads return their hardware to the shared pool when they complete, developers can make better use of finite hardware resources, and automated tests can run overnight, when resources tend to be underutilized.
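As a rough sketch of that workflow, the snippet below builds one `xpk cluster create` command followed by several `xpk workload create` commands and prints them without running anything. The cluster name, project, zone, TPU type, and training command are hypothetical placeholders, and the flags shown should be checked against `xpk cluster create --help` and `xpk workload create --help` for your installed version.

```python
import shlex

# Hypothetical placeholders: substitute your own values.
CLUSTER = "xpk-demo"
TPU_TYPE = "v5litepod-16"
COMMON = ["--project", "my-project", "--zone", "us-central1-a"]


def cluster_create_cmd(num_slices):
    """Build the argv for provisioning the cluster (done once)."""
    return [
        "xpk", "cluster", "create",
        "--cluster", CLUSTER,
        "--tpu-type", TPU_TYPE,
        "--num-slices", str(num_slices),
        *COMMON,
    ]


def workload_create_cmd(name, command):
    """Build the argv for submitting one workload to the existing cluster."""
    return [
        "xpk", "workload", "create",
        "--workload", name,
        "--cluster", CLUSTER,
        "--tpu-type", TPU_TYPE,
        "--command", command,
        *COMMON,
    ]


# One provisioning step, then any number of job submissions.
plan = [cluster_create_cmd(num_slices=1)]
plan += [
    workload_create_cmd(f"job-{i}", f"python3 train.py --seed {i}")
    for i in range(3)
]

# Dry run: print the commands instead of executing them.
for argv in plan:
    print(shlex.join(argv))
```

To execute the plan for real, each `argv` list could be passed to `subprocess.run(argv, check=True)`; the dry run is kept here so the example has no side effects.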

XPK supports a variety of hardware accelerators.

Accelerator	Type	Create Cluster	Create Workload
TPU	Ironwood (tpu7x) [NEW]	docs	docs
TPU	Trillium (v6e)	docs	docs
TPU	v5p	docs	docs
TPU	v5e	docs	docs
TPU	v4	docs	docs
GPU	A100	docs	docs
GPU	A3-Highgpu (h100)	docs	docs
GPU	A3-Mega (h100-mega)	docs	docs
GPU	A3-Ultra (h200)	docs	docs
GPU	A4 (b200)	docs	docs
GPU	A4X (gb200)	docs	docs
CPU	n2-standard-32	docs	docs

XPK also supports the following Google Cloud storage solutions:

Storage Type	Documentation
Cloud Storage FUSE	docs
Filestore	docs
Parallelstore	docs
Block storage (Persistent Disk, Hyperdisk)	docs

Documentation

- Permissions
- Installation
- Usage
  - Clusters
    - GPU
    - CPU
    - Autoprovisioning
  - Workloads
    - Docker
  - Storage
  - Advanced
  - Inspector
  - Run
  - Job
- Troubleshooting
- Local Testing

Contributing

Please read contributing.md for details on our code of conduct and the process for submitting pull requests.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
