A curated list of awesome DataOps tools.
- Awesome DataOps
 - Resources
 - Contributing
 
Tools related to data cataloging.
- Amundsen - Data discovery and metadata engine for improving productivity when interacting with data.
 - Apache Atlas - Provides open metadata management and governance capabilities to build a data catalog.
 - CKAN - Open-source DMS (data management system) for powering data hubs and data portals.
 - DataHub - LinkedIn's generalized metadata search & discovery tool.
 - Magda - A federated, open-source data catalog for all your big data and small data.
 - Marquez - Service for the collection, aggregation, and visualization of a data ecosystem's metadata.
 - Metacat - Unified metadata exploration API service for Hive, RDS, Teradata, Redshift, S3 and Cassandra.
 - OpenLineage - Open standard for metadata and lineage collection.
 - OpenMetadata - A single place to discover, collaborate and get your data right.
 - Unity Catalog - Industry’s only universal catalog for data and AI.
 
Tools for performing data exploration.
- Apache Zeppelin - Enables data-driven, interactive data analytics and collaborative documents.
 - Jupyter Notebook - Web-based notebook environment for interactive computing.
 - JupyterLab - The next-generation user interface for Project Jupyter.
 - Jupytext - Jupyter Notebooks as Markdown Documents, Julia, Python or R scripts.
 - Polynote - The polyglot notebook with first-class Scala support.
 
Tools for performing data ingestion.
- Amazon Kinesis - Easily collect, process, and analyze video and data streams in real time.
 - Apache Gobblin - A framework that simplifies common aspects of big data such as data ingestion.
 - Apache Kafka - Open-source distributed event streaming platform used by thousands of companies.
 - Apache Pulsar - Distributed pub-sub messaging platform with a flexible messaging model and intuitive API.
 - Embulk - A parallel bulk data loader that helps data transfer between various storages.
 - Fluentd - Collects events from various data sources and writes them to files.
 - Google PubSub - Ingest events for streaming into BigQuery, data lakes or operational databases.
 - Nakadi - A distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues.
 - Pravega - An open source distributed storage service implementing Streams.
 - RabbitMQ - One of the most popular open source message brokers.
 
Tools related to data workflow/pipeline.
- Apache Airflow - A platform to programmatically author, schedule, and monitor workflows.
 - Apache Oozie - An extensible, scalable and reliable system to manage complex Hadoop workloads.
 - Azkaban - Batch workflow job scheduler created at LinkedIn to run Hadoop jobs.
 - Dagster - An orchestration platform for the development, production, and observation of data assets.
 - Luigi - Python module that helps you build complex pipelines of batch jobs.
 - Prefect - A workflow management system, designed for modern infrastructure.
 
Tools related to data processing (batch and stream).
- Apache Beam - A unified model for defining both batch and streaming data-parallel processing pipelines.
 - Apache Flink - An open source stream processing framework with powerful capabilities.
 - Apache Hadoop MapReduce - A framework for writing applications which process vast amounts of data.
 - Apache Nifi - An easy to use, powerful, and reliable system to process and distribute data.
 - Apache Samza - A distributed stream processing framework which uses Apache Kafka and Hadoop YARN.
 - Apache Spark - A unified analytics engine for large-scale data processing.
 - Apache Storm - An open source distributed realtime computation system.
 - Apache Tez - A generic data-processing pipeline engine envisioned as a low-level engine.
 - Faust - A stream processing library, porting the ideas from Kafka Streams to Python.
 
Tools for ensuring data quality.
- Cerberus - Lightweight, extensible data validation library for Python.
 - Cleanlab - Data-centric AI tool to detect (non-predefined) issues in ML data like label errors or outliers.
 - DataProfiler - A Python library designed to make data analysis, monitoring, and sensitive data detection easy.
 - Deequ - A library built on top of Apache Spark for measuring data quality in large datasets.
 - Great Expectations - A Python data validation framework that lets you test your data against defined expectations.
 - JSON Schema - A vocabulary that allows you to annotate and validate JSON documents.
 - SodaSQL - Data profiling, testing, and monitoring for SQL accessible data.
 
Tools related to data serialization.
- Apache Avro - A data serialization system which is compact, fast and provides rich data structures.
 - Apache ORC - A self-describing type-aware columnar file format designed for Hadoop workloads.
 - Apache Parquet - A columnar storage format which provides efficient storage and encoding of data.
 - Kryo - A fast and efficient binary object graph serialization framework for Java.
 - ProtoBuf - Language-neutral, platform-neutral, extensible mechanism for serializing structured data.
 
- Pigz - A parallel implementation of gzip for modern multi-processor, multi-core machines.
 - Snappy - Open source compression library that is fast, stable and robust.
 
- Apache Hudi - Manages the storage of large analytical datasets on DFS.
 - Apache Iceberg - Open table format for huge analytic datasets.
 - Delta Lake - An open source project that enables building a Lakehouse architecture on top of data lakes.
 
Tools for performing data visualization (DataViz).
- Apache Superset - A modern data exploration and data visualization platform.
 - Count - SQL/drag-and-drop querying and visualisation tool based on notebooks.
 - Dash - Analytical Web Apps for Python, R, Julia, and Jupyter.
 - Data Studio - Reporting solution for power users who want to go beyond the data and dashboards of Google Analytics (GA).
 - HUE - A mature SQL Assistant for querying Databases & Data Warehouses.
 - Lux - Fast and easy data exploration by automating the visualization and data analysis process.
 - Metabase - The simplest, fastest way to get business intelligence and analytics to everyone.
 - Redash - Connect to any data source, easily visualize, dashboard and share your data.
 - Tableau - A powerful, fast-growing data visualization tool used in the business intelligence industry.
 
Tools related to storing data in data warehouses (DW).
- Amazon Redshift - Accelerate your time to insights with fast, easy, and secure cloud data warehousing.
 - Apache Hive - Facilitates reading, writing, and managing large datasets residing in distributed storage.
 - Apache Kylin - An open source, distributed analytical data warehouse for big data.
 - Google BigQuery - Serverless, highly scalable, and cost-effective multicloud data warehouse.
 
Database tools for storing data.
- Apache Cassandra - Open source wide-column DBMS designed to handle large amounts of data.
 - Apache Druid - Designed to quickly ingest massive quantities of event data, and provide low-latency queries.
 - Apache HBase - An open-source, distributed, versioned, column-oriented store.
 - Scylla - Designed to be compatible with Cassandra while achieving higher throughputs and lower latencies.
 
- Apache CouchDB - An open-source document-oriented NoSQL database, implemented in Erlang.
 - Elasticsearch - A distributed document oriented database with a RESTful search engine.
 - MongoDB - A cross-platform document database that uses JSON-like documents with optional schemas.
 - RethinkDB - The first open-source scalable database built for realtime applications.
 
- Age - A multi-model database that supports both graph and relational data models.
 - ArangoDB - A scalable open-source multi-model database natively supporting graph, document and search.
 - JanusGraph - Manage large graphs with billions of vertices and edges distributed across a multi-machine cluster.
 - Memgraph - An open source graph database, built for real-time streaming data, compatible with Neo4j.
 - Neo4j - A high performance graph store with all the features expected of a mature and robust database.
 - Titan - A highly scalable graph database optimized for storing and querying large graphs.
 
- Apache Accumulo - A sorted, distributed key-value store that provides robust and scalable data storage.
 - Dragonfly - A modern in-memory datastore, fully compatible with Redis and Memcached APIs.
 - DynamoDB - Fast, flexible NoSQL database service for single-digit millisecond performance at any scale.
 - etcd - Distributed reliable key-value store for the most critical data of a distributed system.
 - EVCache - A distributed in-memory data store for the cloud.
 - Memcached - A high performance multithreaded event-based key/value cache store.
 - Redis - An in-memory key-value database that persists on disk.
 
- CockroachDB - A distributed database designed to build, scale, and manage data-intensive apps.
 - Crate - A distributed SQL database that makes it simple to store and analyze massive amounts of data.
 - MariaDB - A replacement of MySQL with more features, new storage engines and better performance.
 - MySQL - One of the most popular open source transactional databases.
 - PostgreSQL - An advanced RDBMS that supports an extended subset of the SQL standard.
 - RQLite - A lightweight, distributed relational database, which uses SQLite as its storage engine.
 - SQLite - A popular choice as embedded database software for local/client storage.
 
- Akumuli - Can be used to capture, store and process time-series data in real-time.
 - Atlas - An in-memory dimensional time series database.
 - InfluxDB - Scalable datastore for metrics, events, and real-time analytics.
 - QuestDB - An open source SQL database designed to process time-series data faster.
 - TimescaleDB - Open-source time-series SQL database optimized for fast ingest and complex queries.
 
- Milvus - An open source embedding vector similarity search engine powered by Faiss, NMSLIB and Annoy.
 - Pinecone - Managed and distributed vector similarity search used with a lightweight SDK.
 - Qdrant - An open source vector similarity search engine with extended filtering support.
 
Tools related to file system and data storage.
- Alluxio - A virtual distributed storage system.
 - Amazon Simple Storage Service (S3) - Object storage built to retrieve any amount of data from anywhere.
 - Apache Hadoop Distributed File System (HDFS) - A distributed file system.
 - GlusterFS - A software defined distributed storage that can scale to several petabytes.
 - Google Cloud Storage (GCS) - Object storage for companies of all sizes, to store any amount of data.
 - LakeFS - Open source tool that transforms your object storage into a Git-like repository.
 - LizardFS - A highly reliable, scalable and efficient distributed file system.
 - MinIO - High Performance, Kubernetes Native Object Storage compatible with Amazon S3 API.
 - SeaweedFS - A fast distributed storage system for blobs, objects, files, and data lake.
 - Swift - A distributed object storage system designed to scale from a single machine to thousands of servers.
 
Tools used for logging and monitoring data workflows.
- Grafana - Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, InfluxDB and more.
 - Loki - A horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus.
 - Prometheus - A monitoring system and time series database.
 - Whylogs - A tool for creating data logs, enabling monitoring for data drift and data quality issues.
 
Tools used for storing and serving metadata.
- Hive Metastore - Service that stores metadata related to Apache Hive and other services.
 - Metacat - Provides you information about what data you have, where it resides and how to process it.
 
Tools for testing and sharing SQL snippets in mock databases.
- RunSQL - Free online SQL playground for MySQL, PostgreSQL, and SQL Server.
 - SQLFiddle - Online SQL compiler for learning and practicing SQL.
 
Tools for processing SQL statements in parallel.
- Apache Drill - Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.
 - Apache Impala - Lightning-fast, distributed SQL queries for petabytes of data.
 - Dremio - Power high-performing BI dashboards and interactive analytics directly on data lake.
 - Presto - A distributed SQL query engine for big data.
 - Trino - A fast distributed SQL query engine for big data analytics.
 
Where to discover new tools and discuss existing ones.
- Data Mesh: Delivering Data-Driven Value at Scale (O'Reilly)
 - Designing Data-Intensive Applications (O'Reilly)
 - Fundamentals of Data Engineering (O'Reilly)
 - Getting Started with Impala (O'Reilly)
 - Learning and Operating Presto (O'Reilly)
 - Learning Spark: Lightning-Fast Data Analytics (O'Reilly)
 - Spark in Action (O'Reilly)
 - Spark: The Definitive Guide (O'Reilly)
 
All contributions are welcome! Please take a look at the contribution guidelines first.