High-Performance Data Pipelines in TensorFlow using tf.data
Efficient data loading and preprocessing are critical for training deep learning models at scale.
The goal of this project is to demonstrate best practices for building high-performance input pipelines with TensorFlow’s tf.data API, enabling faster training and optimal hardware utilization.
The notebook follows a step-by-step approach to constructing tf.data pipelines:
- Dataset creation – Build datasets from in-memory arrays, tensors, and file sources.
- Data transformation – Apply mapping functions for preprocessing (e.g., normalization, resizing).
- Shuffling and batching – Randomize data order and group into mini-batches for training.
- Performance optimization – Use cache(), prefetch(), and AUTOTUNE to reduce input bottlenecks.
- Iteration and inspection – Loop through datasets to validate contents and preprocessing logic (a minimal sketch of these steps follows this list).
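A minimal sketch of these steps, using synthetic NumPy data in place of real inputs (the shapes, the normalize helper, and the buffer sizes are illustrative assumptions, not taken verbatim from the notebook):

import numpy as np
import tensorflow as tf

# Synthetic data standing in for real inputs (shapes are illustrative).
features = np.random.randint(0, 256, size=(1000, 28, 28)).astype("float32")
labels = np.random.randint(0, 10, size=(1000,)).astype("int64")

def normalize(x, y):
    # Example preprocessing applied inside the pipeline.
    return x / 255.0, y

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))   # dataset creation
    .map(normalize, num_parallel_calls=tf.data.AUTOTUNE)     # data transformation
    .cache()                                                  # keep preprocessed elements in memory
    .shuffle(1000)                                            # randomize order
    .batch(32)                                                # group into mini-batches
    .prefetch(tf.data.AUTOTUNE)                               # overlap input with training
)

# Iteration and inspection
for x_batch, y_batch in dataset.take(1):
    print(x_batch.shape, y_batch.shape)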
Libraries used (from the code):
- TensorFlow – tf.data API for dataset creation, transformation, and performance tuning.
- NumPy – Generating synthetic data for demonstration.
No external dataset is provided – the notebook demonstrates pipelines using synthetic data arrays and tensors.
Requirements:
pip install tensorflow numpy
Run the notebook:
jupyter notebook tf_data.ipynb
or in JupyterLab:
jupyter lab tf_data.ipynb
Execute cells sequentially to reproduce the pipeline demonstrations.
- Created pipelines from multiple data sources (arrays, tensors, files).
- Applied preprocessing transformations directly in the dataset pipeline.
- Implemented shuffling, batching, and prefetching to improve throughput.
- Demonstrated AUTOTUNE for dynamic performance optimization.
Example snippet:
import tensorflow as tf  # features and labels are in-memory NumPy arrays or tensors
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(1000).batch(32).prefetch(tf.data.AUTOTUNE)
Pipeline with batching and prefetching:
<BatchDataset shapes: ((None, 28, 28), (None,)), types: (tf.float32, tf.int64)>
Iterating through dataset:
Features: tf.Tensor([...], shape=(32, 28, 28), dtype=float32)
Labels: tf.Tensor([...], shape=(32,), dtype=int64)
- tf.data allows flexible and composable data transformations directly in TensorFlow graphs.
- Prefetching and caching greatly improve GPU utilization during training (a file-based variant is sketched below).
- Using AUTOTUNE automates performance tuning without manual buffer sizing.
- A well-designed input pipeline can significantly reduce training time for large datasets.
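As a complementary sketch of the same ideas applied to file sources (the file pattern, image format, and decoding logic are assumptions for illustration, not taken from the notebook):

import tensorflow as tf

def load_image(path):
    # Read, decode, resize, and scale one image file (illustrative preprocessing).
    raw = tf.io.read_file(path)
    image = tf.io.decode_png(raw, channels=1)
    image = tf.image.resize(image, [28, 28]) / 255.0
    return image

# "images/*.png" is a placeholder pattern; adjust to the actual file layout.
files = tf.data.Dataset.list_files("images/*.png", shuffle=True)
pipeline = (
    files
    .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)  # decode/preprocess in parallel
    .cache()                                                # reuse decoded images after the first epoch
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)                             # overlap input with model execution
)

Caching after the map step means the expensive file reads and decodes happen only once; for datasets too large for memory, cache() also accepts a filename argument to spill the cache to disk.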
💡 Some interactive outputs (e.g., plots, widgets) may not display correctly on GitHub. If so, please view this notebook via nbviewer.org for full rendering.
Mehran Asgari Email: imehranasgari@gmail.com GitHub: https://github.com/imehranasgari
This project is licensed under the Apache 2.0 License – see the LICENSE file for details.