
RNN Text Classification with Keras — Embeddings, Masking, and Efficient Batching

This project continues the IMDB sentiment analysis task and addresses key efficiency and modeling issues by:

  • using within‑batch padding (shorter sequences padded only to the longest in the batch, not the dataset),
  • introducing word embeddings to replace one‑hot vectors,
  • skipping computation on padded steps via Keras masking,
  • leveraging Keras RNNs (LSTMs/GRUs, stacked or bidirectional) to simplify code and improve performance.

Previous part (low‑level RNN, Part 1): https://github.com/Ashly1991/rnn-text-classification-tf2

What’s new in this repo (beyond Part 1)

  • Efficient batching: from_generator + padded_batch (pad to each batch’s max length).
  • Optional bucketing: group sequences by similar length to reduce padding waste (helps more with larger truncation limits such as 500).
  • Embeddings: compact, learnable representations replace one‑hot vectors; for a suitably small emb_dim this is faster and uses fewer parameters.
  • Keras RNNs: use optimized LSTM/GRU, easily stack layers and add Bidirectional context.
  • Masking: Embedding(mask_zero=True) propagates a mask through the model so the RNNs skip padded timesteps, improving accuracy and the amount learned per training step (a minimal check follows this list).
  • Cleaner training loop: model.fit with built‑in metrics/callbacks (still compatible with custom loops if needed).

Quick model sketch

import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size = 20000   # keep the 20k most frequent words
emb_dim = 128        # dimensionality of the learned embeddings

model = models.Sequential([
    layers.Embedding(vocab_size, emb_dim, mask_zero=True),  # index 0 is treated as padding and masked
    layers.Bidirectional(layers.LSTM(128)),                 # returns the final state only (return_sequences=False is the default)
    layers.Dense(1, activation="sigmoid")                   # binary sentiment probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

Training efficiency

  • Within‑batch padding: build a tf.data pipeline from a Python generator and apply padded_batch so each batch pads only to its own maximum length (see the sketch below).
  • Bucketing: optional length‑based grouping so that a single long sequence does not force heavy padding on an entire batch.
  • RaggedTensors: supported by many ops and Keras layers, but padded_batch does not accept them, and ragged pipelines can be slower in practice.

How to Run

python -m venv .venv && source .venv/bin/activate     # Windows: .venv\Scripts\activate
pip install -r requirements.txt
jupyter lab rnn-text-classification-keras.ipynb

Notes

  • Consider truncation (e.g., 200 or 500) and vocabulary limits for speed/quality trade‑offs.
  • Try LSTM vs GRU, stacked vs single layer, and bidirectional variants.
  • Monitor per‑batch time to see the gains from bucketing (a small callback for this is sketched below).
