
RNN Text Classification with Keras — Embeddings, Masking, and Efficient Batching

This project continues the IMDB sentiment analysis task and addresses key efficiency and modeling issues by:

  • using within‑batch padding (shorter sequences padded only to the longest in the batch, not the dataset),
  • introducing word embeddings to replace one‑hot vectors,
  • skipping computation on padded steps via Keras masking,
  • leveraging Keras RNNs (LSTMs/GRUs, stacked or bidirectional) to simplify code and improve performance.

Previous part (low‑level RNN, Part 1): https://github.com/Ashly1991/rnn-text-classification-tf2

What’s new in this repo (beyond Part 1)

  • Efficient batching: from_generator + padded_batch (pad to each batch’s max length).
  • Optional bucketing: group sequences by similar length to reduce padding waste (helps more with larger truncation limits such as 500).
  • Embeddings: compact, learnable representations replace one‑hot vectors; for a suitably small emb_dim this is faster and uses fewer parameters.
  • Keras RNNs: use optimized LSTM/GRU, easily stack layers and add Bidirectional context.
  • Masking: Embedding(mask_zero=True) propagates a mask through the model so the RNNs skip padded timesteps, improving accuracy and the amount learned per training step (a minimal check follows this list).
  • Cleaner training loop: model.fit with built‑in metrics/callbacks (still compatible with custom loops if needed).

Quick model sketch

import tensorflow as tf
from tensorflow.keras import layers, models

vocab_size = 20000   # keep the 20k most frequent words
emb_dim = 128        # dimensionality of the learned embeddings

model = models.Sequential([
    layers.Embedding(vocab_size, emb_dim, mask_zero=True),  # index 0 is treated as padding and masked
    layers.Bidirectional(layers.LSTM(128)),                 # returns the final state only (return_sequences=False is the default)
    layers.Dense(1, activation="sigmoid")                   # binary sentiment probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

Training efficiency

  • Within‑batch padding: build a tf.data pipeline from a Python generator and apply padded_batch so each batch pads only to its own maximum length (see the sketch below).
  • Bucketing: optional length‑based grouping so that a single long sequence does not force heavy padding on an entire batch.
  • RaggedTensors: supported by many ops and Keras layers, but padded_batch does not accept them, and ragged pipelines can be slower in practice.

How to Run

python -m venv .venv && source .venv/bin/activate     # Windows: .venv\Scripts\activate
pip install -r requirements.txt
jupyter lab rnn-text-classification-keras.ipynb

Notes

  • Consider truncation (e.g., 200 or 500) and vocabulary limits for speed/quality trade‑offs.
  • Try LSTM vs GRU, stacked vs single layer, and bidirectional variants.
  • Monitor per‑batch time to see the gains from bucketing (a small callback for this is sketched below).
