This project continues the IMDB sentiment analysis task and addresses key efficiency and modeling issues by:
- using within‑batch padding (shorter sequences padded only to the longest in the batch, not the dataset),
- introducing word embeddings to replace one‑hot vectors,
- skipping computation on padded steps via Keras masking,
- leveraging Keras RNNs (LSTMs/GRUs, stacked or bidirectional) to simplify code and improve performance.
Previous part (low‑level RNN, Part 1): https://github.com/Ashly1991/rnn-text-classification-tf2
- Efficient batching: `from_generator` + `padded_batch` (pad to each batch's max length).
- Optional bucketing: group sequences of similar length to reduce padding waste (helps more with larger truncation limits such as 500); see the pipeline sketch further below.
- Embeddings: compact, learnable representations replace one‑hot vectors; faster, with fewer parameters for a suitable `emb_dim`.
- Keras RNNs: use optimized `LSTM`/`GRU`, easily stack layers and add `Bidirectional` context.
- Masking: `Embedding(mask_zero=True)` propagates masks so RNNs skip padded steps → better, faster learning per training step.
- Cleaner training loop: `model.fit` with built‑in metrics/callbacks (still compatible with custom loops if needed).
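A minimal model that wires these pieces together: an `Embedding` with `mask_zero=True`, a `Bidirectional` `LSTM`, and a sigmoid output.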
```python
import tensorflow as tf
from tensorflow.keras import layers, models
vocab_size = 20000
emb_dim = 128
model = models.Sequential([
layers.Embedding(vocab_size, emb_dim, mask_zero=True),
layers.Bidirectional(layers.LSTM(128, return_sequences=False)),
layers.Dense(1, activation="sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```
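For the cleaner training loop, a minimal sketch of `model.fit` with built‑in callbacks. `train_ds` and `val_ds` are placeholder names, assumed to be padded `tf.data` datasets like the one sketched further below:

```python
# train_ds / val_ds are placeholders for padded tf.data datasets of
# (token ids, label) pairs, e.g. built with the pipeline sketched below.
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=5,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(
            monitor="val_loss", patience=2, restore_best_weights=True
        ),
    ],
)
```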
- Within‑batch padding: build a `tf.data` pipeline from a Python generator and apply `padded_batch` so each batch pads only to its own max length.
- Bucketing: optional length‑based grouping that avoids one long sequence slowing down the whole batch.
- RaggedTensors: supported by many ops and Keras layers, but not by `padded_batch`; ragged pipelines can be slower in practice.
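A minimal pipeline sketch along these lines. It assumes reviews are already integer‑encoded lists of word ids with 0 reserved for padding (as in `tf.keras.datasets.imdb`); `max_len`, the bucket boundaries, and the batch size are illustrative choices, and bucketing uses `Dataset.bucket_by_sequence_length`, available in recent TensorFlow releases:

```python
import tensorflow as tf

max_len = 200  # illustrative truncation limit

def gen(texts, labels):
    # Yield one (integer-encoded review, label) pair at a time; truncate long reviews.
    for x, y in zip(texts, labels):
        yield x[:max_len], y

def make_dataset(texts, labels, batch_size=64, bucket=False):
    ds = tf.data.Dataset.from_generator(
        lambda: gen(texts, labels),
        output_signature=(
            tf.TensorSpec(shape=(None,), dtype=tf.int32),
            tf.TensorSpec(shape=(), dtype=tf.int32),
        ),
    )
    if bucket:
        # Group reviews of similar length so each bucket pads only to its own max.
        ds = ds.bucket_by_sequence_length(
            element_length_func=lambda x, y: tf.shape(x)[0],
            bucket_boundaries=[50, 100, 150],
            bucket_batch_sizes=[batch_size] * 4,
        )
    else:
        # Pad each batch to its own longest sequence; 0 is the padding/mask id.
        ds = ds.padded_batch(batch_size, padded_shapes=([None], []))
    return ds.prefetch(tf.data.AUTOTUNE)
```

With bucketing enabled, sequences of similar length are batched together, so a single 500‑token review no longer forces a batch of short reviews to pad all the way to 500.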
```bash
python -m venv .venv && source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
jupyter lab rnn-text-classification-keras.ipynb
```

- Consider truncation (e.g., 200 or 500) and vocabulary limits for speed/quality trade‑offs.
- Try LSTM vs GRU, stacked vs single layer, and bidirectional variants.
- Monitor per‑batch time to see gains from bucketing.
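One simple way to monitor per‑batch time is a custom Keras callback that records wall‑clock time for each training batch; this is a sketch with a made‑up `BatchTimer` name, not code from the notebook:

```python
import time
import tensorflow as tf

class BatchTimer(tf.keras.callbacks.Callback):
    """Records wall-clock seconds per training batch (hypothetical helper)."""

    def on_train_begin(self, logs=None):
        self.batch_times = []

    def on_train_batch_begin(self, batch, logs=None):
        self._start = time.perf_counter()

    def on_train_batch_end(self, batch, logs=None):
        self.batch_times.append(time.perf_counter() - self._start)
```

Keep a reference to the instance passed into `model.fit(..., callbacks=[timer])` and compare the mean of `timer.batch_times` with and without bucketing.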