A comprehensive implementation of static word embedding techniques for natural language processing, featuring SVD, CBOW, and Skip-gram models.
This repository contains implementations of three prominent static word embedding techniques:
- SVD (Singular Value Decomposition) - A frequency-based approach using co-occurrence matrices
- CBOW (Continuous Bag of Words) - A prediction-based neural embedding model
- Skip-gram - A prediction-based neural embedding model with superior performance on semantic tasks
All models are implemented from scratch in PyTorch and trained on the Brown Corpus. The embeddings are evaluated using the WordSim-353 dataset to measure semantic similarity performance.
The Brown Corpus is preprocessed before training, using the settings below (a minimal sketch follows the list):
- Stop word removal
- Non-alphabetic token filtering
- Word frequency thresholding (minimum frequency = 5)
- Context window size of 2
- Embedding dimension of 300
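A minimal sketch of this preprocessing, assuming the Brown Corpus and stop word list are loaded through NLTK; the function and variable names here are illustrative and not necessarily those used in `utils.py`:

```python
# Illustrative preprocessing sketch; names are assumptions, not the repo's actual API.
from collections import Counter
import nltk
from nltk.corpus import brown, stopwords

nltk.download("brown", quiet=True)
nltk.download("stopwords", quiet=True)

MIN_FREQ = 5      # word frequency threshold
WINDOW = 2        # context window size
EMBED_DIM = 300   # embedding dimension

stop_words = set(stopwords.words("english"))

def preprocess(sentences):
    """Lowercase, drop stop words and non-alphabetic tokens."""
    cleaned = []
    for sent in sentences:
        tokens = [w.lower() for w in sent
                  if w.isalpha() and w.lower() not in stop_words]
        if tokens:
            cleaned.append(tokens)
    return cleaned

def build_vocab(sentences, min_freq=MIN_FREQ):
    """Keep only words occurring at least min_freq times."""
    counts = Counter(w for sent in sentences for w in sent)
    vocab = [w for w, c in counts.items() if c >= min_freq]
    return {w: i for i, w in enumerate(vocab)}

sentences = preprocess(brown.sents())
word2idx = build_vocab(sentences)
```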
The three models are built as follows (illustrative sketches follow the list):
- SVD: builds a word-word co-occurrence matrix, applies truncated SVD, and normalizes the resulting vectors
- CBOW: predicts the target word from its surrounding context, trained with negative sampling
- Skip-gram: predicts the context words from the target word, trained with negative sampling
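A compact sketch of the SVD pipeline, assuming a symmetric window and truncation to 300 dimensions; `sentences` and `word2idx` are outputs of a preprocessing step like the one sketched above, and the L2 row normalization is an assumption about the "Normalization" step:

```python
# Illustrative SVD embedding sketch: co-occurrence matrix -> truncated SVD -> row normalization.
import torch
import torch.nn.functional as F

def svd_embeddings(sentences, word2idx, window=2, dim=300):
    V = len(word2idx)
    cooc = torch.zeros(V, V)
    # Count co-occurrences within the symmetric context window.
    for sent in sentences:
        ids = [word2idx[w] for w in sent if w in word2idx]
        for i, center in enumerate(ids):
            for j in range(max(0, i - window), min(len(ids), i + window + 1)):
                if j != i:
                    cooc[center, ids[j]] += 1.0
    # Truncated SVD: keep the top `dim` singular vectors.
    U, S, _ = torch.svd_lowrank(cooc, q=dim)
    emb = U * S                      # scale components by singular values
    return F.normalize(emb, dim=1)   # L2-normalize each word vector
```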
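And a minimal PyTorch sketch of the Skip-gram objective with negative sampling; the CBOW variant differs mainly in averaging the context embeddings and predicting the center word. Layer and variable names are illustrative, not necessarily those in `skipgram.py`:

```python
# Illustrative Skip-gram with negative sampling; not necessarily the repo's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNS(nn.Module):
    def __init__(self, vocab_size, dim=300):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, dim)   # center-word vectors
        self.out_embed = nn.Embedding(vocab_size, dim)  # context-word vectors

    def forward(self, center, context, negatives):
        # center: (B,), context: (B,), negatives: (B, K) sampled noise words
        v = self.in_embed(center)                         # (B, D)
        u_pos = self.out_embed(context)                   # (B, D)
        u_neg = self.out_embed(negatives)                 # (B, K, D)
        pos_score = (v * u_pos).sum(-1)                   # (B,)
        neg_score = torch.bmm(u_neg, v.unsqueeze(-1)).squeeze(-1)  # (B, K)
        # Maximize the true (center, context) pairs, minimize the sampled negatives.
        loss = -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(-1)).mean()
        return loss
```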
Performance on the WordSim-353 dataset, measured as the Spearman correlation between embedding cosine similarities and human similarity judgments (a sketch of the evaluation procedure follows the table):
| Model     | Spearman Correlation |
|-----------|----------------------|
| SVD       | 0.17186670           |
| CBOW      | 0.29502401           |
| Skip-gram | 0.32181557           |
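A minimal sketch of how such a score can be computed, assuming embeddings are stored as a tensor with a word-to-index mapping and that the WordSim-353 pairs are in a tab-separated `word1  word2  score` file; the file name and format are assumptions, not necessarily what `wordsim.py` expects:

```python
# Illustrative WordSim-353 evaluation; file name and format are assumptions.
import torch
import torch.nn.functional as F
from scipy.stats import spearmanr

def evaluate(emb, word2idx, pairs_path="wordsim353.tsv"):
    model_sims, human_sims = [], []
    with open(pairs_path) as f:
        for line in f:
            w1, w2, score = line.strip().split("\t")
            if w1 in word2idx and w2 in word2idx:
                v1, v2 = emb[word2idx[w1]], emb[word2idx[w2]]
                model_sims.append(F.cosine_similarity(v1, v2, dim=0).item())
                human_sims.append(float(score))
    # Spearman correlation between model similarities and human judgments.
    rho, _ = spearmanr(model_sims, human_sims)
    return rho
```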
Install the dependencies:

pip install -r requirements.txt
# Train SVD embeddings
python svd.py
# Train CBOW embeddings
python cbow.py
# Train Skip-gram embeddings
python skipgram.py
# Evaluate SVD embeddings
python wordsim.py svd.pt
# Evaluate CBOW embeddings
python wordsim.py cbow.pt
# Evaluate Skip-gram embeddings
python wordsim.py skipgram.pt
Repository layout:

├── svd.py # SVD implementation
├── cbow.py # CBOW implementation
├── skipgram.py # Skip-gram implementation
├── wordsim.py # Word similarity evaluation
├── utils.py # Utility functions
├── requirements.txt # Dependencies
├── svd.pt # Trained SVD embeddings
├── cbow.pt # Trained CBOW embeddings
├── skipgram.pt # Trained Skip-gram embeddings
└── report.pdf # Detailed analysis report
The repository includes t-SNE visualizations of word embeddings, demonstrating the clustering and relationships captured by each model.
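A typical way to produce such plots, assuming scikit-learn and matplotlib are available; the function name, word selection, and styling are illustrative only:

```python
# Illustrative t-SNE visualization of a subset of word embeddings.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(emb, word2idx, words, out_path="tsne.png"):
    kept = [w for w in words if w in word2idx]
    vecs = emb[[word2idx[w] for w in kept]].numpy()
    # Project the selected 300-d vectors down to 2-d for plotting.
    coords = TSNE(n_components=2, perplexity=5, init="pca",
                  random_state=0).fit_transform(vecs)
    plt.figure(figsize=(6, 6))
    plt.scatter(coords[:, 0], coords[:, 1], s=10)
    for (x, y), w in zip(coords, kept):
        plt.annotate(w, (x, y), fontsize=8)
    plt.savefig(out_path, dpi=150)
```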
Key findings:
- Skip-gram captures semantic relationships best of the three models
- CBOW offers a good balance between performance and training efficiency
- SVD trains fastest but captures semantics least effectively
- The neural models (CBOW and Skip-gram) significantly outperform matrix factorization (SVD)
- Mikolov et al. (2013). Efficient Estimation of Word Representations in Vector Space
- Mikolov et al. (2013). Distributed Representations of Words and Phrases and their Compositionality
- Goldberg & Levy (2014). word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method