Run

- Generating Unsupervised & Supervised Data for Fine-Tuning Embedding Model

Data Generation Methods

Data Generation Scripts
- This repository contains scripts for generating training datasets.
- Each script creates both unsupervised samples and supervised samples (anchor, positive, negative) from an input CSV file.
- They differ in how they construct anchor–positive–negative pairs, how the data is split, and how negatives are sampled.

Baseline

Unsupervised learning data: each row is converted to a single text string built from all of its columns.
Supervised learning data mapping:
- For each anchor, create one anchor–positive pair per positive column, so each row yields as many pairs as there are positive columns.
- Attach one randomly sampled negative to each pair.
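The baseline mapping can be sketched roughly as follows. This is a minimal illustration, not the code in `gen_data/text_embedder_fine_tuning_data_gen_basic.py`: the function names, the sample rows, the "column: value" text format, and the rule that a negative is another row's anchor text from a different category are all assumptions made for the example.

```python
import random


def row_to_text(row, columns):
    """Unsupervised sample: join every column of a row into one text string."""
    return ", ".join(f"{col}: {row[col]}" for col in columns)


def build_baseline_triplets(rows, desc_col, category_col, positive_cols, seed=0):
    """Create one (anchor, positive, negative) triplet per positive column."""
    rng = random.Random(seed)
    triplets = []
    for i, row in enumerate(rows):
        # Assumption: a negative is the anchor text of another row
        # whose category differs from this row's category.
        candidates = [
            other[desc_col]
            for j, other in enumerate(rows)
            if j != i and other[category_col] != row[category_col]
        ]
        if not candidates:
            continue  # no valid negative for this row
        for col in positive_cols:
            triplets.append(
                {
                    "anchor": row[desc_col],
                    "positive": row[col],
                    "negative": rng.choice(candidates),
                }
            )
    return triplets


# Hypothetical rows standing in for the input CSV.
rows = [
    {
        "desc": "How do I open a NISA account?",
        "category": "investment",
        "title": "Opening a NISA Account",
        "summary": "Steps to start tax-free investing",
    },
    {
        "desc": "How do I apply for a home loan?",
        "category": "loan",
        "title": "Home Loan Application Guide",
        "summary": "Documents needed for a mortgage",
    },
]

unsupervised = [row_to_text(r, ["desc", "category", "title", "summary"]) for r in rows]
triplets = build_baseline_triplets(rows, "desc", "category", ["title", "summary"])
```

With two positive columns, each row contributes two triplets, which matches the "number of positive columns" pairs described above.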

python gen_data/text_embedder_fine_tuning_data_gen_basic.py --data_path {csv data path} --encoding {encoding} --desc_col {anchor column} --category_col {hard negative column} --positive_cols {positive column1, ...} --output_unsupervised {unsupervised train data save path} --output_supervised {supervised train data save path}

Domainwise Version

Same as baseline, but with domain-based data separation. (Each domain gets its own dataset file.)
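As a rough sketch of the domain-based separation, rows can be grouped by their domain value before the baseline mapping is applied to each group. The helper names and the output file naming pattern below are hypothetical; the actual script may name its files differently.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def split_by_domain(rows, domain_col):
    """Group rows by the value in the domain column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[domain_col]].append(row)
    return dict(groups)


def domain_output_path(output_dir, domain, kind):
    """Hypothetical naming scheme: one file per domain and data kind."""
    return str(PurePosixPath(output_dir) / f"{domain}_{kind}.csv")


# Hypothetical rows standing in for the input CSV.
rows = [
    {"desc": "How do I open a NISA account?", "domain": "banking"},
    {"desc": "How do I renew my car insurance?", "domain": "insurance"},
    {"desc": "How do I apply for a home loan?", "domain": "banking"},
]

groups = split_by_domain(rows, "domain")
paths = {d: domain_output_path("out", d, "supervised") for d in groups}
```

Each group can then be passed through the same pair construction as the baseline, so negatives and positives stay within one domain's file.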

python gen_data/text_embedder_fine_tuning_data_gen_domainwise.py --data_path {csv data path} --encoding {encoding} --desc_col {anchor column} --category_col {hard negative column} --positive_cols {positive column1, ...} --domain_col {domain column} --output_dir {unsupervised, supervised train data save folder path}

Multi-Negative with Positive Fusion Version

Unsupervised learning data: each row is converted to a single text string built from all of its columns.
Supervised learning data mapping:
- All positive values are fused into a single string.
  - E.g. "pos column1: xxx, pos column2: yyy, ..."
- Each anchor gets multiple hard negatives (default: 5).
- The output has one row per anchor.
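The fusion and multi-negative mapping can be sketched as below. This is an illustration under assumptions, not the repository's implementation: the function names and sample rows are hypothetical, and the sketch treats a "hard negative" as another row's anchor text from a different category, which the description above does not confirm.

```python
import random


def fuse_positives(row, positive_cols):
    """Fuse all positive values into one 'column: value, ...' string."""
    return ", ".join(f"{col}: {row[col]}" for col in positive_cols)


def build_fused_rows(rows, desc_col, category_col, positive_cols,
                     num_negatives=5, seed=0):
    """Return one output row per anchor with a fused positive and several negatives."""
    rng = random.Random(seed)
    output = []
    for i, row in enumerate(rows):
        # Assumption: negatives come from rows with a different category value.
        pool = [
            other[desc_col]
            for j, other in enumerate(rows)
            if j != i and other[category_col] != row[category_col]
        ]
        # Never request more negatives than the pool can provide.
        negatives = rng.sample(pool, min(num_negatives, len(pool)))
        output.append(
            {
                "anchor": row[desc_col],
                "positive": fuse_positives(row, positive_cols),
                "negatives": negatives,
            }
        )
    return output


# Hypothetical rows standing in for the input CSV.
rows = [
    {"desc": "Open a NISA account", "category": "investment",
     "title": "NISA guide", "summary": "Tax-free investing"},
    {"desc": "Apply for a home loan", "category": "loan",
     "title": "Loan guide", "summary": "Mortgage documents"},
    {"desc": "Open a savings account", "category": "savings",
     "title": "Savings guide", "summary": "Regular deposits"},
]

fused = build_fused_rows(rows, "desc", "category", ["title", "summary"], num_negatives=2)
```

Compared with the baseline, the number of output rows equals the number of anchors rather than anchors times positive columns.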

python gen_data/text_embedder_fine_tuning_data_gen_fusion_multineg.py --data_path {csv data path} --encoding {encoding} --desc_col {anchor column} --category_col {hard negative column} --positive_cols {positive column1, ...} --domain_col {domain column} --output_dir {unsupervised, supervised train data save folder path} --num_negatives {num of negatives}

- Embedding Gemma (300M) Fine-Tuning Sample Code

Fine-Tuning Process

Environment setup

conda create --name gemma-embedding python=3.10 -y
conda info --envs
conda activate gemma-embedding
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121
pip install -r embedding_gemma_requirements.txt
pip install --upgrade accelerate transformers

Fine-tuning
- Enter your Hugging Face token (huggingface_token) in the '.env' file.
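For orientation, a `.env` file is usually a list of `KEY=value` lines. The sketch below reads such a file with the standard library only; how `embedding_gemma_fine_tuning_test.py` actually loads the token is not shown here, so the parser, the temporary file, and the placeholder value are assumptions for illustration.

```python
import os
import tempfile


def load_env_file(path):
    """Parse simple KEY=value lines, skipping blanks and '#' comments."""
    values = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop optional surrounding quotes around the value.
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Demo with a temporary file and a placeholder value, not a real token.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as f:
        f.write("# placeholder value only\nhuggingface_token=hf_xxx\n")
    env = load_env_file(env_path)

token = env.get("huggingface_token")
```

Keeping the token in '.env' rather than in the script avoids committing it to the repository.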

export CUDA_VISIBLE_DEVICES=0
python embedding_gemma_fine_tuning_test.py

Results

- Query: I want to start a tax-free installment investment, what should I do?
Document: Opening a NISA Account -> 🤖 Score: 0.403728
Document: Opening a Regular Savings Account -> 🤖 Score: 0.329424
Document: Home Loan Application Guide -> 🤖 Score: 0.108175
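Scores like these are typically similarities between the query embedding and each document embedding. The sketch below ranks documents by cosine similarity; that the script uses cosine similarity is an assumption, and the three-dimensional vectors are made-up toy values, not embeddings produced by the model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def rank_documents(query_vec, doc_vecs):
    """Return (title, score) pairs sorted from most to least similar."""
    scored = [(title, cosine_similarity(query_vec, vec)) for title, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors chosen by hand for illustration only.
query_vec = [1.0, 0.2, 0.0]
doc_vecs = {
    "Home Loan Application Guide": [0.0, 0.3, 0.9],
    "Opening a NISA Account": [0.9, 0.3, 0.1],
    "Opening a Regular Savings Account": [0.5, 0.8, 0.2],
}

ranking = rank_documents(query_vec, doc_vecs)
```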


Contents

Run

- Generating Unsupervised & Supervised Data for Fine-Tuning Embedding Model

- Embedding Gemma (300M) Fine-Tuning Sample Code

Reference

- SimCSE (Simple Contrastive Learning of Sentence Embeddings)

- EmbeddingGemma Fine-Tuning

- MultipleNegativesRankingLoss, TripletLoss

Author

- LinkedIn

- Blog

- Email: qbxlvnf11@google.com, qbxlvnf11@naver.com


qbxlvnf11/text-embedding-fine-tuning
