AI & ML
- Data
The raw information (text, numbers, images, etc.) used to train and evaluate a model.
- Feature
An individual measurable property of the data that a model uses as input.
- Model
The function learned from data that maps inputs to predictions.
- Supervised Learning
Training a model using labeled data where the outcome is already known
- Unsupervised Learning
Training a model on data that has no labeled outcomes; the model tries to find patterns, similarities, or groups within the data on its own.
- Reinforcement Learning
Training a model through trial and error, where it receives rewards or penalties based on its actions.
RAG (Retrieval-Augmented Generation) enhances the abilities of LLMs by allowing them to access external data sources, like databases or search engines, to improve the accuracy of their responses.
For example: if you ask ChatGPT about new tax regulations, it recognizes the need for recent information and, because it is using RAG, it can retrieve relevant data from external sources like government websites to provide an accurate response beyond its original training.
In the context of Large Language Models (LLMs) and AI/ML, a vector is a fundamental concept with several important roles. Let's break it down and then cover other key terminology.
- Vector (Embedding Vector)
In LLMs, a vector usually refers to a numerical representation of text (like a word, sentence, paragraph, or document).
This is called an embedding, which is a high-dimensional array of numbers (e.g., a 768-dimensional vector).
These vectors capture the semantic meaning of text. Words with similar meanings have vectors that are close together in this space.
Example:
"dog" -> [0.12, 0.45, -0.33, ..., 0.56]
"puppy" -> [0.14, 0.47, -0.30, ..., 0.59] (close to "dog")
Used in:
Search
Question answering
Similarity matching
Vector databases
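The idea that semantically similar words get nearby vectors can be sketched with a toy example. The 4-dimensional vectors below are made up for illustration (real embeddings have hundreds of dimensions), and cosine similarity is used to measure closeness:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by
    # the product of their lengths (1.0 = same direction).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional embeddings, for illustration only.
dog   = [0.12, 0.45, -0.33, 0.56]
puppy = [0.14, 0.47, -0.30, 0.59]
car   = [-0.80, 0.10, 0.70, -0.20]

print(cosine_similarity(dog, puppy))  # close to 1.0 (similar meaning)
print(cosine_similarity(dog, car))    # much lower (unrelated meaning)
```

With real embedding models the vectors come from the model itself, but the comparison step works exactly like this.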
- Embedding
An embedding is the actual vector representation of data.
Generated by a model like OpenAI's text-embedding-ada-002, SentenceTransformers, or LLaMA variants with an embedding head.
Embeddings reduce high-dimensional discrete data (like text) to continuous numeric form.
- Vector Space / Embedding Space
A multi-dimensional space where each vector (embedding) lives.
Semantic similarity is represented by distance or angle (e.g., cosine similarity).
Used in semantic search, retrieval-augmented generation (RAG), and clustering.
- Vector Database
Specialized databases that store vectors and allow fast similarity search.
Popular tools: FAISS, Pinecone, Weaviate, Chroma, Qdrant.
Allows you to retrieve relevant documents or info based on embedding similarity.
- Similarity Metrics
Metrics used to compare vectors:
Cosine similarity: Measures angle between vectors.
Euclidean distance: Measures straight-line distance.
Dot product: Raw similarity signal.
Used in nearest neighbor search.
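The three metrics above can be computed directly. A minimal sketch on two toy vectors (note that `b` points in the same direction as `a`, so the cosine similarity is exactly 1.0 even though the Euclidean distance is not zero):

```python
import math

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length

# Dot product: raw similarity signal (grows with vector length).
dot = sum(x * y for x, y in zip(a, b))

# Euclidean distance: straight-line distance between the points.
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Cosine similarity: angle between the vectors, ignoring length.
cosine = dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(dot)        # 28.0
print(euclidean)  # ~3.742
print(cosine)     # 1.0 (same direction)
```

This is why cosine similarity is the usual choice for embeddings: it measures direction (meaning), not magnitude.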
- Nearest Neighbor Search (ANN)
A way to find the most similar vectors in the database.
Approximate Nearest Neighbor (ANN) techniques like HNSW make this efficient at scale.
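For intuition, here is an exact (brute-force) nearest-neighbor search over a tiny in-memory "database"; ANN libraries like FAISS or HNSW-based indexes approximate this same ranking much faster at scale. The document names and vectors are hypothetical:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_neighbors(query, vectors, k=2):
    # Exact search: score every stored vector against the query,
    # then keep the k highest-scoring entries.
    scored = sorted(vectors.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in scored[:k]]

# Hypothetical document embeddings (3-dimensional for readability).
db = {
    "doc_dogs":  [0.9, 0.1, 0.0],
    "doc_cats":  [0.8, 0.3, 0.1],
    "doc_taxes": [0.0, 0.1, 0.9],
}

print(nearest_neighbors([0.85, 0.2, 0.05], db))  # the two animal docs rank first
```

An ANN index trades a small amount of accuracy for a large speedup, which is what makes vector search practical over millions of documents.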
- Retrieval-Augmented Generation (RAG)
Combines vector search with language models.
Steps:
- User asks a question.
- The query is embedded.
- Similar documents are retrieved from a vector database.
- The LLM uses those docs to generate an accurate answer.
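The steps above can be sketched as a minimal pipeline. Everything here is a stand-in: `embed` is a toy bag-of-words "vector" instead of a real embedding model, and `generate_answer` stands in for the LLM call, so the sketch stays runnable:

```python
def embed(text):
    # Hypothetical stand-in for an embedding model:
    # a bag-of-words "vector" (a set of lowercase words).
    return set(text.lower().split())

def retrieve(query_vec, documents, k=1):
    # Rank documents by overlap with the query "vector"; a real
    # system would query a vector database by cosine similarity.
    scored = sorted(documents, key=lambda d: len(query_vec & embed(d)), reverse=True)
    return scored[:k]

def generate_answer(question, context_docs):
    # Stand-in for the LLM call: a real system would prompt the
    # model with the question plus the retrieved context.
    return f"Answer to {question!r}, based on: {context_docs[0]}"

docs = [
    "The 2024 tax regulations raised the standard deduction.",
    "Dogs are domesticated descendants of wolves.",
]

question = "What changed in the new tax regulations?"  # step 1: user asks
q_vec = embed(question)                                # step 2: embed the query
context = retrieve(q_vec, docs)                        # step 3: retrieve similar docs
print(generate_answer(question, context))              # step 4: answer with context
```

Swapping in a real embedding model, a vector database, and an LLM API turns this skeleton into a working RAG system.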
- Chunking
Breaking text into manageable chunks before embedding.
Necessary to embed long documents (e.g., 500 words per chunk).
Affects performance of retrieval in RAG.
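A simple word-based chunker, as a sketch (the 500-word chunk size and the small overlap, which keeps boundary sentences from losing context, are common but tunable choices):

```python
def chunk_text(text, chunk_size=500, overlap=50):
    # Split text into chunks of roughly chunk_size words, with
    # consecutive chunks overlapping by `overlap` words so that
    # no sentence loses its context at a chunk boundary.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = "word " * 1200          # a stand-in 1200-word document
chunks = chunk_text(doc.strip())
print(len(chunks))            # 3 chunks
print(len(chunks[0].split())) # 500 words in the first chunk
```

Production pipelines often chunk by sentences or paragraphs instead of raw word counts, but the trade-off is the same: chunks small enough to embed well, large enough to retain context.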
- Tokenization
Breaking text into tokens (words, subwords, or characters).
Vectors are usually generated after tokenization.
Important for managing input length and model behavior.
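A toy word-level tokenizer illustrates the idea; real LLMs use subword schemes like BPE, which split rare words into smaller pieces so the vocabulary stays bounded:

```python
import re

def tokenize(text):
    # Toy tokenizer: lowercase, then split into words and
    # punctuation marks. Real LLM tokenizers work on subwords.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = tokenize("Vectors are generated after tokenization!")
print(tokens)       # ['vectors', 'are', 'generated', 'after', 'tokenization', '!']
print(len(tokens))  # token count is what the context window limits
```

Counting tokens (not characters or words) is what matters when checking whether an input fits in a model's context window.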
- Positional Encoding / Context Window
Vectors are also used inside models to represent where a word appears in the input.
LLMs have a context window (e.g., 4K or 32K tokens) within which they process inputs.
Beyond this window, earlier context is lost unless it is retrieved again (e.g., via RAG).
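One classic way positions are encoded is the sinusoidal scheme from the original Transformer paper: each position gets a distinct vector of sines and cosines that the model adds to the token embedding. A small sketch (with a toy 8-dimensional model size):

```python
import math

def positional_encoding(position, d_model=8):
    # Sinusoidal positional encoding ("Attention Is All You Need"):
    #   PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    #   PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

# Each position maps to a unique pattern of values.
print(positional_encoding(0))  # position 0 -> [0.0, 1.0, 0.0, 1.0, ...]
print(positional_encoding(5))
```

Many modern LLMs use learned or rotary position embeddings instead, but the goal is the same: give the model a signal for where each token sits in the sequence.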