
AI & ML

Full Stack edited this page Apr 9, 2025 · 7 revisions

Simple Explanations for Beginners


Machine Learning

Key aspects

  1. Data
  2. Feature
  3. Model

Types of Machine Learning

  1. Supervised Learning

Training a model using labeled data, where the outcome for each example is already known.

  2. Unsupervised Learning

Training a model on data that has no labeled outcomes; the model tries to find patterns, similarities, or groups within the data on its own.

  3. Reinforcement Learning

Training a model through trial and error, where it receives rewards or penalties based on its actions.
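As a toy illustration of supervised learning, here is a minimal nearest-centroid classifier in pure Python. The data, labels, and function names are invented for this sketch; real projects would typically use a library like scikit-learn.

```python
import math

def train(points, labels):
    """Compute the mean point (centroid) for each label."""
    centroids = {}
    for label in set(labels):
        group = [p for p, l in zip(points, labels) if l == label]
        centroids[label] = tuple(sum(c) / len(group) for c in zip(*group))
    return centroids

def predict(centroids, point):
    """Predict the label whose centroid is closest to the point."""
    return min(centroids, key=lambda label: math.dist(centroids[label], point))

# Labeled training data: each point's outcome (label) is already known.
X = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0), (8.5, 8.7)]
y = ["small", "small", "large", "large"]

model = train(X, y)
print(predict(model, (1.1, 0.9)))  # → small
```

The "learning" here is just averaging the labeled examples per class, but the workflow (train on known outcomes, then predict on new data) is the same shape as any supervised method.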

RAG (Retrieval-Augmented Generation)

RAG enhances the abilities of LLMs by allowing them to access external data sources, such as databases or search engines, to improve the accuracy of their responses.

For example: if you ask a RAG-enabled assistant like ChatGPT about new tax regulations, it recognizes the need for recent information and retrieves relevant data from external sources, such as government websites, to provide an accurate response beyond its original training data.

Terminologies


In the context of Large Language Models (LLMs) and AI/ML, a vector is a fundamental concept with many important roles. Let's break it down and then dive into other key terminologies.


  1. Vector (Embedding Vector)

In LLMs, a vector usually refers to a numerical representation of text (like a word, sentence, paragraph, or document).

This is called an embedding, which is a high-dimensional array of numbers (e.g., a 768-dimensional vector).

These vectors capture the semantic meaning of text. Words with similar meanings have vectors that are close together in this space.

Example:

"dog" -> [0.12, 0.45, -0.33, ..., 0.56]
"puppy" -> [0.14, 0.47, -0.30, ..., 0.59] (close to "dog")

Used in:

Search

Question answering

Similarity matching

Vector databases


  2. Embedding

An embedding is the actual vector representation of data.

Generated by a model like OpenAI's text-embedding-ada-002, SentenceTransformers, or LLaMA variants with an embedding head.

Embeddings reduce high-dimensional discrete data (like text) to continuous numeric form.


  3. Vector Space / Embedding Space

A multi-dimensional space where each vector (embedding) lives.

Semantic similarity is represented by distance or angle (e.g., cosine similarity).

Used in semantic search, retrieval-augmented generation (RAG), and clustering.


  4. Vector Database

Specialized databases that store vectors and allow fast similarity search.

Popular tools: FAISS, Pinecone, Weaviate, Chroma, Qdrant.

Allows you to retrieve relevant documents or info based on embedding similarity.
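A minimal sketch of what a vector database does under the hood, using brute-force cosine similarity over an in-memory list. The class name, document IDs, and toy 3-D vectors are invented for illustration; real systems like FAISS or Qdrant use optimized indexes instead of scanning every entry.

```python
import math

class TinyVectorDB:
    def __init__(self):
        self.items = []  # list of (doc_id, vector) pairs

    def add(self, doc_id, vector):
        self.items.append((doc_id, vector))

    def search(self, query, top_k=2):
        def cos(a, b):
            num = sum(x * y for x, y in zip(a, b))
            return num / (math.hypot(*a) * math.hypot(*b))
        # Rank every stored vector by similarity to the query (brute force).
        ranked = sorted(self.items, key=lambda it: cos(query, it[1]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:top_k]]

db = TinyVectorDB()
db.add("dog-article", [0.12, 0.45, -0.33])
db.add("puppy-article", [0.14, 0.47, -0.30])
db.add("car-article", [-0.80, 0.10, 0.60])
print(db.search([0.13, 0.46, -0.31], top_k=2))
```

A query vector near "dog"/"puppy" retrieves those two documents and leaves "car" behind, which is exactly the retrieval step RAG relies on.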


  5. Similarity Metrics

Metrics used to compare vectors:

Cosine similarity: Measures angle between vectors.

Euclidean distance: Measures straight-line distance.

Dot product: Raw similarity signal.

Used in nearest neighbor search.
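The three metrics can be sketched in a few lines of pure Python. The 3-D "embeddings" below are toy values; real embeddings have hundreds of dimensions.

```python
import math

def dot(a, b):
    """Dot product: raw (unnormalized) similarity signal."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between the vectors: 1.0 means same direction."""
    return dot(a, b) / (math.hypot(*a) * math.hypot(*b))

def euclidean_distance(a, b):
    """Straight-line distance between the vector endpoints."""
    return math.dist(a, b)

# Toy embeddings: "dog" and "puppy" point in nearly the same direction.
dog = [0.12, 0.45, -0.33]
puppy = [0.14, 0.47, -0.30]
car = [-0.80, 0.10, 0.60]

print(cosine_similarity(dog, puppy))  # close to 1.0 (very similar)
print(cosine_similarity(dog, car))    # negative (dissimilar)
```

Note that cosine similarity grows with similarity while Euclidean distance shrinks, so nearest-neighbor code must sort in the right direction for the metric it uses.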


  6. Nearest Neighbor Search (ANN)

A way to find the most similar vectors in the database.

Approximate Nearest Neighbor (ANN) techniques like HNSW make this efficient at scale.


  7. Retrieval-Augmented Generation (RAG)

Combines vector search with language models.

Steps:

  1. User asks a question.

  2. Query is embedded.

  3. Similar documents are retrieved from a vector database.

  4. LLM uses those docs to generate an accurate answer.
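The four steps above can be sketched end to end with a toy `embed()` function standing in for a real embedding model. The documents, the character-frequency embedding, and the prompt format are all invented for illustration, and the final LLM call is not shown.

```python
import math

def embed(text):
    """Toy embedding: normalized letter-frequency vector (NOT a real model)."""
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    counts = [text.lower().count(ch) for ch in alphabet]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

def dot(a, b):
    # Vectors are normalized, so the dot product equals cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# Documents are embedded ahead of time and stored (the "vector database").
docs = ["the tax rate changed in 2025", "puppies are young dogs"]
index = [(d, embed(d)) for d in docs]

# Steps 1-3: the user's question is embedded, similar documents retrieved.
question = "what is the new tax rate?"
q_vec = embed(question)
best_doc = max(index, key=lambda item: dot(q_vec, item[1]))[0]

# Step 4: the retrieved context is handed to the LLM (call not shown).
prompt = f"Context: {best_doc}\n\nQuestion: {question}"
print(best_doc)  # → the tax rate changed in 2025
```

Even with this crude embedding, the tax question retrieves the tax document rather than the one about puppies, which is the core mechanic of RAG.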


  8. Chunking

Breaking text into manageable chunks before embedding.

Necessary to embed long documents (e.g., 500 words per chunk).

Affects performance of retrieval in RAG.
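A minimal chunking sketch that splits on words with a configurable chunk size. The tiny sizes below are just for the demo; something like the 500 words per chunk mentioned above is more typical, and `overlap` must stay smaller than `chunk_size`.

```python
def chunk_words(text, chunk_size=500, overlap=0):
    """Split text into chunks of at most chunk_size words.

    overlap: how many words consecutive chunks share (must be < chunk_size).
    """
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

text = "one two three four five six seven"
print(chunk_words(text, chunk_size=3))
# → ['one two three', 'four five six', 'seven']
```

Production pipelines often chunk on sentence or paragraph boundaries instead of raw word counts, since cutting mid-sentence can hurt retrieval quality.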


  9. Tokenization

Breaking text into tokens (words, subwords, or characters).

Vectors are usually generated after tokenization.

Important for managing input length and model behavior.
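A toy illustration of word- and character-level tokenization. Real LLM tokenizers use learned subword schemes such as BPE, which this sketch does not implement.

```python
def word_tokens(text):
    """Word-level tokens: split on whitespace (a deliberate simplification)."""
    return text.lower().split()

def char_tokens(text):
    """Character-level tokens: one token per character."""
    return list(text)

print(word_tokens("Dogs are loyal"))  # → ['dogs', 'are', 'loyal']
print(char_tokens("dog"))             # → ['d', 'o', 'g']
```

Subword tokenizers sit between these two extremes: common words become single tokens, while rare words split into several, which is why token counts rarely match word counts.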


  10. Positional Encoding / Context Window

Vectors are also used inside models to represent where a word appears in the input.

LLMs have a context window (e.g., 4K or 32K tokens) within which they process inputs.

Beyond this window, earlier parts of the input are no longer visible to the model unless they are retrieved again (e.g., via RAG).

