This repository contains the Jupyter notebooks from the course Evaluating and Debugging Generative AI, created by DeepLearning.AI in collaboration with Weights & Biases.
The course explores practical ways to evaluate, debug, and improve generative AI models while leveraging W&B for experiment tracking and visualization.
Learn how to set up and use Weights & Biases to track experiments, log results, and visualize performance.
Learning outcomes:
- Set up W&B in your environment
- Log metrics, predictions, and artifacts
- Visualize and compare experiments effectively
Understand how to train diffusion models (the backbone of many generative image models like Stable Diffusion) while using W&B to monitor training.
Learning outcomes:
- Explain the basic idea of diffusion models
- Train a diffusion model step by step
- Track training progress and generated samples with W&B
- Spot and debug issues during training
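The core of the forward (noising) process that diffusion training relies on can be sketched in a few lines of NumPy. This is a toy illustration under a standard linear beta schedule (the schedule values and array shapes are illustrative, not the course's exact configuration): the model being trained learns to predict the added noise from the noised sample.

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(a_bar_t) * x0, (1 - a_bar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t]
    xt = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    # Training target: predict `noise` given (xt, t), typically with MSE loss.
    return xt, noise

# Linear beta schedule (values illustrative).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))  # toy "image"
xt, eps = forward_diffuse(x0, t=500, alpha_bars=alpha_bars, rng=rng)
```

At `t=0` the sample is nearly the clean image; by `t=T-1` it is almost pure noise, which is exactly the progression worth logging to W&B while training.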
Discover methods for evaluating generative image models. Since a generated image has no single ground-truth answer, traditional metrics like accuracy don’t apply, and you’ll explore approaches designed for generative models instead.
Learning outcomes:
- Apply quantitative metrics such as Fréchet Inception Distance (FID) and Inception Score (IS) to measure image quality
- Use visualizations for qualitative evaluation
- Combine automated and human evaluation strategies
- Understand the trade-offs between evaluation methods
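As one concrete example of a quantitative metric, FID compares Gaussians fitted to Inception-network features of real and generated images. The sketch below computes the closed-form distance from precomputed feature statistics (in practice the means and covariances come from running images through Inception-v3; here they are assumed to be given):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu1, sigma1, mu2, sigma2):
    """Fréchet Inception Distance between two Gaussians:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * sqrt(S1 @ S2))."""
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary numerical artifacts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Lower is better: identical distributions give an FID of 0, and the score grows as the generated-image statistics drift from the real ones.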
Dive into large language models (LLMs) and learn how to evaluate their outputs systematically.
Learning outcomes:
- Set up tracing to capture prompts, responses, and metadata
- Evaluate LLM outputs for correctness, relevance, and safety
- Log and visualize evaluations in W&B
- Debug LLM behavior with structured traces
Get hands-on experience with fine-tuning large language models to adapt them for specific tasks.
Learning outcomes:
- Prepare and clean datasets for fine-tuning
- Fine-tune an LLM on a downstream task
- Track experiments and evaluate improvements with W&B
- Reflect on when fine-tuning is (and isn’t) the right approach
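The dataset-preparation step can be sketched as follows. The prompt/completion format and helper name are illustrative (the exact format depends on the fine-tuning framework you use); the point is the cleaning and train/validation split:

```python
import random

def build_finetune_records(pairs, seed=0, val_fraction=0.1):
    """Turn (instruction, answer) pairs into prompt/completion records
    and split them into train and validation sets."""
    records = [
        {"prompt": f"Instruction: {q}\nAnswer:", "completion": " " + a.strip()}
        for q, a in pairs
        if q.strip() and a.strip()  # drop empty or degenerate rows
    ]
    random.Random(seed).shuffle(records)  # fixed seed for reproducibility
    n_val = max(1, int(len(records) * val_fraction))
    return records[n_val:], records[:n_val]

pairs = [("What is 2+2?", "4"), ("Capital of France?", "Paris"), ("", "bad row")]
train, val = build_finetune_records(pairs)
```

Keeping a held-out validation split is what lets you measure, in W&B, whether fine-tuning actually improved the model rather than just memorizing the training data.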
- These notebooks are designed for hands-on learning.
- You’ll need your own Weights & Biases account to run logging and tracking examples.
- Some exercises may require external APIs (e.g., Hugging Face or OpenAI).
This course and its materials are brought to you by:
- DeepLearning.AI – advancing AI education for everyone.
- Weights & Biases – tools for experiment tracking, model evaluation, and debugging.