@luciaquirke (Collaborator) commented Sep 10, 2025

Add a callback application inspired by mechanistic interpretability: save gradients while training a 2-layer attention-only transformer, then use influence function scores, computed against a small query set of relevant sequences, to find the training step at which induction heads form.

TODO or remove:

  • Switch from mean loss to sum loss (Nora)

Library features:

  • Support querying FAISS for full scores (previously only TopK)
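
For readers unfamiliar with the idea, here is a minimal sketch of how per-step scores could locate a formation step. It is not the code in this PR: it swaps in a plain gradient dot product for a full influence function (which would normally also apply an inverse-Hessian or similar preconditioner), and the names `step_grads`, `query_grads`, `influence_per_step`, and `peak_step` are all hypothetical, as are the toy gradient values.

```python
# Toy sketch only: dot-product scores between saved training gradients and
# query gradients, aggregated per training step. All names and numbers here
# are illustrative assumptions, not this library's API.

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def influence_per_step(step_grads, query_grads):
    """For each step, sum dot(query_grad, step_grad) over all queries."""
    return {
        step: sum(dot(q, g) for q in query_grads)
        for step, g in step_grads.items()
    }


def peak_step(scores):
    """Return the training step with the highest aggregate score."""
    return max(scores, key=scores.get)


# Hypothetical gradients saved by a training callback, keyed by step.
step_grads = {
    100: [0.1, 0.0, -0.2],
    200: [0.9, 0.8, 0.1],
    300: [0.2, 0.1, 0.0],
}

# Hypothetical gradients of a small query set of induction-style sequences.
query_grads = [[1.0, 1.0, 0.0], [0.5, 1.0, 0.0]]

scores = influence_per_step(step_grads, query_grads)
best = peak_step(scores)
```

Computing a score for every step, rather than only the top few, is what makes it possible to see a peak across the whole training trajectory.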

@luciaquirke force-pushed the induction branch 4 times, most recently from fe074ae to 050749f (September 16, 2025 07:30)
@luciaquirke changed the base branch from main to heads (September 18, 2025 03:41)
@luciaquirke force-pushed the induction branch 2 times, most recently from b4fb366 to 8009ced (September 23, 2025 00:03)
@luciaquirke changed the base branch from heads to main (October 7, 2025 05:00)

@norabelrose (Member)

Should we merge this now?
