Draft

67 commits
e35bfd2
polished up writing tensors on last hidden state. about to push to 3090
Mar 27, 2025
e0bde7f
added some helpful scripts
Mar 27, 2025
ec8d44e
some extra scripts
Mar 28, 2025
e5a7e38
script enhancements
Mar 28, 2025
5d1c520
checkpoint: fingerprint can be generated by running an inference and …
kyle-pena-kuzco Mar 28, 2025
c833e7e
got the proofs probably in the logitsoutput processor. the threading…
kyle-pena-kuzco Mar 28, 2025
8a6460d
checkpoint before i switch to piggybacking on eagle speculative decod…
kyle-pena-kuzco Mar 29, 2025
8dec243
temporary commit before opening draft PR for comparison purposes
kyle-pena-kuzco Mar 29, 2025
4fe0579
stripped out some of my first pass hidden state activation code
kyle-pena-kuzco Mar 29, 2025
fbb7c48
stripped out some more stuff
kyle-pena-kuzco Mar 29, 2025
dce61ab
checkpoint before finishing wiring up returning proofs if --toploc-fi…
kyle-pena-kuzco Mar 30, 2025
0097130
checkpoint just prior to adding a lot of logging for verification toploc
kyle-pena-kuzco Mar 30, 2025
95c81dd
backed out of some weird changes that claude made
kyle-pena-kuzco Mar 30, 2025
cc6cb05
more weird changes undone
kyle-pena-kuzco Mar 30, 2025
32f82da
checkpiont before simplifying flag setting
kyle-pena-kuzco Mar 30, 2025
dde6093
proofs are generating, they just aren't being included on the respons…
kyle-pena-kuzco Mar 30, 2025
499d745
checkpoint - can't seem to get proofs to get received and transmissio…
kyle-pena-kuzco Mar 31, 2025
49091f2
verification proofs are being included with the response, although fo…
kyle-pena-kuzco Mar 31, 2025
0f76945
checkpoint on working the verification_proof_to_validate through the …
kyle-pena-kuzco Mar 31, 2025
af11ec3
fixed a typing issue with list vs str for verification_proof_to_validate
kyle-pena-kuzco Mar 31, 2025
f290f4c
checkpoint: got the verification proof to validate all the way into t…
kyle-pena-kuzco Mar 31, 2025
456b485
got verification proof to validate into the ForwardBatch of the model…
kyle-pena-kuzco Mar 31, 2025
b45b6a6
got verification to execute (not yet returned)
kyle-pena-kuzco Mar 31, 2025
34c8a87
got the verification results appearing in the response
kyle-pena-kuzco Mar 31, 2025
0f4a0cc
implemented input_token_ids in response if requested. implementation…
kyle-pena-kuzco Mar 31, 2025
22b2a95
just got the output ids to go out with the response.
kyle-pena-kuzco Apr 1, 2025
e13288c
added scripts and prelim results with spoofing
kyle-pena-kuzco Apr 1, 2025
d0a6a09
made proof generation only happen on prefill and last token
kyle-pena-kuzco Apr 1, 2025
7aed417
added a nice verificatoin readme
kyle-pena-kuzco Apr 1, 2025
088100c
updated backgroudn color for diagrams
kyle-pena-kuzco Apr 1, 2025
5508da4
force added missing image
kyle-pena-kuzco Apr 1, 2025
287234f
added quiet flag for test_ultrachat.py
kyle-pena-kuzco Apr 1, 2025
1ea7221
added quiet mode for test_spoof_ultrachat.py
kyle-pena-kuzco Apr 1, 2025
b67bce5
update to readme
kyle-pena-kuzco Apr 2, 2025
06c973d
began cleanup of repo
kyle-pena-kuzco Apr 3, 2025
93406f1
continuing cleanup - removed toploc-scripts folder
kyle-pena-kuzco Apr 3, 2025
7bd3d01
ignored toploc-scripts folder
kyle-pena-kuzco Apr 3, 2025
c8ba0d2
extensive cleanup and renaming. re-testing not complete.
kyle-pena-kuzco Apr 3, 2025
d684d11
fixed misnamed renamed property
kyle-pena-kuzco Apr 3, 2025
68c55d9
fixed another bug
kyle-pena-kuzco Apr 3, 2025
548d8ec
fixed some final issues introduced by the cleanup
kyle-pena-kuzco Apr 4, 2025
da5511e
fixed CUDA graph recapture-per-token bug. will test eagle shortly.
kyle-pena-kuzco Apr 4, 2025
b4012c7
updated verification readme with extra stuff for sam. added minimal …
kyle-pena-kuzco Apr 8, 2025
efd7b29
updated docs
kyle-pena-kuzco Apr 9, 2025
68bd43a
fixed the mermaid diagram colors
kyle-pena-kuzco Apr 9, 2025
eb62074
added back example script
kyle-pena-kuzco Apr 9, 2025
616e90e
updated readme a little
kyle-pena-kuzco Apr 9, 2025
1899952
added a quick note
kyle-pena-kuzco Apr 9, 2025
5917383
a few more notes on phases
kyle-pena-kuzco Apr 9, 2025
983968c
added some notes on my mindset
kyle-pena-kuzco Apr 9, 2025
a40e33d
small update to words
kyle-pena-kuzco Apr 9, 2025
ab040fc
last small change
kyle-pena-kuzco Apr 9, 2025
a52f4e4
more words
kyle-pena-kuzco Apr 9, 2025
977f7f1
wrote some test scripts, will use these to capture stats
kyle-pena-kuzco Apr 12, 2025
21617d0
wrote a bunch of test scripts for toploc and some for replication as …
kyle-pena-kuzco Apr 13, 2025
1e02b57
updated fingerprint batch size to 100
kyle-pena-kuzco Apr 13, 2025
b823804
added prefill attack test script
kyle-pena-kuzco Apr 13, 2025
053aaee
added a bunch of scripts, re-worked replication testing flow
kyle-pena-kuzco Apr 14, 2025
1ed640b
script updates
kyle-pena-kuzco Apr 14, 2025
e23fb49
re-organized the toploc-scripts folder.
kyle-pena-kuzco Apr 16, 2025
7bb8947
changes to scripts. removed file that shouldn't be there
kyle-pena-kuzco Apr 16, 2025
75ad272
lots of tests for various verification methods
kyle-pena-kuzco Apr 17, 2025
5164baf
updates to NLL based experiments. getting close i think
kyle-pena-kuzco Apr 20, 2025
4ba2a34
added top-k post-hoc renormalization.
kyle-pena-kuzco Apr 22, 2025
4701e20
pointed data collection scripts at a clean branch. added temperature…
kyle-pena-kuzco Apr 22, 2025
06859b5
removed results from repo i mistakenly committed
kyle-pena-kuzco Apr 22, 2025
0c34902
changed gitignore
kyle-pena-kuzco Apr 22, 2025
5 changes: 5 additions & 0 deletions .gitignore
@@ -1,3 +1,5 @@
meta-llama/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
@@ -127,6 +129,7 @@ venv/
ENV/
env.bak/
venv.bak/
.sglang

# Spyder project settings
.spyderproject
@@ -227,3 +230,5 @@ compile_commands.json
.vscode

1
**.npz
ret.json
178 changes: 178 additions & 0 deletions STEP_BY_STEP_README.md
@@ -0,0 +1,178 @@
## Introduction

Since you want to get your hands dirty, here's a quick guide on how to work through the verification flow step by step.

I'd also encourage you to check out [the verification README](/VERIFICATION_README.md) for more context.

## Setup

First, create a virtual environment at the root of the repository.
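
For example:

```
python -m venv .venv
```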

Activate the environment and install sglang:

```
pip install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python
pip install transformers==4.48.3
pip install datasets
```

You also need to set your `HF_TOKEN` environment variable to a token that has access to `meta-llama/Llama-3.1-8B-Instruct`. You can find mine in 1Password under Engineering.

```
export HF_TOKEN=...
```

## Example Script

Try running this script:

```
python toploc-scripts/minimal_example.py --disable-cuda-graph
```

**Note**: I've disabled CUDA graph because it introduces some kind of non-determinism in the prefill that makes verification occasionally fail (maybe 1 out of 6 times). This is new behavior compared to my testing from last week, so I'm hoping it's because I upgraded toploc to v0.1.4 and will be easily resolved. I've got a ticket to look into it.

## How to do it "By Hand"

First, you have to start the server in `--toploc-verification` mode.

Here is the command you can run to start the server:
```
python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 3001 --toploc-verification --toploc-verification-topk 128 --log-level debug --disable-cuda-graph
```

Now, you can send an inference request to the server, and you can see the fingerprint in the response:
```
import json
import openai

params = {
    "temperature": 0,
    "seed": 42,
}

client = openai.Client(base_url="http://127.0.0.1:3001/v1", api_key="None")

prompt = "What is the capital of Bulgaria?"
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": prompt},
    ],
    **params,
)
response_dump = response.model_dump()
print("Response received:")
print(json.dumps(response_dump, indent=4))
```

The response will contain a `toploc_verification_fingerprints` array:
```json
{
    "choices": [
        {
            "message": {
                "content": "Sofia",
                "toploc_verification_fingerprints": ["...", "..."]
            }
        }
    ]
}
```

There are typically two (one for the prefill and one for the last generated token); we're only interested in the last one.

Now, we need to validate the fingerprint. How do we do that? By sending it to a verification instance running the same model, along with the original prompt and response.

To build the verification request, you have to:

1. Append to the messages array, so that it includes both the original prompt and the assistant's response:
```json
{
    "role": "user",
    "content": "What is the capital of Bulgaria?"
},
// This is the response ---v to this ----^
{
    "role": "assistant",
    "content": (the response)
}
```

2. Set `max_tokens` to 0. This is what makes it a prefill.

3. Set `toploc_verification_fingerprint_to_validate` to the last fingerprint in the `toploc_verification_fingerprints` array.
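
Putting it together, here is a minimal sketch of the verification request, reusing `client`, `prompt`, `params`, and `response_dump` from the snippet above. It assumes the fork reads these extra fields from the request body, which the `openai` client can attach via `extra_body`:

```
message = response_dump["choices"][0]["message"]
fingerprint = message["toploc_verification_fingerprints"][-1]

verification_response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": prompt},
        # the assistant's original response, appended verbatim
        {"role": "assistant", "content": message["content"]},
    ],
    max_tokens=0,  # prefill only, no generation
    extra_body={
        # extra field passed through to the server as-is
        "toploc_verification_fingerprint_to_validate": fingerprint,
    },
    **params,
)
```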

The verification instance will respond with a `toploc_verification_fingerprint_validation_result`, which will look something like this (but serialized as a string):

```json
{
    "exp_mismatches": 1,
    "mant_err_mean": 0.75,
    "mant_err_median": 0.75
}
```

These error statistics are interpreted to determine whether the request counts as a verification pass or a verification failure.
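
For illustration, here is a hedged sketch of turning that result into a pass/fail decision. The thresholds are placeholders (in practice they would be calibrated against known-good runs), and `validation_result_str` is the serialized string from the response:

```
import json

# Placeholder thresholds for illustration only, not the fork's actual cutoffs.
MAX_EXP_MISMATCHES = 10
MAX_MANT_ERR_MEAN = 10.0

result = json.loads(validation_result_str)
verification_passed = (
    result["exp_mismatches"] <= MAX_EXP_MISMATCHES
    and result["mant_err_mean"] <= MAX_MANT_ERR_MEAN
)
```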


The implementation of this fork would have been much simpler if we had worked with the SGLang module directly in Python (i.e., `import sglang`), but that would have entailed basically rewriting how our workers work.

So, unfortunately, I had to devote a lot of code to pass-throughs to and from the API layer of SGLang.

**Important Note On Prefill Replication**

I am prefilling the original prompt + response by appending an assistant message to the messages array.

This may not work in all cases; for example, when the request includes tools.

Another concern is fragility. If SGLang changes the way it parses or generates responses, the model updates its chat template, and so on, then the same messages array will no longer correspond to the same token ID inputs.

For both of these reasons, I've implemented two other features to make prefill more robust:
1. `return_input_ids` - returns the token IDs of the prompt if included in the request
2. `return_output_ids` - returns the token IDs of the response if included in the request

The prefill request can then simply take `input_ids[:-1] + output_ids + EOT`, which is a far more reliable way to replicate the prompt + response.
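
As a hedged sketch, assuming `input_ids` and `output_ids` have been extracted from a response that requested them, and that `EOT` is the model's end-of-turn token (128009, `<|eot_id|>`, for Llama 3.1):

```
EOT_TOKEN_ID = 128009  # <|eot_id|> for Llama 3.1; model-specific

# Rebuild the exact prompt + response token sequence for the prefill request,
# per the formula above.
prefill_ids = input_ids[:-1] + output_ids + [EOT_TOKEN_ID]
```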

## How The Fork Works

It turns out that EAGLE speculative decoding has a lot in common with verification.

SGLang has an internal flag called `CaptureHiddenMode`, which has values of either `NONE`, `LAST`, or `FULL`.

These values refer to which of the hidden layers of the LLM should be "captured" so that when inference is complete, their values are accessible for use in EAGLE speculative sampling.

Ordinarily, `CaptureHiddenMode` is set to `NONE` unless some version of EAGLE is enabled.

I modified this logic so that when verification is enabled, `CaptureHiddenMode` is set to at least `LAST`.
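
A purely illustrative sketch of that flag logic (the enum members follow the names above; the helper function is hypothetical, not the fork's actual code):

```
from enum import IntEnum

class CaptureHiddenMode(IntEnum):
    NONE = 0
    LAST = 1
    FULL = 2

def resolve_capture_hidden_mode(current: CaptureHiddenMode, toploc_verification: bool) -> CaptureHiddenMode:
    # Verification needs at least the last hidden state, but never
    # downgrade FULL if EAGLE already requested it.
    if toploc_verification:
        return max(current, CaptureHiddenMode.LAST)
    return current
```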

Then, after inference is complete, I move the last hidden layer to the CPU.

At this point, the code path diverges.

1. If we are performing inference, I use the hidden layer to generate the toploc fingerprint, and return it with the response.

2. If we are verifying a fingerprint, I compare the hidden layer with the toploc fingerprint, and return the result with the response.

The "core" logic of fingerprint verification and fingerprint generation are part of the toploc library, which I have added as a dependency.

I could have re-implemented it all from scratch because I understand the math, but that seemed like a wasteful exercise when we have a working implementation available.

## What Makes The Fork Tricky

SGLang takes requests and puts them into a general purpose task scheduler.

Then, SGLang attempts to take tasks of the same kind and group them into batches.

The batches store information in arrays; in some cases the batch objects store nested data structures as flat arrays and use boundary indices to mark the contiguous regions that represent individual items in the batch. There are also several different kinds of batch-level objects (`ScheduleBatch`, `BatchTokenIDOut`, `LogitsProcessorOutput`, etc.).

So, there is quite a bit of "glue" required to correctly assemble the various kinds of batches and then slice them back apart into requests once inference is complete.
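
A small illustration of the flat-array-plus-boundaries pattern (hypothetical data, not SGLang's actual fields):

```
# All requests' values concatenated into one flat list; request i owns
# flat_values[boundaries[i]:boundaries[i + 1]].
flat_values = ["p0", "p1", "p2", "p3", "p4", "p5"]
boundaries = [0, 2, 3, 6]

per_request = [
    flat_values[boundaries[i]:boundaries[i + 1]]
    for i in range(len(boundaries) - 1)
]
print(per_request)  # [['p0', 'p1'], ['p2'], ['p3', 'p4', 'p5']]
```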

On top of that, there are additional layers of pass-through to the API layer of SGLang.

However, the EAGLE code provided plenty of precedent for how to do this.

So, if you explore up or down a few lines, you'll see a lot of code that is basically the same as the EAGLE code living right next to it. This is especially the case when it comes to handling `CaptureHiddenMode` and dealing with `hidden_states`.

This is also a divergent code path for the CUDA Graph Runner that needs to be handled properly.