From d0f313c5e9b0d362a8b4c29628ff0cdcf7e7bace Mon Sep 17 00:00:00 2001
From: Sinclair Hudson
Date: Wed, 10 Apr 2024 15:43:19 -0400
Subject: [PATCH 1/2] initial sketch of testing documentation

---
 README.md  |  5 +++--
 testing.md | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+), 2 deletions(-)
 create mode 100644 testing.md

diff --git a/README.md b/README.md
index b4cc493..e5006d8 100644
--- a/README.md
+++ b/README.md
@@ -16,7 +16,7 @@ LLM Finetuning toolkit is a config-based CLI tool for launching a series of LLM
 
 ### pipx (recommended)
 
-pipx installs the package and depdencies in a seperate virtual environment
+pipx installs the package and dependencies in a separate virtual environment
 
 ```shell
 pipx install llm-toolkit
@@ -153,7 +153,7 @@ lora:
   lora_dropout: 0.25
 ```
 
-#### Quality Assurance
+#### LLM testing
 
 ```yaml
 qa:
@@ -163,6 +163,7 @@ qa:
 ```
 
 - To ensure that the fine-tuned LLM behaves as expected, you can add tests that check if the desired behaviour is being attained. Example: for an LLM fine-tuned for a summarization task, we may want to check if the generated summary is indeed smaller in length than the input text. We would also like to learn the overlap between words in the original text and generated summary.
+- For more information and guidance on LLM testing, see our [LLM Testing Guidebook](testing.md)
 
 #### Artifact Outputs
 
diff --git a/testing.md b/testing.md
new file mode 100644
index 0000000..7b4b57a
--- /dev/null
+++ b/testing.md
@@ -0,0 +1,49 @@
+## LLM Testing Guidebook
+
+Below, we outline an initial guide to testing LLMs.
+LLMs are some of the hardest software systems to test, and have some unique challenges when compared to other ML systems.
+
+## Motivation
+
+Companies can be held [liable for their chatbot's outputs](https://www.cbc.ca/news/canada/british-columbia/air-canada-chatbot-lawsuit-1.7116416).
+Additionally, LLMs are expensive. It's important that every interaction is productive.
+Chatbots or other LLMs that are public facing can be subject to many different attacks, including
+[prompt injection attacks](https://www.reddit.com/r/ChatGPT/comments/18lxai7/prompt_injection_challenge_chevrolet_of/).
+
+## LLM Testing Difficulties
+
+LLM testing is extremely difficult, and at the time of writing there is no agreed upon "best practice".
+
+1. Unrestricted input - users can give the chatbot any instruction in the form of text.
+2. Unrestricted output - the LLM outputs human-readable text, which turns verifying certain properties into a natural language understanding problem.
+3. Original training data and procedure unknown - it's impossible to make claims about what the model has been trained on, and re-training from scratch is not feasible.
+4. Inference is expensive - With billions of parameters, any test run is expensive, and some testing techniques become computationally infeasible.
+
+## Testing Properties
+
+* Correctness
+  * Factual correctness - does the output contain strictly factual information?
+  * Stylistic correctness - does the model use a helpful and pleasant tone?
+  * Structural correctness - does the model's output follow a certain structure, like JSON or YAML?
+* Privacy - Does the LLM leak sensitive or private information?
+* Security - Is the LLM able to avoid prompt injection attacks?
+* Robustness - Does the LLM's behaviour change when extra spaces are added, or when the input is worded differently?
+* Fairness - The model's outputs are fair, with respect to gender, race, etc. and is equally helpful to all users
+* Model Relevance - Will the model perform well on the data it will encounter in production? Is it overfit? Underfit?
+
+## Where to test
+
+* Data Testing
+* Learning Program (Model) Testing
+
+## Debug and Repair
+
+How do we fix our LLM, if it has undesirable behaviour?
+
+1. Prompt - sometimes
+2. External Guardrails
+3. Finetuning
+
+
+
+

From 352090468c0d379df74a8b64b807424ade6b594e Mon Sep 17 00:00:00 2001
From: SinclairHudson
Date: Wed, 8 May 2024 00:19:08 -0400
Subject: [PATCH 2/2] adding more to testing documentation

---
 testing.md | 42 ++++++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/testing.md b/testing.md
index 7b4b57a..dbc4322 100644
--- a/testing.md
+++ b/testing.md
@@ -3,46 +3,56 @@
 Below, we outline an initial guide to testing LLMs.
 LLMs are some of the hardest software systems to test, and have some unique challenges when compared to other ML systems.
 
-## Motivation
+## Motivation: Why should we test LLMs?
 
 Companies can be held [liable for their chatbot's outputs](https://www.cbc.ca/news/canada/british-columbia/air-canada-chatbot-lawsuit-1.7116416).
 Additionally, LLMs are expensive. It's important that every interaction is productive.
-Chatbots or other LLMs that are public facing can be subject to many different attacks, including
+Chatbots or other LLMs that are public-facing can be subject to many different attacks, including
 [prompt injection attacks](https://www.reddit.com/r/ChatGPT/comments/18lxai7/prompt_injection_challenge_chevrolet_of/).
 
-## LLM Testing Difficulties
-
-LLM testing is extremely difficult, and at the time of writing there is no agreed upon "best practice".
+## LLM Testing Difficulties: Why is testing hard?
 
 1. Unrestricted input - users can give the chatbot any instruction in the form of text.
 2. Unrestricted output - the LLM outputs human-readable text, which turns verifying certain properties into a natural language understanding problem.
 3. Original training data and procedure unknown - it's impossible to make claims about what the model has been trained on, and re-training from scratch is not feasible.
 4. Inference is expensive - With billions of parameters, any test run is expensive, and some testing techniques become computationally infeasible.
 
-## Testing Properties
+## Testing Properties: What should be tested?
 
 * Correctness
   * Factual correctness - does the output contain strictly factual information?
   * Stylistic correctness - does the model use a helpful and pleasant tone?
   * Structural correctness - does the model's output follow a certain structure, like JSON or YAML?
 * Privacy - Does the LLM leak sensitive or private information?
-* Security - Is the LLM able to avoid prompt injection attacks?
+* Security - Is the LLM able to avoid prompt injection attacks, or other attempts to elicit problematic responses?
 * Robustness - Does the LLM's behaviour change when extra spaces are added, or when the input is worded differently?
-* Fairness - The model's outputs are fair, with respect to gender, race, etc. and is equally helpful to all users
+* Fairness - Are the model's outputs fair with respect to gender, race, etc., and is the model equally helpful to all users?
 * Model Relevance - Will the model perform well on the data it will encounter in production? Is it overfit? Underfit?
 
-## Where to test
+These properties can be tested in **multiple different ways**. See our [test documentation]() for pre-built tests and examples of how to use them.
+
+## General Strategies
+
+LLM testing is extremely difficult, and at the time of writing there is no agreed-upon "best practice".
 
-* Data Testing
-* Learning Program (Model) Testing
+#### 1. Build a focused test suite
 
-## Debug and Repair
+LLM testing can become very expensive very quickly; inference is always expensive and sometimes a second language model is needed to evaluate the first.
+As such, **focus on use-cases and failure cases that are mission-critical**, and focus on testing **the limit of the model's capabilities**.
+A lot of benchmarks and tests online consist of fairly easy questions that modern LLMs always get right, possibly because the questions were included in training data at some point.
+It's also unlikely that there will be a benchmark that tests for exactly the kinds of queries your customers might bring to the LLM.
+It's important to curate a set of tests specific to your use-case, and to review and update that suite when requirements change.
 
-How do we fix our LLM, if it has undesirable behaviour?
+#### 2. Guardrails
 
-1. Prompt - sometimes
-2. External Guardrails
-3. Finetuning
+Inevitably, testing will surface undesirable behaviour, and fixing these bugs becomes the next objective.
+Debugging an LLM itself can be difficult, expensive, and slow.
+Traditional finetuning and parameter-efficient finetuning can yield mixed results.
+Sometimes, it's more efficient and practical to simply filter malicious prompts before they get to the LLM, and/or filter out poor responses before they get to the customer.
+For example, a chatbot at a car dealership could have a simple NLP model to classify a user's query as relevant or irrelevant to the vehicles for sale.
+In the case of an irrelevant prompt, the chatbot doesn't employ an LLM to respond, and simply states that the query is irrelevant.
+Likewise, a chatbot could have its response naively checked for blacklisted words, and abort if the response would have contained foul language.
+We call these checks on the input/output of the model "guardrails", and though they aren't as elegant as retraining the model, they can be quite effective.
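
The two guardrails described above (an input relevance filter and an output word check) could be sketched roughly as follows. This is a minimal illustration and not part of the toolkit: every name in it (`RELEVANT_KEYWORDS`, `BLOCKED_WORDS`, `is_relevant`, `is_clean`, `guarded_reply`) is hypothetical, and simple keyword matching stands in for the "simple NLP model" mentioned in the text.

```python
import re
from typing import Callable, Set

# Hypothetical vocabulary for a car-dealership chatbot. Keyword matching is an
# assumption made for brevity; a real deployment would likely use a classifier.
RELEVANT_KEYWORDS = {"car", "truck", "suv", "vehicle", "price", "lease", "financing", "warranty"}

# Hypothetical blocklist; mild placeholder words stand in for foul language.
BLOCKED_WORDS = {"darn", "heck"}

IRRELEVANT_MESSAGE = "Sorry, I can only help with questions about our vehicles."
BLOCKED_MESSAGE = "Sorry, I can't provide a response to that."


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return the set of alphabetic words it contains."""
    return set(re.findall(r"[a-z']+", text.lower()))


def is_relevant(prompt: str) -> bool:
    """Input guardrail: does the prompt mention any dealership-related keyword?"""
    return bool(tokenize(prompt) & RELEVANT_KEYWORDS)


def is_clean(response: str) -> bool:
    """Output guardrail: naive check that no blocked word appears in the response."""
    return not (tokenize(response) & BLOCKED_WORDS)


def guarded_reply(prompt: str, llm: Callable[[str], str]) -> str:
    """Wrap an LLM call with an input filter and an output filter."""
    if not is_relevant(prompt):
        # The LLM is never called for irrelevant prompts, which also saves inference cost.
        return IRRELEVANT_MESSAGE
    response = llm(prompt)
    if not is_clean(response):
        # Abort instead of sending a response that contains blocked words.
        return BLOCKED_MESSAGE
    return response
```

Here `llm` is any callable that maps a prompt string to a response string, so the same wrapper can be exercised in tests with a stub function before it is connected to a real model.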