dataset blending tool #433
base: main
Conversation
great!
@tscholak @RaymondLi0 we already have all the machinery we need for blending datasets in Fast-LLM. The discovery implementation here looks at the yaml configs, which causes complications because of all the ways they could be defined. A much simpler option would be to just look at the prepared dataset files. As for reading, it should be done by instantiating the dataset; it will take care of everything and avoid trouble. Lastly, note that almost every "tool" we've had so far quickly went out of sync and became useless. To prevent this script from suffering the same fate, we'll want to bring it inside the main codebase.
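(For illustration only: a minimal sketch of the direction suggested above, where the instantiated dataset handles reading and reports its own size rather than having sizes re-derived from yaml. `PreparedDataset` and `num_tokens` are hypothetical names, not Fast-LLM APIs.)

```python
# Hypothetical sketch: let the dataset object do the reading and report its own
# token count, then derive blending weights from those counts.
# `PreparedDataset` and `num_tokens` are illustrative names, not Fast-LLM APIs.
from dataclasses import dataclass
from pathlib import Path


@dataclass
class PreparedDataset:
    """Stand-in for a dataset object created from a prepared dataset directory."""
    path: Path
    num_tokens: int  # In a real implementation this would come from the dataset itself.


def blending_weights(datasets: list[PreparedDataset]) -> list[float]:
    """Weight each dataset by its share of the total token count."""
    total = sum(d.num_tokens for d in datasets)
    return [d.num_tokens / total for d in datasets]
```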
Made the following updates:
✨ Description
Tool to recursively discover datasets in a directory and generate a blended dataset config.
This tool walks through a directory tree, identifies datasets by their fast_llm_config*.yaml files,
and generates a config file that blends all discovered datasets with weights proportional to token counts.
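A minimal sketch of that workflow (not the code in this PR), assuming each `fast_llm_config*.yaml` records its dataset's size under a `num_tokens` key and that a flat `datasets`/`weights` layout is an acceptable output format; both assumptions are illustrative only.

```python
# Minimal sketch of the described workflow: recursively discover per-dataset
# configs and emit a blended config with weights proportional to token counts.
# The "num_tokens" key and the output layout are assumptions, not the PR's format.
from pathlib import Path

import yaml


def discover_dataset_configs(root: str) -> list[Path]:
    """Recursively collect per-dataset config files under `root`."""
    return sorted(Path(root).rglob("fast_llm_config*.yaml"))


def build_blended_config(root: str) -> dict:
    """Blend all discovered datasets with weights proportional to token counts."""
    entries = []
    for config_path in discover_dataset_configs(root):
        with config_path.open() as f:
            config = yaml.safe_load(f)
        # "num_tokens" is an assumed metadata field; the real key may differ.
        entries.append((config_path, float(config["num_tokens"])))

    total_tokens = sum(tokens for _, tokens in entries)
    return {
        "type": "blended",
        "datasets": [str(path) for path, _ in entries],
        # Normalize so the weights sum to 1 while staying proportional to size.
        "weights": [tokens / total_tokens for _, tokens in entries],
    }


if __name__ == "__main__":
    print(yaml.safe_dump(build_blended_config("/data/prepared")))
```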
🔍 Type of change
Select all that apply: