Use `--log_samples` when calling the harness, and upload the logged samples to a separate repo for later diagnostics. See: https://github.com/EleutherAI/lm-evaluation-harness/issues/1842
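
A minimal sketch of what this could look like, assuming the `lm_eval` CLI is installed and that the "separate repo" is a dataset repo on the Hugging Face Hub (the note does not say where the samples should go, so that part is an assumption). The helper names `build_eval_command`, `upload_samples`, and `run_and_upload` are hypothetical, as are any model, task, directory, and repo values passed to them.

```python
import subprocess
from pathlib import Path


def build_eval_command(model_args: str, tasks: list[str], output_dir: str) -> list[str]:
    """Build an lm_eval invocation that writes per-sample logs to output_dir."""
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
        "--output_path", output_dir,  # --log_samples needs somewhere to write
        "--log_samples",
    ]


def upload_samples(output_dir: str, repo_id: str) -> None:
    """Push the output directory to a separate repo (assumed: a Hub dataset repo)."""
    # Imported lazily so the command builder works without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id, repo_type="dataset", private=True, exist_ok=True)
    api.upload_folder(folder_path=output_dir, repo_id=repo_id, repo_type="dataset")


def run_and_upload(model_args: str, tasks: list[str], output_dir: str, repo_id: str) -> None:
    """Run the evaluation with sample logging, then upload the results."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_eval_command(model_args, tasks, output_dir), check=True)
    upload_samples(output_dir, repo_id)
```

For example, `run_and_upload("pretrained=gpt2", ["hellaswag"], "results", "my-org/eval-samples")` would run the evaluation and then push everything under `results` to the placeholder repo `my-org/eval-samples`. Keeping the samples in their own repo keeps the large per-sample files out of the main results repo while still leaving them available for later diagnostics.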