When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models

Overview

  • Goal: Assess how irrelevant audio interferes with textual reasoning across audio-language models.
  • Benchmarks: GSM8K, MMLU, ARC-Challenge.
  • Settings:
    • text_bench: text-only baseline
    • text_bench_interference: text + irrelevant audio

Public datasets used in the paper are available on the Hugging Face Hub:

  • lca0503/audio_interference_gsm8k
  • lca0503/audio_interference_mmlu
  • lca0503/audio_interference_arc_challenge

Setup

  1. Clone and enter the repo
git clone https://github.com/lca0503/AudioInterference.git
cd AudioInterference
  2. Install dependencies
pip install -r requirements.txt

Build datasets (optional)

You can construct the interference datasets by pairing each test question with a random audio file (.wav). Provide a folder of audio files and a target Hub repo name.

Example:

python build_dataset/gsm8k.py \
  --audio_path /path/to/wavs \
  --repo_name your-username/audio_interference_gsm8k \
  --seed 0
python build_dataset/mmlu.py \
  --audio_path /path/to/wavs \
  --repo_name your-username/audio_interference_mmlu \
  --seed 0
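
For reference, the construction is essentially a random pairing of questions with audio clips. A minimal sketch of the idea, assuming the Hugging Face datasets library (the actual build scripts may differ in details such as column names):

import random
from pathlib import Path

from datasets import load_dataset

random.seed(0)
wavs = sorted(Path("/path/to/wavs").glob("*.wav"))

# Attach one randomly chosen audio file to every GSM8K test question.
ds = load_dataset("openai/gsm8k", "main", split="test")
ds = ds.map(lambda ex: {"audio": str(random.choice(wavs))})

# Publish the paired dataset to the Hub under your own namespace.
ds.push_to_hub("your-username/audio_interference_gsm8k")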

Utility generators (optional):

  • Gaussian noise
python build_dataset/generate_noise.py \
  --output_dir ./noise_wavs \
  --num_noise 1000 \
  --sampling_rate 16000 \
  --duration 5 \
  --sigma 0.01 \
  --seed 0
  • Silence
python build_dataset/generate_silence.py \
  --output_dir ./silence_wavs \
  --num_silence 1000 \
  --sampling_rate 16000 \
  --duration 5
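
Both generators boil down to writing fixed-length waveforms. A minimal sketch of what they produce, assuming numpy and soundfile (illustrative, not the exact scripts):

import numpy as np
import soundfile as sf

sr, duration, sigma = 16000, 5, 0.01
rng = np.random.default_rng(0)

# Gaussian noise: zero-mean samples with standard deviation sigma.
noise = rng.normal(0.0, sigma, size=sr * duration).astype(np.float32)
sf.write("noise_0000.wav", noise, sr)

# Silence: an all-zero waveform of the same length.
sf.write("silence_0000.wav", np.zeros(sr * duration, dtype=np.float32), sr)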

Inference

Common arguments

  • --task_id: one of audio_interference_gsm8k, audio_interference_mmlu, audio_interference_arc_challenge
  • --task_split: one of silence, noise, fsd, etc.
  • --task_type: text_bench or text_bench_interference
  • --mitigate_prompt: add this flag to use the mitigation prompt
  • --output_path: JSONL path to save results

Results are saved as JSONL with fields like subject, task, prompt, query, choices, response, answer.
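
Each line of the output file is one JSON object, so results can be inspected with standard tooling. A minimal sketch, assuming the field names listed above:

import json

with open("outputs/qwen25omni/silence_arc_challenge.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Print the question alongside the model's response and the gold answer.
        print(record["query"], "->", record["response"], "| gold:", record["answer"])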

Examples:

  • Qwen2.5-Omni (via vLLM)
python inference_qwen25omni.py \
  --task_id lca0503/audio_interference_arc_challenge \
  --task_split silence \
  --task_type text_bench_interference \
  --model_id Qwen/Qwen2.5-Omni-7B \
  --output_path outputs/qwen25omni/silence_arc_challenge.jsonl \
  --temperature 0 \
  --seed 0
  • Phi-4-multimodal-instruct (via vLLM + LoRA)
python inference_phi4mm.py \
  --task_id lca0503/audio_interference_gsm8k \
  --task_split noise \
  --task_type text_bench_interference \
  --model_id microsoft/Phi-4-multimodal-instruct \
  --output_path outputs/phi4mm/noise_gsm8k.jsonl \
  --temperature 0 \
  --seed 0
  • Voxtral (Mistral-format via vLLM)
python inference_voxtral.py \
  --task_id lca0503/audio_interference_mmlu \
  --task_split fsd \
  --task_type text_bench_interference \
  --model_id mistralai/Voxtral-Mini-3B-2507 \
  --output_path outputs/voxtralmini/fsd_mmlu.jsonl \
  --seed 0
  • DeSTA2.5-Audio
python inference_desta.py \
  --task_id lca0503/audio_interference_mmlu \
  --task_split fsd \
  --task_type text_bench_interference \
  --model_id DeSTA-ntu/DeSTA2.5-Audio-Llama-3.1-8B \
  --output_path outputs/desta25/fsd_mmlu.jsonl \
  --seed 0

Notes

  • Adjust CUDA/vLLM configs per your hardware (GPU memory, max_model_len, etc.).
  • For text-only baselines, set --task_type text_bench.

Evaluation

Accuracy

Compute accuracy from JSONL results.

Example:

python evaluate.py \
  --input_path outputs/qwen25omni/silence_arc_challenge.jsonl \
  --task_id arc
python evaluate.py \
  --input_path outputs/phi4mm/noise_gsm8k.jsonl \
  --task_id gsm8k
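
Conceptually, the accuracy computation reduces to parsing an answer out of each response and comparing it with the gold label. A simplified sketch for multiple-choice tasks (the real evaluate.py likely uses more robust answer extraction):

import json
import re

def extract_answer(response: str) -> str:
    # Simplified parser: take the last standalone A-D letter in the response.
    matches = re.findall(r"\b([A-D])\b", response)
    return matches[-1] if matches else ""

def accuracy(path: str) -> float:
    records = [json.loads(line) for line in open(path)]
    correct = sum(extract_answer(r["response"]) == r["answer"] for r in records)
    return correct / len(records)

print(accuracy("outputs/qwen25omni/silence_arc_challenge.jsonl"))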

Self-consistency (majority vote over multiple responses):

python evaluate.py --input_path your.jsonl --task_id mmlu --scs

When using self-consistency, ensure each sample's response in the JSONL is a list of strings.
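
The vote itself is a simple mode over the parsed answers. A minimal sketch, assuming each entry in the response list parses to one candidate answer:

from collections import Counter

def majority_vote(answers: list[str]) -> str:
    # Most frequent answer wins; ties go to the earliest-seen answer.
    return Counter(answers).most_common(1)[0][0]

print(majority_vote(["B", "B", "C"]))  # -> "B"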

Influence rate (compare vs. text_bench)

Compute the influence rate (IR) between an interference run and the corresponding text-only baseline.

python influence_rate.py \
  --input_path outputs/qwen25omni/silence_mmlu.jsonl \
  --target_path outputs/qwen25omni/mmlu.jsonl \
  --task_id mmlu
python influence_rate.py \
  --input_path outputs/phi4mm/noise_gsm8k.jsonl \
  --target_path outputs/phi4mm/gsm8k.jsonl \
  --task_id gsm8k
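
In essence, the influence rate is the fraction of samples whose parsed answer changes once irrelevant audio is added. A minimal sketch, assuming the two JSONL files are aligned by line order and reusing the simplified extract_answer from the accuracy sketch above:

import json

def influence_rate(interfered_path: str, baseline_path: str) -> float:
    interfered = [json.loads(line) for line in open(interfered_path)]
    baseline = [json.loads(line) for line in open(baseline_path)]
    # Count samples whose answer flips between the two runs.
    changed = sum(
        extract_answer(a["response"]) != extract_answer(b["response"])
        for a, b in zip(interfered, baseline)
    )
    return changed / len(interfered)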

Self-consistency (majority vote over multiple responses):

python influence_rate.py --input_path interfered.jsonl --target_path text.jsonl --task_id mmlu --scs

Citation

If you find our code or models helpful, please consider citing our paper using the following BibTeX:

@article{li2025silence,
  title={When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models},
  author={Li, Chen-An and Lin, Tzu-Han and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2510.00626},
  year={2025}
}
