Martin-Labenne
This PR improves SpeechToTextTool to ensure it works out of the box as expected, addressing issue #1478.

At a high level:

  • Explicitly passes the correct Whisper sampling_rate to the pre-processor, and ensures the attention_mask is generated and passed through to the model’s forward method, avoiding unreliable behavior during inference.
  • Introduces explicit language control when instantiating the class (e.g. language='en'), while preserving auto-detection when language is omitted or None.
  • Resamples audio inputs to the model’s expected sampling rate to avoid silent transcription failures.
  • Adds support for longer audio files so that transcriptions are complete.
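
To make the resampling and long-audio points concrete, here is a minimal, standard-library-only sketch. It is not the code from this PR: the helper names `resample_linear` and `split_into_chunks` are hypothetical, linear interpolation stands in for whatever resampler the tool actually uses, and fixed 30-second chunking is only one assumed way of handling long inputs. The 16 kHz rate and 30-second window are Whisper's documented input conventions.

```python
from typing import List

TARGET_RATE = 16_000  # Whisper's feature extractor expects 16 kHz audio
CHUNK_SECONDS = 30    # Whisper consumes audio in 30-second windows


def resample_linear(samples: List[float], src_rate: int,
                    dst_rate: int = TARGET_RATE) -> List[float]:
    """Resample mono audio by linear interpolation (hypothetical helper)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate   # position in the source signal
        lo = min(int(pos), last)        # left neighbour, clamped to the end
        hi = min(lo + 1, last)          # right neighbour, clamped to the end
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def split_into_chunks(samples: List[float], rate: int = TARGET_RATE,
                      seconds: int = CHUNK_SECONDS) -> List[List[float]]:
    """Split audio into consecutive windows of at most `seconds` seconds."""
    size = rate * seconds
    return [samples[i:i + size] for i in range(0, len(samples), size)]


if __name__ == "__main__":
    one_second_at_8k = [0.0] * 8_000
    audio = resample_linear(one_second_at_8k, src_rate=8_000)
    print(len(audio), len(split_into_chunks(audio)))
```

In a real pipeline each chunk would then go through the Whisper pre-processor with the sampling rate stated explicitly, and the per-chunk transcriptions would be joined. A production resampler would also apply anti-aliasing filtering, which this sketch omits.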

Note: This is my first contribution to an open-source project. I'm excited to be part of it and open to any feedback or suggestions. Thanks a lot!