Live transcription #10
base: main
Conversation
…th custom key bindings and toolbar
…ance interrupt timer functionality
…es and update related templates and tests
…-related commands and cleaning up unnecessary sections
…y and functionality
…in transcription process
- Introduced `live_types.py` to define data structures for audio frames, speech chunks, transcript segments, and dashboard events (see the sketch below).
- Implemented `vad_chunker.py` for voice activity detection and chunking of audio streams.
- Updated README with debugging instructions for live transcription.
- Created `debug_live_transcript.py` for inspecting audio chunks.
- Added `test_live_vad.py` to test VAD functionality with audio capture.
- Enhanced `test_device_manager.py` to include tests for aggregate device detection.
- Developed `test_live_transcriber.py` to validate live transcription events and exports.
- Modified `test_settings.py` to reflect updated default settings.
- Added `test_vad_chunker.py` to ensure the VAD chunker emits chunks correctly.
- Updated `whisper_transcriber.py` to support segment callbacks during transcription.
- Enhanced `batch_processor.py` to allow segment callbacks during batch processing.
- Updated dependency management in `uv.lock` for new packages and versions.
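For orientation, here is a minimal sketch of the shapes these structures appear to have, judging from how the diff below uses them; the actual fields in `live_types.py` may differ:

```python
# Hypothetical reconstruction of live_types.py based on usage in the
# diff (chunk.data / chunk.start / chunk.end, DashboardEvent(type=...,
# payload=...)); not the PR's actual definitions.
from dataclasses import dataclass, field


@dataclass
class SpeechChunk:
    data: bytes    # raw 16-bit mono PCM for one detected speech region
    start: float   # wall-clock start of the chunk, in seconds
    end: float     # wall-clock end of the chunk, in seconds


@dataclass
class TranscriptSegment:
    text: str
    start: float   # absolute start time within the session, in seconds
    end: float     # absolute end time within the session, in seconds


@dataclass
class DashboardEvent:
    type: str      # e.g. "chunk" or "segment"
    payload: dict = field(default_factory=dict)
```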
📦 Package Published to TestPyPI
Version:
🧪 This is a test release. Install with:
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ chirp-notes-ai
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"
If Codex has suggestions, it will comment; otherwise it will react with 👍.
```python
def _process_chunk(self, chunk: SpeechChunk):
    # Publish a dashboard event for the chunk, then buffer its PCM data.
    self._publish_event("chunk", {"duration": chunk.end - chunk.start})
    self._pcm_buffer.extend(chunk.data)
    self._last_chunk_end = max(self._last_chunk_end, chunk.end)

    self._maybe_transcribe(force=False)

def _publish_event(self, event_type: str, payload: dict):
    event = DashboardEvent(type=event_type, payload=payload)
    try:
        self.event_queue.put_nowait(event)
    except queue.Full:
        # Drop the event rather than block the audio pipeline.
        pass

@staticmethod
def _convert_chunk_to_array(chunk_bytes: bytes) -> np.ndarray:
    if not chunk_bytes:
        return np.array([], dtype=np.float32)
    # Interpret raw bytes as 16-bit PCM and normalize to [-1.0, 1.0).
    pcm = np.frombuffer(chunk_bytes, dtype=np.int16).astype(np.float32)
    if pcm.size == 0:
        return np.array([], dtype=np.float32)
    normalized = pcm / 32768.0
    return np.ascontiguousarray(normalized, dtype=np.float32)

@staticmethod
def _resample_audio(
    audio: np.ndarray, original_rate: int, target_rate: int
) -> np.ndarray:
    if original_rate == target_rate or audio.size == 0:
        return audio
    # Linear-interpolation resampling onto a uniformly spaced grid.
    duration = audio.shape[0] / float(original_rate)
    target_length = max(1, int(round(duration * target_rate)))
    x_old = np.linspace(0, duration, num=audio.shape[0], endpoint=False)
    x_new = np.linspace(0, duration, num=target_length, endpoint=False)
    resampled = np.interp(x_new, x_old, audio)
    return np.ascontiguousarray(resampled.astype(np.float32))

def _maybe_transcribe(self, force: bool):
    if not self._pcm_buffer:
        return

    # Throttle transcription unless forced or the interval has elapsed.
    if not force and self.transcription_interval > 0:
        if (
            self._last_chunk_end - self._last_transcribe_at
            < self.transcription_interval
        ):
            return

    pcm_bytes = bytes(self._pcm_buffer)

    # Write the buffered PCM to a temporary mono 16-bit WAV file.
    with tempfile.NamedTemporaryFile(
        suffix=".wav",
        delete=False,
        dir="/tmp" if Path("/tmp").exists() else None,
    ) as tmp:
        temp_path = Path(tmp.name)
        with wave.open(tmp, "wb") as fh:
            fh.setnchannels(1)
            fh.setsampwidth(2)
            fh.setframerate(self.sample_rate)
            fh.writeframes(pcm_bytes)

    try:
        result = self.transcriber.transcribe_file(
            temp_path,
            fast_mode=True,
            language=self._language,
        )
    finally:
        if temp_path.exists():
            temp_path.unlink(missing_ok=True)

    # Lock onto the detected language once Whisper reports one.
    metadata = result.get("metadata", {})
    if metadata and metadata.get("language") and not self._language:
        self._language = metadata.get("language")

    segments = result.get("segments", [])
    new_segments: list[TranscriptSegment] = []

    max_end = self._last_chunk_end
    for seg in segments:
        text = seg.get("text", "").strip()
        if not text:
            continue
        start = float(seg.get("start", 0.0))
        end = float(seg.get("end", start))

        absolute_start = self._buffer_offset_seconds + start
        absolute_end = self._buffer_offset_seconds + end
```
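As a quick, standalone sanity check of the linear resampler above (not part of the PR; the function body is inlined with concrete rates):

```python
import numpy as np

# One second of a 440 Hz tone at 48 kHz, resampled to 16 kHz using the
# same np.interp scheme as _resample_audio above.
audio = np.sin(2 * np.pi * 440 * np.arange(48_000) / 48_000).astype(np.float32)
duration = audio.shape[0] / 48_000.0
target_length = max(1, int(round(duration * 16_000)))
x_old = np.linspace(0, duration, num=audio.shape[0], endpoint=False)
x_new = np.linspace(0, duration, num=target_length, endpoint=False)
resampled = np.interp(x_new, x_old, audio).astype(np.float32)

print(resampled.shape)  # (16000,)
```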
Use actual chunk timestamps when computing transcript offsets
The live transcriber builds absolute timestamps from `self._buffer_offset_seconds` (the length of audio already sent to Whisper) and ignores the real start time of each `SpeechChunk`. Because the VAD chunker strips silence before queuing a chunk, `self._buffer_offset_seconds` only advances by speech duration, so any gap between chunks is dropped. For example, if a user speaks for 1 s, stays silent for 5 s, and then speaks again, the second segment will be emitted around 2 s after recording started instead of ~6 s. This causes the dashboard and the exported transcript to drift whenever there is silence. Track the actual wall-clock offset (e.g. the chunk's `start` time) when appending to the buffer, and base `absolute_start`/`absolute_end` on that instead of on accumulated audio length.
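A minimal sketch of one way to implement this, assuming the transcriber keeps a list of per-chunk anchors (the names `_chunk_anchors` and `_to_wall_clock` are illustrative, not from the PR):

```python
# Illustrative sketch: record where each chunk lands in the PCM buffer
# together with its real (wall-clock) start time, then map Whisper's
# buffer-relative timestamps through the nearest preceding anchor.
def _process_chunk(self, chunk: SpeechChunk):
    # 16-bit mono PCM: 2 bytes per sample.
    buffer_pos = len(self._pcm_buffer) / (2.0 * self.sample_rate)
    self._chunk_anchors.append((buffer_pos, chunk.start))
    self._publish_event("chunk", {"duration": chunk.end - chunk.start})
    self._pcm_buffer.extend(chunk.data)
    self._last_chunk_end = max(self._last_chunk_end, chunk.end)
    self._maybe_transcribe(force=False)

def _to_wall_clock(self, buffer_time: float) -> float:
    # Find the last chunk that starts at or before this buffer position
    # and offset from its wall-clock start, so silence stripped by the
    # VAD chunker no longer compresses the timeline.
    anchor_pos, anchor_wall = self._chunk_anchors[0]
    for pos, wall in self._chunk_anchors:
        if pos > buffer_time:
            break
        anchor_pos, anchor_wall = pos, wall
    return anchor_wall + (buffer_time - anchor_pos)

# In _maybe_transcribe, the segment times would then become:
# absolute_start = self._to_wall_clock(start)
# absolute_end = self._to_wall_clock(end)
```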
Useful? React with 👍 / 👎.