
Commit c42b85c

Merge pull request #14 from codefortulsa/sub-titles

Create subtitles from transcripts

2 parents: 36109a1 + 716bc8a

14 files changed (+2512, -1262 lines)

.gitignore

Lines changed: 3 additions & 1 deletion
```diff
@@ -6,7 +6,10 @@ models/
 # Include specific directories
 !src/models/
 
+# Jupyter notebook
 notebooks/.ipynb_checkpoints/
+.ipynb_checkpoints/
+*/.ipynb_checkpoints/*
 
 # Python
 __pycache__/
@@ -47,4 +50,3 @@ build/
 npm-debug.log*
 yarn-debug.log*
 yarn-error.log*
-
```

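Of the three checkpoint patterns, the bare `.ipynb_checkpoints/` entry already matches checkpoint directories at any depth (a gitignore pattern with no leading slash matches at every level), so the `notebooks/`-specific and wildcard entries are redundant but harmless.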
.pre-commit-config.yaml

Lines changed: 17 additions & 0 deletions
```yaml
repos:
  - repo: https://github.com/kynan/nbstripout
    rev: 0.7.1
    hooks:
      - id: nbstripout
        name: Strip Jupyter notebook output cells
        description: Clear output from Jupyter notebooks before committing
        files: \.ipynb$

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files
        args: ['--maxkb=500']
```

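The config pins two hook repos: `kynan/nbstripout` for the notebook stripping, plus the standard `pre-commit-hooks` collection for whitespace, YAML, and large-file checks. While debugging, a single hook can be run by id, e.g. `poetry run pre-commit run nbstripout --all-files`.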
NOTEBOOK_GUIDELINES.md

Lines changed: 53 additions & 0 deletions
# Jupyter Notebook Guidelines

## Automatic Output Stripping

This repository is configured with a pre-commit hook that automatically strips output cells from Jupyter notebooks before they are committed to Git. This keeps the repository size manageable by avoiding the storage of large outputs such as images, graphs, and videos in the Git history.

### How It Works

1. The `nbstripout` pre-commit hook runs automatically before each commit.
2. It removes all output cells, execution counts, and metadata from notebooks (see the sketch below).
3. Your notebook file is stripped only in the Git repository; your local file keeps its outputs.
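
For illustration, the transformation amounts to roughly the following (a minimal sketch using the `nbformat` library, not `nbstripout`'s actual implementation; the notebook path is a placeholder):

```python
import nbformat

# Read the notebook (placeholder path), clear the fields that
# nbstripout clears, and write the result back in place.
path = "notebooks/your_notebook.ipynb"
nb = nbformat.read(path, as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.outputs = []            # drop stdout/stderr, images, widget views
        cell.execution_count = None  # reset the In[n] counter
nbformat.write(nb, path)
```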

### Setup for New Contributors

If you're cloning this repository for the first time, you need to set up the pre-commit hooks:

```bash
# Install poetry dependencies, including the pre-commit tools
poetry install

# Install the pre-commit hooks
poetry run pre-commit install
```
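
Running `pre-commit install` writes a hook script to `.git/hooks/pre-commit`; that script is what triggers the configured hooks on each subsequent commit.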

### Testing the Setup

To verify that the pre-commit hooks are working correctly, you can run:

```bash
poetry run pre-commit run --all-files
```

### Manual Stripping

If you need to manually strip outputs from a notebook, run:

```bash
poetry run nbstripout notebooks/your_notebook.ipynb
```
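
To confirm that a notebook really was stripped, a quick check along these lines works (a hypothetical snippet; the path is a placeholder):

```python
import nbformat

# List code cells that still carry outputs or execution counts.
nb = nbformat.read("notebooks/your_notebook.ipynb", as_version=4)
dirty = [c for c in nb.cells
         if c.cell_type == "code" and (c.outputs or c.execution_count)]
print("clean" if not dirty else f"{len(dirty)} cell(s) still have outputs")
```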

## Best Practices

1. **Keep Large Data Outside Git**: Store large datasets separately (e.g., in the gitignored data/ directory).
2. **Avoid Embedding Large Files**: Don't embed videos, large images, or other binary data directly in notebooks.
3. **Document Data Sources**: Always include information on how to obtain the data your notebooks need.
4. **Separate Code and Content**: Use markdown cells to document your analysis thoroughly.

## Troubleshooting

If you encounter issues with the pre-commit hooks, ensure that:

- You have run `poetry install` to install all dependencies
- You have run `poetry run pre-commit install` to set up the hooks
- You are committing from within the Poetry environment, or use `poetry run git commit`

notebooks/meetings.ipynb

Lines changed: 247 additions & 599 deletions
Large diffs are not rendered by default.

notebooks/roll_call.ipynb

Lines changed: 7 additions & 147 deletions
```diff
@@ -11,7 +11,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
@@ -31,17 +31,9 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Clip successfully extracted to: ../data/video/regular_council_meeting___2025_02_26_clip_4-50_to_5-20.mp4\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "import subprocess\n",
     "from pathlib import Path\n",
@@ -97,126 +89,9 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 7,
+   "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "INFO:src.videos:Transcribing video with speaker diarization: ../data/video/regular_council_meeting___2025_02_26_clip_4-50_to_5-20.mp4\n",
-      "INFO:src.videos:Output will be saved to: ../data/transcripts/regular_council_meeting___2025_02_26_clip_4-50_to_5-20.diarized.json\n",
-      "INFO:src.huggingface:Auto-detected device: cpu\n",
-      "INFO:src.huggingface:Auto-selected compute_type: int8\n",
-      "INFO:src.huggingface:Loading WhisperX model: tiny on cpu with int8 precision\n"
-     ]
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "168afa65d3ae4108af591eb1993fe482",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "tokenizer.json: 0%| | 0.00/2.20M [00:00<?, ?B/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "89d35faecb8e447db3ccb95407e2a775",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "config.json: 0%| | 0.00/2.25k [00:00<?, ?B/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "f616039556ee46aaaee2f975f016aeb0",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "vocabulary.txt: 0%| | 0.00/460k [00:00<?, ?B/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "50bd4e88d6084638b91847587cc9ed0a",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "model.bin: 0%| | 0.00/75.5M [00:00<?, ?B/s]"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.0.post0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../../../Library/Caches/pypoetry/virtualenvs/tgov_scraper-zRR99ne3-py3.11/lib/python3.11/site-packages/whisperx/assets/pytorch_model.bin`\n",
-      "INFO:src.huggingface:Loading diarization pipeline\n"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "No language specified, language will be first be detected for each audio file (increases inference time).\n",
-      ">>Performing voice activity detection using Pyannote...\n",
-      "Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.\n",
-      "Model was trained with torch 1.10.0+cu102, yours is 2.4.1. Bad things might happen unless you revert torch to 1.x.\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "INFO:src.huggingface:WhisperX model loaded in 4.50 seconds\n",
-      "INFO:src.videos:Running initial transcription with batch size 8...\n"
-     ]
-    },
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "Detected language: en (0.99) in first 30s of audio...\n"
-     ]
-    },
-    {
-     "name": "stderr",
-     "output_type": "stream",
-     "text": [
-      "INFO:src.videos:Detected language: en\n",
-      "INFO:src.videos:Loading alignment model for detected language: en\n",
-      "INFO:src.videos:Aligning transcription with audio...\n",
-      "INFO:src.videos:Running speaker diarization...\n",
-      "/Users/owner/Library/Caches/pypoetry/virtualenvs/tgov_scraper-zRR99ne3-py3.11/lib/python3.11/site-packages/pyannote/audio/models/blocks/pooling.py:104: UserWarning: std(): degrees of freedom is <= 0. Correction should be strictly less than the reduction factor (input numel divided by output numel). (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/ReduceOps.cpp:1808.)\n",
-      "  std = sequences.std(dim=-1, correction=1)\n",
-      "INFO:src.videos:Assigning speakers to transcription...\n",
-      "INFO:src.videos:Processing transcription segments...\n",
-      "INFO:src.videos:Diarized transcription completed in 30.03 seconds\n",
-      "INFO:src.videos:Detailed JSON saved to: ../data/transcripts/regular_council_meeting___2025_02_26_clip_4-50_to_5-20.diarized.json\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "from src.videos import transcribe_video_with_diarization\n",
     "\n",
@@ -231,24 +106,9 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": null,
    "metadata": {},
-   "outputs": [
-    {
-     "data": {
-      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "5d97ff70c1c3409da83c10c478f2bfaa",
-       "version_major": 2,
-       "version_minor": 0
-      },
-      "text/plain": [
-       "HTML(value='<h3>Meeting Script</h3><hr><p><b>[00:00:00] SPEAKER_01:</b><br>Thank you, Mr. Huffinds. Any counci…"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    }
-   ],
+   "outputs": [],
    "source": [
     "def format_timestamp(seconds: float) -> str:\n",
     "    \"\"\"Convert seconds to HH:MM:SS format\"\"\"\n",
```

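The context lines end just as `format_timestamp` begins. For reference, a body consistent with that signature and docstring would look like the following (a hypothetical reconstruction, not necessarily the code in the commit):

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to HH:MM:SS format"""
    total = int(seconds)              # drop sub-second precision
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"
```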