feat(multimodal): add Video support for Gemini with structured outputs #1851

devin-ai-integration · 2025-10-16T12:46:52Z

feat(multimodal): add Video support for Gemini with structured outputs

Summary

This PR adds full video support to Instructor for use with Google's Gemini models, addressing issue #1850. The implementation includes:

- New Video classes: Added Video and VideoWithGenaiFile classes to handle video file upload and processing
- Multimodal integration: Video support integrated into existing multimodal pipeline alongside Image, Audio, and PDF
- Updated documentation: Refreshed the multimodal-gemini blog post with current API (from_provider instead of deprecated from_gemini)
- Working example: Complete video extraction example in examples/video-extraction-gemini/

The Video class follows the same pattern as existing multimodal classes (Image, Audio, PDF) and supports:

- Multiple video formats (MP4, MOV, WebM, AVI, etc.)
- File upload to Gemini API via VideoWithGenaiFile.from_new_genai_file()
- Google Cloud Storage URLs (gs://)
- Base64 encoded video data
- Automatic MIME type detection
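
The input forms listed above can be sketched roughly as follows. This is a standalone illustration and not the code added in this PR: the helper names `guess_video_mime` and `classify_video_source`, the extension table, and the use of a `data:video/` prefix to recognise base64 input are all assumptions made for the example.

```python
import mimetypes
from pathlib import PurePosixPath
from typing import Dict

# Hypothetical extension table for illustration only; the PR's actual
# VALID_VIDEO_MIME_TYPES list may contain different entries.
EXTENSION_TO_MIME: Dict[str, str] = {
    ".mp4": "video/mp4",
    ".mov": "video/quicktime",
    ".webm": "video/webm",
    ".avi": "video/x-msvideo",
}


def guess_video_mime(source: str) -> str:
    """Guess a video MIME type from the suffix of a file name or URL."""
    suffix = PurePosixPath(source).suffix.lower()
    if suffix in EXTENSION_TO_MIME:
        return EXTENSION_TO_MIME[suffix]
    # Fall back to the standard library's table for other extensions.
    guessed, _ = mimetypes.guess_type(source)
    if guessed and guessed.startswith("video/"):
        return guessed
    raise ValueError(f"Cannot determine a video MIME type for {source!r}")


def classify_video_source(source: str) -> str:
    """Label a source string as a GCS URL, a base64 data URI, or a local path."""
    if source.startswith("gs://"):
        return "gcs"
    if source.startswith("data:video/"):  # assumed convention for base64 input
        return "base64"
    return "path"
```

Raising `ValueError` for unknown extensions keeps a non-video file from being uploaded silently.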

Review & Testing Checklist for Human

⚠️ CRITICAL - This code has NOT been tested with actual video files or API calls

- Test video upload end-to-end: Run examples/video-extraction-gemini/run.py with a real video file to verify the upload flow works
- Verify Gemini model supports video: Confirm that gemini-2.0-flash-exp (or another appropriate model) actually supports video inputs. The blog post title mentions "2.5 Pro" but the code uses "2.0-flash-exp"
- Check API compatibility: Verify that VideoWithGenaiFile.from_new_genai_file() correctly uploads videos to the Gemini API (modeled after PDFWithGenaiFile but untested)
- Test structured extraction: Confirm that structured outputs work correctly with video content (the response_model pattern)
- Review MIME types: Validate that the video MIME types in VALID_VIDEO_MIME_TYPES are actually supported by Gemini

Recommended Test Plan

1. Set the GOOGLE_API_KEY environment variable
2. Download a sample travel video (e.g., from YouTube)
3. Run: python examples/video-extraction-gemini/run.py sample_video.mp4
4. Verify the video uploads successfully
5. Confirm structured data is extracted correctly
6. Test the blog post code snippets independently
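
Before running the plan above, a small pre-flight check can catch the two most likely setup mistakes: a missing API key and a wrong file path. The sketch below is not part of the PR; the function name `preflight` and its return format are assumptions for illustration.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def preflight(video_path: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return a list of problems that would block the manual test run."""
    env = os.environ if env is None else env
    problems: List[str] = []
    # The test plan requires GOOGLE_API_KEY to be set.
    if not env.get("GOOGLE_API_KEY"):
        problems.append("GOOGLE_API_KEY is not set")
    # The example script is given a local video file as its argument.
    if not Path(video_path).is_file():
        problems.append(f"video file not found: {video_path}")
    return problems
```

An empty list means both preconditions hold. It says nothing about whether the upload or the extraction itself will succeed.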

Notes

- The implementation follows the same pattern as PDFWithGenaiFile for file uploads
- Video support is Gemini-only (OpenAI and Anthropic raise NotImplementedError as they don't support video)
- All linting checks pass
- Resolves issue #1850, "Support for Video with Structured Output (Gemini)"
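
The Gemini-only behaviour noted above could look something like the following. This is a hypothetical stand-in, not the PR's Video class: the class name `VideoSketch`, the `to_openai` and `to_anthropic` method names, and the shape of the dictionary returned by `to_genai` are assumptions for illustration.

```python
from typing import Dict


class VideoSketch:
    """Hypothetical stand-in for a Gemini-only multimodal video type."""

    def __init__(self, source: str, media_type: str) -> None:
        self.source = source
        self.media_type = media_type

    def to_openai(self) -> Dict[str, str]:
        # Fail loudly rather than silently dropping the video.
        raise NotImplementedError("Video input is not supported for OpenAI")

    def to_anthropic(self) -> Dict[str, str]:
        raise NotImplementedError("Video input is not supported for Anthropic")

    def to_genai(self) -> Dict[str, str]:
        # Placeholder payload; the real Gemini request format is not shown here.
        return {"file_uri": self.source, "mime_type": self.media_type}
```

Raising `NotImplementedError` makes an unsupported provider fail at conversion time instead of sending a request without the video.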

Session: https://app.devin.ai/sessions/68339918c75c44ac8f6e3813937a0517
Requested by: Jason Liu (@jxnl)

Important

Adds video support for Gemini models with new Video classes, integrating into the multimodal pipeline and updating documentation and examples.

Behavior:
- Adds Video and VideoWithGenaiFile classes in multimodal.py for video file handling and processing.
- Integrates video support into the multimodal pipeline alongside Image, Audio, and PDF.
- Supports multiple video formats (MP4, MOV, WebM, AVI, etc.), Google Cloud Storage URLs, and Base64 encoded data.
- Automatic MIME type detection for videos.
Documentation:
- Updates multimodal-gemini.md to use from_provider instead of from_gemini.
- Adds a complete video extraction example in examples/video-extraction-gemini/.
Misc:
- Adds Video and VideoWithGenaiFile to __init__.py exports.

This description was created by Ellipsis for 6b0236d. You can customize this summary. It will automatically update as commits are pushed.

- Add Video and VideoWithGenaiFile classes to multimodal.py
- Support video file upload and processing with Gemini API
- Export Video classes from instructor package
- Update multimodal-gemini blog post with current API
- Add complete video extraction example with Gemini 2.5 Pro
- Update autodetect_media to handle video files
- Support multiple video formats (MP4, MOV, WebM, etc.)

Resolves #1850

Co-Authored-By: Jason Liu <jason@jxnl.co>

devin-ai-integration · 2025-10-16T12:47:00Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

- Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
- Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

Disable automatic comment and CI monitoring

ellipsis-dev

Caution

Changes requested ❌

Reviewed everything up to 6b0236d in 2 minutes and 18 seconds. Click for details.

Reviewed 722 lines of code in 5 files
Skipped 0 files when reviewing.
Skipped posting 8 draft comments. View those below.
Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

1. instructor/processing/multimodal.py:540

Draft comment:
Avoid using 'assert' for MIME type validation; use explicit exception raising so that error checks aren’t skipped in optimized mode.
Reason this comment was not posted:
Confidence changes required: 80% <= threshold 85% None

2. instructor/processing/multimodal.py:557

Draft comment:
Avoid using 'assert' for MIME type checks in from_path(); prefer raising a ValueError for clearer error handling.
Reason this comment was not posted:
Confidence changes required: 80% <= threshold 85% None

3. instructor/processing/multimodal.py:611

Draft comment:
Ensure 'self.data' is not None before performing base64 decoding in to_genai() to avoid runtime errors.
Reason this comment was not posted:
Confidence changes required: 80% <= threshold 85% None

4. instructor/__init__.py:4

Draft comment:
Typographical Notice: The new import includes VideoWithGenaiFile. Considering the commit title mentions Gemini, please verify if the class name should be VideoWithGeminiFile for consistency.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 85% The comment makes an assumption based on a commit title without full context. Looking at the code, there's a clear distinction between Gemini and Genai providers. VideoWithGenaiFile likely refers to Google's Genai specifically, not Gemini. The name appears intentional and correct. Additionally, suggesting renames based on commit messages rather than code context violates our review rules. I could be wrong about the distinction between Genai and Gemini - they might be the same thing with different naming conventions. The commit message might have important context I'm missing. Even if Genai and Gemini are related, the codebase clearly uses both terms distinctly in different places. We should trust the author's naming choice without more context. Delete this comment. The name appears intentional and correct within the codebase's naming patterns, and suggesting renames based on commit messages rather than code violates our review rules.

5. instructor/__init__.py:42

Draft comment:
Typo check: The identifier "VideoWithGenaiFile" might be intended to be "VideoWithGenAIFile" for consistent capitalization. Please verify if this naming is correct.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 85% While the comment points out a potential naming inconsistency, without seeing the actual multimodal.py file where this class is defined, we can't be certain if this is actually inconsistent with the codebase's conventions. The comment is also just asking for verification rather than definitively stating a problem. Many codebases use "ai" lowercase in their naming. I might be too quick to dismiss this - inconsistent capitalization across a codebase can lead to confusion. Also, "AI" is very commonly capitalized in most codebases. However, one of our key rules is to not make speculative comments or ask for verification. This comment starts with "Please verify" which is explicitly against our rules. The comment should be deleted because it's asking for verification rather than definitively pointing out an issue, and we don't have enough context to be certain about the codebase's naming conventions.

6. instructor/processing/multimodal.py:47

Draft comment:
Typo detection: The MIME type 'video/mov' may be incorrect. Typically, the official MIME type for QuickTime movies is 'video/quicktime'. Consider removing or correcting 'video/mov' to avoid duplication with 'video/quicktime'.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 85% While 'video/quicktime' is indeed the official MIME type for QuickTime movies, 'video/mov' is also commonly used in practice. Having both provides better compatibility. The comment is technically correct about the official standard, but removing 'video/mov' could break compatibility with systems that use this unofficial but common MIME type. The comment focuses on technical correctness but ignores practical compatibility concerns. Also, having both MIME types isn't necessarily harmful. While the comment is technically accurate about the official MIME type, keeping both variants improves real-world compatibility without causing issues. The comment should be deleted because removing 'video/mov' could reduce compatibility, and having both MIME types doesn't cause problems.

7. instructor/processing/multimodal.py:508

Draft comment:
Typo in the autodetect_safely docstring: the parameter type is written as 'Union[str,path]'. Consider adding a space after the comma, e.g., 'Union[str, Path]'.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 85% While the comment is technically correct about the formatting, it's a very minor issue that doesn't affect functionality. The review rules emphasize focusing on substantive issues and not making purely informative comments. Additionally, this is part of new code being added in the diff, but it's such a minor formatting issue that it doesn't warrant a comment. The comment is accurate and points to a real inconsistency in formatting. Consistent formatting can make code more readable and maintainable. While code consistency is valuable, this is an extremely minor formatting issue that doesn't impact code understanding or functionality. The review rules specifically say not to make comments that are obvious or unimportant. The comment should be deleted as it addresses a trivial formatting issue that doesn't meaningfully impact code quality or understanding.

8. instructor/processing/multimodal.py:622

Draft comment:
Inconsistent terminology: The docstring for from_new_genai_file mentions 'uploading to Gemini API' while the integration and other parts refer to 'GenAI'. Consider standardizing the terminology for clarity.
Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 85% While consistency in terminology is generally good, this seems like a minor issue. The terms Gemini and GenAI both refer to Google's AI offerings and are commonly used interchangeably. The code still functions correctly regardless of which term is used. The comment doesn't point out a bug or suggest an important improvement. The comment does identify a real inconsistency in terminology. Having consistent terminology could make the code more maintainable and easier to understand. However, both terms are widely recognized in the context of Google's AI services. This is a very minor stylistic issue that doesn't impact functionality or clarity significantly. The comment should be deleted as it points out a minor stylistic issue that doesn't meaningfully impact code quality or clarity.
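
Draft comments 1 to 3 above concern validation style. The sketch below shows the general idea of raising explicit exceptions instead of using `assert`, and of checking for missing data before base64 decoding. It is not the code under review: the class name `VideoPayload`, its fields, and the contents of the allow-list are assumptions for illustration.

```python
import base64
from dataclasses import dataclass
from typing import Optional

# Hypothetical allow-list; the PR's VALID_VIDEO_MIME_TYPES may differ.
VALID_VIDEO_MIME_TYPES = {"video/mp4", "video/quicktime", "video/webm"}


@dataclass
class VideoPayload:
    media_type: str
    data: Optional[str] = None  # base64-encoded bytes, if any were loaded

    def __post_init__(self) -> None:
        # An explicit raise still runs under `python -O`, where `assert`
        # statements are stripped.
        if self.media_type not in VALID_VIDEO_MIME_TYPES:
            raise ValueError(f"Unsupported video format: {self.media_type}")

    def decoded(self) -> bytes:
        # Guard against missing data before attempting to decode it.
        if self.data is None:
            raise ValueError("No video data available to decode")
        return base64.b64decode(self.data)
```

With this shape, an unsupported MIME type fails at construction time and a missing payload produces a clear `ValueError` rather than a `TypeError` from `b64decode`.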

Workflow ID: wflow_tPoxs9xWjADvEK7C


ellipsis-dev · 2025-10-16T12:49:15Z

examples/video-extraction-gemini/run.py

+        VideoRecommendations object containing structured data
+    """
+    client = instructor.from_provider(
+        "google/gemini-2.0-flash-exp",


Model version mismatch: the title mentions Gemini 2.5 Pro but the provider string is 'google/gemini-2.0-flash-exp'. Ensure consistency.

Suggested change

-        "google/gemini-2.0-flash-exp",
+        "google/gemini-2.5-pro",

cloudflare-workers-and-pages · 2025-10-16T12:54:41Z

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status	Name	Latest Commit	Updated (UTC)
✅ Deployment successful! View logs	instructor	`6b0236d`	Oct 16 2025, 12:54 PM

devin-ai-integration bot assigned jxnl Oct 16, 2025

devin-ai-integration bot requested a review from jxnl October 16, 2025 12:46

github-actions bot added the labels documentation, enhancement, python, and size:M (This PR changes 30-99 lines, ignoring generated files.) Oct 16, 2025

ellipsis-dev bot reviewed Oct 16, 2025
