@devin-ai-integration bot (Contributor) commented Oct 16, 2025

feat(multimodal): add Video support for Gemini with structured outputs

Summary

This PR adds full video support to Instructor for use with Google's Gemini models, addressing issue #1850. The implementation includes:

  • New Video classes: Added Video and VideoWithGenaiFile classes to handle video file upload and processing
  • Multimodal integration: Video support integrated into existing multimodal pipeline alongside Image, Audio, and PDF
  • Updated documentation: Refreshed the multimodal-gemini blog post with current API (from_provider instead of deprecated from_gemini)
  • Working example: Complete video extraction example in examples/video-extraction-gemini/

The Video class follows the same pattern as existing multimodal classes (Image, Audio, PDF) and supports:

  • Multiple video formats (MP4, MOV, WebM, AVI, etc.)
  • File upload to Gemini API via VideoWithGenaiFile.from_new_genai_file()
  • Google Cloud Storage URLs (gs://)
  • Base64 encoded video data
  • Automatic MIME type detection
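As a rough illustration of the automatic MIME type detection listed above, here is a minimal sketch using only the standard library. The helper name `guess_video_mime_type` and the set `ASSUMED_VIDEO_MIME_TYPES` are hypothetical and are not taken from the PR; the actual `VALID_VIDEO_MIME_TYPES` list in `multimodal.py` may differ.

```python
import mimetypes

# Assumed subset of accepted types; the PR's VALID_VIDEO_MIME_TYPES may differ.
ASSUMED_VIDEO_MIME_TYPES = {
    "video/mp4",
    "video/quicktime",
    "video/webm",
    "video/x-msvideo",
}


def guess_video_mime_type(source: str) -> str:
    """Guess a video MIME type from a file name or URL (hypothetical helper)."""
    # mimetypes.guess_type looks only at the extension; it never reads the file.
    mime_type, _ = mimetypes.guess_type(source)
    if mime_type not in ASSUMED_VIDEO_MIME_TYPES:
        raise ValueError(
            f"Unsupported or unknown video type for {source!r}: {mime_type}"
        )
    return mime_type
```

Extension-based guessing is cheap but can be fooled by a misnamed file, so a real implementation might also want to inspect the file contents.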

Review & Testing Checklist for Human

⚠️ CRITICAL - This code has NOT been tested with actual video files or API calls

  • Test video upload end-to-end: Run examples/video-extraction-gemini/run.py with a real video file to verify the upload flow works
  • Verify Gemini model supports video: Confirm that gemini-2.0-flash-exp (or appropriate model) actually supports video inputs - the blog post title mentions "2.5 Pro" but code uses "2.0-flash-exp"
  • Check API compatibility: Verify that VideoWithGenaiFile.from_new_genai_file() correctly uploads videos to Gemini API (modeled after PDFWithGenaiFile but untested)
  • Test structured extraction: Confirm that structured outputs work correctly with video content (the response_model pattern)
  • Review MIME types: Validate that the video MIME types in VALID_VIDEO_MIME_TYPES are actually supported by Gemini

Recommended Test Plan

  1. Set GOOGLE_API_KEY environment variable
  2. Download a sample travel video (e.g., from YouTube)
  3. Run: python examples/video-extraction-gemini/run.py sample_video.mp4
  4. Verify the video uploads successfully
  5. Confirm structured data is extracted correctly
  6. Test the blog post code snippets independently
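Steps 1 to 3 of the plan above could be guarded by a small preflight check before the example is run. This is only a sketch: `preflight_problems` is a hypothetical helper that is not part of the PR, and it checks nothing beyond the environment variable and the presence of the video file.

```python
import os
from pathlib import Path
from typing import List, Mapping


def preflight_problems(
    video_path: str, env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return a list of human-readable problems; an empty list means ready to run."""
    problems = []
    # Step 1 of the test plan: the API key must be present and non-empty.
    if not env.get("GOOGLE_API_KEY"):
        problems.append("GOOGLE_API_KEY is not set")
    # Steps 2 and 3: the sample video passed to run.py must exist on disk.
    if not Path(video_path).is_file():
        problems.append(f"video file not found: {video_path}")
    return problems
```

Calling `preflight_problems("sample_video.mp4")` before `run.py` and stopping on a non-empty result would separate setup mistakes from genuine upload failures.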

Notes

  • The implementation follows the same pattern as PDFWithGenaiFile for file uploads
  • Video support is Gemini-only (OpenAI and Anthropic raise NotImplementedError as they don't support video)
  • All linting checks pass
  • Resolves issue #1850, "Support for Video with Structured Output (Gemini)"
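The Gemini-only behavior described in the notes can be pictured with the following sketch. Every name here is hypothetical (`VIDEO_CAPABLE_PROVIDERS`, `video_to_provider_payload`), and the returned dictionary is a placeholder rather than the real Gemini request format or the PR's actual code.

```python
from typing import Any, Dict

# Assumption for illustration: only Gemini accepts video input.
VIDEO_CAPABLE_PROVIDERS = {"gemini"}


def video_to_provider_payload(provider: str, uri: str) -> Dict[str, Any]:
    """Convert a video reference for a provider, or fail loudly if unsupported."""
    if provider not in VIDEO_CAPABLE_PROVIDERS:
        raise NotImplementedError(
            f"Video input is not supported for provider {provider!r}"
        )
    # Placeholder shape only; not the actual Gemini wire format.
    return {"provider": provider, "video_uri": uri}
```

Raising `NotImplementedError` at conversion time surfaces the limitation before any request is sent to a provider that cannot handle it.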

Session: https://app.devin.ai/sessions/68339918c75c44ac8f6e3813937a0517
Requested by: Jason Liu (@jxnl)


Important

Adds video support for Gemini models with new Video classes, integrating into the multimodal pipeline and updating documentation and examples.

  • Behavior:
    • Adds Video and VideoWithGenaiFile classes in multimodal.py for video file handling and processing.
    • Integrates video support into the multimodal pipeline alongside Image, Audio, and PDF.
    • Supports multiple video formats (MP4, MOV, WebM, AVI, etc.), Google Cloud Storage URLs, and Base64 encoded data.
    • Automatic MIME type detection for videos.
  • Documentation:
    • Updates multimodal-gemini.md to use from_provider instead of from_gemini.
    • Adds a complete video extraction example in examples/video-extraction-gemini/.
  • Misc:
    • Adds Video and VideoWithGenaiFile to __init__.py exports.

This description was created by Ellipsis for 6b0236d. It will automatically update as commits are pushed.

- Add Video and VideoWithGenaiFile classes to multimodal.py
- Support video file upload and processing with Gemini API
- Export Video classes from instructor package
- Update multimodal-gemini blog post with current API
- Add complete video extraction example with Gemini 2.5 Pro
- Update autodetect_media to handle video files
- Support multiple video formats (MP4, MOV, WebM, etc.)

Resolves #1850

Co-Authored-By: Jason Liu <jason@jxnl.co>
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions bot added labels on Oct 16, 2025: documentation (Improvements or additions to documentation), enhancement (New feature or request), python (Pull requests that update python code), size:M (This PR changes 30-99 lines, ignoring generated files).
Contributor

@ellipsis-dev bot left a comment

Caution

Changes requested ❌

Reviewed everything up to 6b0236d in 2 minutes and 18 seconds.
  • Reviewed 722 lines of code in 5 files
  • Skipped 0 files when reviewing.
  • Skipped posting 8 draft comments. View those below.
1. instructor/processing/multimodal.py:540
  • Draft comment:
    Avoid using 'assert' for MIME type validation; use explicit exception raising so that error checks aren’t skipped in optimized mode.
  • Reason this comment was not posted:
    Confidence that changes are required (80%) did not exceed the 85% threshold.
2. instructor/processing/multimodal.py:557
  • Draft comment:
    Avoid using 'assert' for MIME type checks in from_path(); prefer raising a ValueError for clearer error handling.
  • Reason this comment was not posted:
    Confidence that changes are required (80%) did not exceed the 85% threshold.
3. instructor/processing/multimodal.py:611
  • Draft comment:
    Ensure 'self.data' is not None before performing base64 decoding in to_genai() to avoid runtime errors.
  • Reason this comment was not posted:
    Confidence that changes are required (80%) did not exceed the 85% threshold.
4. instructor/__init__.py:4
  • Draft comment:
    Typographical Notice: The new import includes VideoWithGenaiFile. Considering the commit title mentions Gemini, please verify if the class name should be VideoWithGeminiFile for consistency.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 85% The comment makes an assumption based on a commit title without full context. Looking at the code, there's a clear distinction between Gemini and Genai providers. VideoWithGenaiFile likely refers to Google's Genai specifically, not Gemini. The name appears intentional and correct. Additionally, suggesting renames based on commit messages rather than code context violates our review rules. I could be wrong about the distinction between Genai and Gemini - they might be the same thing with different naming conventions. The commit message might have important context I'm missing. Even if Genai and Gemini are related, the codebase clearly uses both terms distinctly in different places. We should trust the author's naming choice without more context. Delete this comment. The name appears intentional and correct within the codebase's naming patterns, and suggesting renames based on commit messages rather than code violates our review rules.
5. instructor/__init__.py:42
  • Draft comment:
    Typo check: The identifier "VideoWithGenaiFile" might be intended to be "VideoWithGenAIFile" for consistent capitalization. Please verify if this naming is correct.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 85% While the comment points out a potential naming inconsistency, without seeing the actual multimodal.py file where this class is defined, we can't be certain if this is actually inconsistent with the codebase's conventions. The comment is also just asking for verification rather than definitively stating a problem. Many codebases use "ai" lowercase in their naming. I might be too quick to dismiss this - inconsistent capitalization across a codebase can lead to confusion. Also, "AI" is very commonly capitalized in most codebases. However, one of our key rules is to not make speculative comments or ask for verification. This comment starts with "Please verify" which is explicitly against our rules. The comment should be deleted because it's asking for verification rather than definitively pointing out an issue, and we don't have enough context to be certain about the codebase's naming conventions.
6. instructor/processing/multimodal.py:47
  • Draft comment:
    Typo detection: The MIME type 'video/mov' may be incorrect. Typically, the official MIME type for QuickTime movies is 'video/quicktime'. Consider removing or correcting 'video/mov' to avoid duplication with 'video/quicktime'.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 85% While 'video/quicktime' is indeed the official MIME type for QuickTime movies, 'video/mov' is also commonly used in practice. Having both provides better compatibility. The comment is technically correct about the official standard, but removing 'video/mov' could break compatibility with systems that use this unofficial but common MIME type. The comment focuses on technical correctness but ignores practical compatibility concerns. Also, having both MIME types isn't necessarily harmful. While the comment is technically accurate about the official MIME type, keeping both variants improves real-world compatibility without causing issues. The comment should be deleted because removing 'video/mov' could reduce compatibility, and having both MIME types doesn't cause problems.
7. instructor/processing/multimodal.py:508
  • Draft comment:
    Typo in the autodetect_safely docstring: the parameter type is written as 'Union[str,path]'. Consider adding a space after the comma, e.g., 'Union[str, Path]'.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 85% While the comment is technically correct about the formatting, it's a very minor issue that doesn't affect functionality. The review rules emphasize focusing on substantive issues and not making purely informative comments. Additionally, this is part of new code being added in the diff, but it's such a minor formatting issue that it doesn't warrant a comment. The comment is accurate and points to a real inconsistency in formatting. Consistent formatting can make code more readable and maintainable. While code consistency is valuable, this is an extremely minor formatting issue that doesn't impact code understanding or functionality. The review rules specifically say not to make comments that are obvious or unimportant. The comment should be deleted as it addresses a trivial formatting issue that doesn't meaningfully impact code quality or understanding.
8. instructor/processing/multimodal.py:622
  • Draft comment:
    Inconsistent terminology: The docstring for from_new_genai_file mentions 'uploading to Gemini API' while the integration and other parts refer to 'GenAI'. Consider standardizing the terminology for clarity.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 85% While consistency in terminology is generally good, this seems like a minor issue. The terms Gemini and GenAI both refer to Google's AI offerings and are commonly used interchangeably. The code still functions correctly regardless of which term is used. The comment doesn't point out a bug or suggest an important improvement. The comment does identify a real inconsistency in terminology. Having consistent terminology could make the code more maintainable and easier to understand. However, both terms are widely recognized in the context of Google's AI services. This is a very minor stylistic issue that doesn't impact functionality or clarity significantly. The comment should be deleted as it points out a minor stylistic issue that doesn't meaningfully impact code quality or clarity.

Workflow ID: wflow_tPoxs9xWjADvEK7C


        VideoRecommendations object containing structured data
    """
    client = instructor.from_provider(
        "google/gemini-2.0-flash-exp",
Contributor

Model version mismatch: the title mentions Gemini 2.5 Pro but the provider string is 'google/gemini-2.0-flash-exp'. Ensure consistency.

Suggested change
- "google/gemini-2.0-flash-exp",
+ "google/gemini-2.5-pro",

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

| Status | Name | Latest Commit | Updated (UTC) |
| --- | --- | --- | --- |
| ✅ Deployment successful! | instructor | 6b0236d | Oct 16 2025, 12:54 PM |

