Skip to content

Conversation

@afg1
Copy link
Contributor

@afg1 afg1 commented Nov 26, 2025

For a long time the RNAcentral import pipeline container was built as a single container, from one massive Dockerfile. This introduces a number of problems:

  • It takes ages to build (~25-30m) because everything is compiled every time
  • It becomes more difficult to change tool dependencies/versions
  • The container is very big (>3GB)

The proper way to tackle this is to make a multi-part build, in which we build bits of the pipeline in intermediate containers and pull the build artefacts and their dependencies ONLY into the final image.

This should:

  • Mean we only build tools when we change the versions (e.g. infernal version hasn't changed in years)
  • Build tools (e.g. cargo, gcc) are not included in the final container, reducing its size
  • The build process is faster, since we build once and pull several times

This PR should implement a multi-part build of the pipeline container, and separates out tools into their own containers, from which we can pull the required build artefacts.

I've tested the new workflows, and the build now completes in 9 or so minutes, most of which time is spent pulling the docker image from the hub to convert it to singularity. We also now have explicit tool containers with tool versions attached that we can update independently if needed. Only the full on-push workflow hasn't been tested yet, because it needs to be triggered on the dev branch

@afg1 afg1 self-assigned this Nov 26, 2025
@afg1 afg1 requested review from blakesweeney and pmb59 November 26, 2025 11:30
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @afg1, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request fundamentally re-architects the RNAcentral import pipeline's container build process. By transitioning from a single, large Dockerfile to a multi-stage, multi-container approach, the aim is to drastically cut down on build times, reduce the final image footprint, and enhance maintainability by isolating tool dependencies. This change optimizes the development and deployment workflow for the pipeline.

Highlights

  • Multi-stage Dockerfile Implementation: The monolithic Dockerfile for the RNAcentral import pipeline has been refactored into a multi-stage build process, significantly improving build efficiency and reducing the final image size.
  • Tool Containerization: Key tools like Infernal, Samtools/HTSlib, and custom Rust utilities are now built in their own dedicated Docker containers. This allows for independent updates and prevents build tools from being included in the final pipeline image.
  • Reduced Build Times and Image Size: The new multi-stage approach is expected to reduce build times from 25-30 minutes to approximately 9 minutes and decrease the final container size by 30-40% by only pulling necessary build artifacts into the final image.
  • Streamlined Python Environment: A dedicated Python environment builder stage is introduced to manage Python dependencies using uv, ensuring a clean and frozen environment for the final image.
  • Makefile Enhancements: The containers/Makefile has been updated with new targets to build, push, and test the newly created tool containers (Infernal, Samtools, Rust utilities), facilitating their management.
Ignored Files
  • Ignored by pattern: .github/workflows/** (3)
    • .github/workflows/main.yaml
    • .github/workflows/rust-container.yaml
    • .github/workflows/tool-containers.yaml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This is an excellent and much-needed refactoring of the pipeline's Docker build process. Converting the monolithic Dockerfile to a multi-stage, multi-container setup is a great improvement for build speed, image size, and maintainability. The use of separate, versioned tool containers and a dedicated Makefile for managing them is a solid approach. My review includes several suggestions to further refine this new structure, focusing on reducing image sizes even more by using smaller base images, improving the robustness of build scripts, and ensuring consistency across the new Makefiles and Dockerfiles.

Copy link

@pmb59 pmb59 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@afg1 LTGM overall. My only major comment is that we are not following any container version strategy, only latest https://hub.docker.com/r/rnacentral/rnacentral-import-pipeline/tags
Only using latest (No version history, can't rollback) it's not good. We should tag versions, and pin a specific version into the production environment, unless this approach is not feasible for any reason

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR converts the RNAcentral import pipeline container from a monolithic single-stage build to a multi-stage build architecture. The refactoring separates bioinformatics tools (Infernal, Samtools, Rust utilities) into independent pre-built containers, significantly reducing build time from ~25-30 minutes to ~9 minutes and decreasing the final image size by 30-40%.

Key Changes:

  • Introduced separate tool containers (Infernal 1.1.2, Samtools 1.18, Rust utilities) with multi-stage builds
  • Modified main Dockerfile to pull pre-built binaries from tool containers instead of compiling from source
  • Added GitHub Actions workflows to build and push tool containers independently with change detection
  • Updated build configuration to use explicit platform targeting (linux/amd64)

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
containers/infernal/Dockerfile New multi-stage build for Infernal 1.1.2 bioinformatics tool suite
containers/samtools/Dockerfile New multi-stage build for Samtools + HTSlib 1.18 with runtime libraries only
containers/rust-utils/Dockerfile New multi-stage build compiling all 8 Rust CLI utilities in release mode
containers/Makefile Added build targets for tool containers with version management and testing
Makefile Updated to use docker buildx with explicit linux/amd64 platform
Dockerfile Converted to multi-stage build pulling from tool containers; removed build dependencies
Cargo.toml Changed workspace members from wildcard to explicit list for better dependency tracking
.github/workflows/tool-containers.yaml New workflow for building Infernal and Samtools containers with change detection
.github/workflows/rust-container.yaml New workflow for building Rust utilities container on code changes
.github/workflows/main.yaml Added synchronization to wait for Rust container build before main pipeline build
.dockerignore Restructured to exclude build artifacts, tests, and development files

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@afg1
Copy link
Contributor Author

afg1 commented Dec 12, 2025

/gemini review

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request is a fantastic improvement to the project's containerization strategy. By converting the monolithic Dockerfile to a multi-stage, multi-container build process, you've addressed several key issues: build speed, image size, and dependency management. The separation of tools into their own versioned, buildable containers is a robust and maintainable approach. The use of VERSION files and a well-structured Makefile in the containers directory for orchestration is excellent.

My review includes a few medium-severity suggestions aimed at further improving maintainability and minimizing the final image size. These include centralizing version numbers in the tool Dockerfiles using build arguments and questioning the inclusion of package managers in the final runtime image.

Overall, this is a high-quality contribution that significantly modernizes the build pipeline.

@afg1
Copy link
Contributor Author

afg1 commented Dec 12, 2025

@pmb59 I think I addressed all the things found by the bots, and your concern about versioning. We now have:

  • specified tool versions in the containers/tool/VERSION file
  • Version specific tagging for each tool that should allow rollbacks
  • Specific tool versions used in the main pipeline builder, conigured by the containers/tool/VERSION file
  • A VERSION file for the overall pipeline container
  • Build time around 7-9 minutes, compared to about 35 previously
  • Smaller container, by about 2GB

I'm pretty confident this will all work, but the automations can't be properly tested until the workflows exist on the main branch.

If you're happy, I will go ahead and merge this PR

Copy link

@pmb59 pmb59 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@afg1 LGTM overall, thanks for addressing the issues. Hopefully the versioning will be useful

@afg1 afg1 merged commit 5f8d886 into dev Dec 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants