-
Notifications
You must be signed in to change notification settings - Fork 1
Convert pipeline container to a multi-part build #220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…container wait for a rust build if needed
Summary of ChangesHello @afg1, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request fundamentally re-architects the RNAcentral import pipeline's container build process. By transitioning from a single, large Dockerfile to a multi-stage, multi-container approach, the aim is to drastically cut down on build times, reduce the final image footprint, and enhance maintainability by isolating tool dependencies. This change optimizes the development and deployment workflow for the pipeline. Highlights
Ignored Files
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This is an excellent and much-needed refactoring of the pipeline's Docker build process. Converting the monolithic Dockerfile to a multi-stage, multi-container setup is a great improvement for build speed, image size, and maintainability. The use of separate, versioned tool containers and a dedicated Makefile for managing them is a solid approach. My review includes several suggestions to further refine this new structure, focusing on reducing image sizes even more by using smaller base images, improving the robustness of build scripts, and ensuring consistency across the new Makefiles and Dockerfiles.
pmb59
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@afg1 LTGM overall. My only major comment is that we are not following any container version strategy, only latest https://hub.docker.com/r/rnacentral/rnacentral-import-pipeline/tags
Only using latest (No version history, can't rollback) it's not good. We should tag versions, and pin a specific version into the production environment, unless this approach is not feasible for any reason
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR converts the RNAcentral import pipeline container from a monolithic single-stage build to a multi-stage build architecture. The refactoring separates bioinformatics tools (Infernal, Samtools, Rust utilities) into independent pre-built containers, significantly reducing build time from ~25-30 minutes to ~9 minutes and decreasing the final image size by 30-40%.
Key Changes:
- Introduced separate tool containers (Infernal 1.1.2, Samtools 1.18, Rust utilities) with multi-stage builds
- Modified main Dockerfile to pull pre-built binaries from tool containers instead of compiling from source
- Added GitHub Actions workflows to build and push tool containers independently with change detection
- Updated build configuration to use explicit platform targeting (linux/amd64)
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
containers/infernal/Dockerfile |
New multi-stage build for Infernal 1.1.2 bioinformatics tool suite |
containers/samtools/Dockerfile |
New multi-stage build for Samtools + HTSlib 1.18 with runtime libraries only |
containers/rust-utils/Dockerfile |
New multi-stage build compiling all 8 Rust CLI utilities in release mode |
containers/Makefile |
Added build targets for tool containers with version management and testing |
Makefile |
Updated to use docker buildx with explicit linux/amd64 platform |
Dockerfile |
Converted to multi-stage build pulling from tool containers; removed build dependencies |
Cargo.toml |
Changed workspace members from wildcard to explicit list for better dependency tracking |
.github/workflows/tool-containers.yaml |
New workflow for building Infernal and Samtools containers with change detection |
.github/workflows/rust-container.yaml |
New workflow for building Rust utilities container on code changes |
.github/workflows/main.yaml |
Added synchronization to wait for Rust container build before main pipeline build |
.dockerignore |
Restructured to exclude build artifacts, tests, and development files |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Pedro Madrigal <8195212+pmb59@users.noreply.github.com>
…ntral-import-pipeline into docker-multi-part-build
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request is a fantastic improvement to the project's containerization strategy. By converting the monolithic Dockerfile to a multi-stage, multi-container build process, you've addressed several key issues: build speed, image size, and dependency management. The separation of tools into their own versioned, buildable containers is a robust and maintainable approach. The use of VERSION files and a well-structured Makefile in the containers directory for orchestration is excellent.
My review includes a few medium-severity suggestions aimed at further improving maintainability and minimizing the final image size. These include centralizing version numbers in the tool Dockerfiles using build arguments and questioning the inclusion of package managers in the final runtime image.
Overall, this is a high-quality contribution that significantly modernizes the build pipeline.
|
@pmb59 I think I addressed all the things found by the bots, and your concern about versioning. We now have:
I'm pretty confident this will all work, but the automations can't be properly tested until the workflows exist on the main branch. If you're happy, I will go ahead and merge this PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@afg1 LGTM overall, thanks for addressing the issues. Hopefully the versioning will be useful
For a long time the RNAcentral import pipeline container was built as a single container, from one massive Dockerfile. This introduces a number of problems:
The proper way to tackle this is to make a multi-part build, in which we build bits of the pipeline in intermediate containers and pull the build artefacts and their dependencies ONLY into the final image.
This should:
This PR should implement a multi-part build of the pipeline container, and separates out tools into their own containers, from which we can pull the required build artefacts.
I've tested the new workflows, and the build now completes in 9 or so minutes, most of which time is spent pulling the docker image from the hub to convert it to singularity. We also now have explicit tool containers with tool versions attached that we can update independently if needed. Only the full on-push workflow hasn't been tested yet, because it needs to be triggered on the dev branch