# GitHub Action to build and upload Conda package whenever a new release is made #67
base: `main`

Changes from all commits: `3d27932`, `eb77b89`, `50af3d9`, `070922f`, `874fb2c`, `a25953c`, `8af9804`, `1ccf2de`, `b569af1`, `c20e680`, `f188bf9`, `c6c94b8`, `3435668`, `67fe6da`, `62944f5`, `d77f238`, `f1674cb`
The new workflow file (`@@ -0,0 +1,33 @@`):

```yaml
name: Build and Upload Conda Package When Release is Published

# Triggers when a release is made (in our case by the run_deploy_on_merge.yml action)
on:
  workflow_dispatch:
  release:
    types: [published]

env:
  ANACONDA_API_TOKEN: ${{ secrets.CONDA_TOKEN }}
  PHG2_VERSION: ${{ vars.PHG2_VERSION }}
  PHG2_RELEASE: ${{ vars.PHG2_RELEASE }}
  PHG2_VERSION_MD5: ${{ vars.PHG2_VERSION_MD5 }}

jobs:
  build-upload-conda:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Setup Conda
        uses: conda-incubator/setup-miniconda@v3
        with:
          auto-update-conda: true
          python-version: 3.11
```

> **Review comment:** Is there a reason to use Python 3.11 versus something newer?
>
> **Reply:** The newest version is 3.12 (3.13 is still prerelease), so we are pretty "bleeding edge" here. The Python version doesn't really matter all that much for this anyway; I simply included it so that it's explicit in case it breaks in the future. Hugging Face uses 3.8 in their v2 usage. If you go to the
>
> **Reply:** My mistake; I was thinking this was 2.11, not 3.11. Ignore my comment.

```yaml
          channels: conda-forge
      - name: Install Conda build dependencies
        run: |
          conda install -c defaults anaconda-client conda-build
          conda config --set anaconda_upload yes
      - name: Build and upload package
        run: |
          conda build ./conda
```
The existing release workflow gains steps to publish the version variables (`@@ -152,3 +152,23 @@ jobs:`):

```yaml
            ${{ steps.matching_release.outputs.body }}
            ${{ env.COMMIT_MSG }}

      # Update repository variables based on new release version and its MD5 hash
      # These are used by the Conda package build Action
      - uses: action-pack/set-variable@v1
        with:
          name: 'PHG2_VERSION'
          value: '${{ env.VERSION }}'
          token: ${{ secrets.PHGV2CD }}
```

> **Review comment:** I realize `secrets.<>` is used in existing code. Where are the secrets values stored (i.e., what file is accessed)? Do we have these created for maizegenetics.net, or is this per project?
>
> **Reply:** Never mind; I see from your Slack posting this was generated for maize-genetics.

```yaml
      - uses: action-pack/set-variable@v1
        with:
          name: 'PHG2_RELEASE'
          value: '${{ env.RELEASE }}'
          token: ${{ secrets.PHGV2CD }}
      - name: Calculate MD5 of package
        run: |
          echo "PHG2_VERSION_MD5=$(md5sum PHGv2-v${{ env.RELEASE }}.tar | cut -d ' ' -f 1)" >> $GITHUB_ENV
      - uses: action-pack/set-variable@v1
        with:
          name: 'PHG2_VERSION_MD5'
          value: '${{ env.PHG2_VERSION_MD5 }}'
          token: ${{ secrets.PHGV2CD }}
```
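The `Calculate MD5 of package` step can be sanity-checked locally. Below is a minimal sketch (using a stand-in temp file, not the real release tarball) of the same `md5sum | cut` pipeline the workflow runs:

```shell
# Compute a digest the same way the workflow step does:
# md5sum prints "<digest>  <filename>", and cut keeps only the digest.
tmpfile=$(mktemp)
printf 'demo payload' > "$tmpfile"
digest=$(md5sum "$tmpfile" | cut -d ' ' -f 1)

# In the Action, this line is appended to $GITHUB_ENV rather than printed.
echo "PHG2_VERSION_MD5=$digest"
rm -f "$tmpfile"
```

An MD5 digest is always 32 lowercase hex characters, which is a quick way to confirm the variable was populated.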
Changes to `README.md` (`@@ -1,4 +1,7 @@`):

```markdown
# PHG version 2
> [!TIP]
> The quickest way to install PHG2 is from Conda: `conda install -c conda-forge -c maize-genetics phg2`
```

> **Review comment:** Put this tip under the badges.

```markdown
[](https://github.com/maize-genetics/phg_v2/actions/workflows/phgv2_ci.yml) [](https://codecov.io/gh/maize-genetics/phg_v2) [](https://opensource.org/licenses/Apache-2.0)

The Practical Haplotype Graph (PHG) is a powerful tool for representing pangenomes. The PHG is optimized for plant breeding and genetics, where genomic diversity can be high, phased haplotypes are common (e.g., inbred lines), and imputation with low-density markers is essential for breeding efficiency. It is a powerful complement to excellent tools such as [BEAGLE](https://faculty.washington.edu/browning/beagle/beagle.html), which are used extensively in low-diversity, unphased species with high-density genotyping.
```
@@ -7,7 +10,31 @@ The PHG is a trellis graph based representation of genic and intergenic regions

```markdown
[PHGv1](https://bitbucket.org/bucklerlab/practicalhaplotypegraph/wiki/Home) was [published in 2022](https://doi.org/10.1093/bioinformatics/btac410). It addressed many challenges related to aligning diverse genomes, efficient storage, and imputation across a pangenome. However, it depended on a custom relational database that necessitated unique formats, and database queries did not scale effectively with a large number of taxa and rare alleles. Moreover, after developing PHGs for six species, we identified significant opportunities to refine and streamline the platform for curation.
```
````markdown
# PHGv2 design
## Quick Start
> [!IMPORTANT]
> PHG2 is still under active development; please check back regularly for updates!

You can download PHG2 from [the releases](https://github.com/maize-genetics/phg_v2/releases) page, but we recommend installing through Conda:

```
conda install --channel conda-forge --channel maize-genetics phg2
```

We recommend creating a dedicated Conda environment for each of your bioinformatics projects rather than installing directly into the base environment; please refer to Conda's documentation [on managing environments](https://conda.io/projects/conda/en/latest/user-guide/getting-started.html#managing-environments). To create an environment and install PHG2 in a single command:

```
conda create --name my_env --channel conda-forge --channel maize-genetics phg2
```

If you've already installed PHG2, you can update to the newest version via:

```
conda update phg2
```
````

> **Review comment:** Based on these changes, the documentation indicates the user can only execute phg from the Conda setup (the original `./phg` commands have been removed). Did we decide Conda is an option, or that Conda is the only option for installing phg? If the latter, the instructions on the PHGv2 Building and Loading page have an inconsistency. Is our program named "phg" or "phg2"? (We need to make a collective decision.) It looks like we'll have two environments: one that contains the phg(2) executable, and one that is created by phg(2) to run agc, tiledbvcf, etc. Is that correct?
>
> **Reply:** BTW, I'm not opposed to loading phg only via Conda. I just want to ensure our documentation has consistent examples and that we're all on the same page.
>
> **Reply:** I don't think we decided anything; this is a question for @zrm22 or the group, same as the naming question. There is only a single environment,
>
> **Reply:** I think for now I would prefer explaining both the Conda install and the tarball installation instructions. I agree that, long term, if the user installs through Conda we should only have one environment (no more need for SetupEnvironment) with all the dependencies, but I think it gets tricky: every time we make a call to anchorwave, TileDB, bgzip, or bcftools, we need to wrap the command that ProcessBuilder executes in a `conda run` command where we specify the environment name. The way we have this set up is consistent because we give the environment a name. If users create their own environment and just add the PHG Conda package to it, we no longer have this control. To fix that, either ProcessBuilder first has to figure out which environments exist and determine which one is the PHG one (what happens if you have more than one?), or the user has to submit the name as a parameter (which would need to be on basically every command). We may be able to have the user create an environment variable that gets picked up by Clikt automatically, but it is hard to say.

```markdown
Please refer to the [building and loading documentation](docs/build_and_load.md) for instructions on getting started with PHG2.
```
```markdown
## Design
The redesign leverages the powerful TileDB-VCF database, which is widely used in human genetics for extensive medical applications and is highly performant for rapid querying and storage of rare variants. The PHG is now backed by two TileDB-VCF databases: one for tracking haplotypes across all samples (h.vcf), and another for tracking variants relative to either the reference genome or the closest haplotype (g.vcf). Our implementation of haplotype encoding in VCF relies heavily on the VCF ALT haplotype specification [v4.3](http://samtools.github.io/hts-specs/VCFv4.3.pdf).

* High-quality phased genome assemblies (or similar) are available to initialize the PHG.
```

@@ -17,7 +44,7 @@ The redesign leverages the powerful TileDB-VCF database, which is widely used in

```markdown
* Genotyping with low-density markers is now done using a memory- and speed-efficient k-mer approach, followed by pathfinding (imputation) with HMM, BWT, or our ML model.
* Rare allele discovery with short reads is based on the above path, involving short-read alignment to the inferred haplotype path genome and the GATK haplotype caller.
```
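To make the ALT-haplotype encoding concrete, here is a schematic h.vcf record. The checksum-style ID, the sample name, and the INFO field are hypothetical illustrations, not values from a real PHGv2 file; the point is the mechanism VCF v4.3 defines, where a haplotype is declared in a `##ALT` header line and referenced as a symbolic `<ID>` allele in the ALT column:

```
##fileformat=VCFv4.3
##ALT=<ID=2b4590f722ff,Description="haplotype data for line: LineA">
#CHROM  POS   ID  REF  ALT             QUAL  FILTER  INFO       FORMAT  LineA
chr1    1001  .   A    <2b4590f722ff>  .     .       END=15000  GT      1
```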
```markdown
# PHG terminology
## Terminology

Reference genome - the genome used for initial alignment and base coordinates
Reference range - a segment of the reference genome
```

@@ -27,56 +54,3 @@ The redesign leverages the powerful TileDB-VCF database, which is widely used in

```markdown
Composite Reference Haplotypes

More information on terminology can be found [here](docs/terminology.md).
```
Removed section:

```markdown
# Example usage
```

> **Review comment:** These need to be brought back in. We are making an effort so the user does not need to jump through 3-4 pages just to get basic information to run the software. Obviously this is not all the documentation, but having the 10-15 commands that need to be run on the main page of the repo is a good idea for what we are trying to do here.

````markdown
To populate the database:
```
## Setup conda environment
./phg setup-environment

## Initialize DBs
./phg initdb --db-path /path/to/dbs

## Preprocess data
./phg annotate-fastas --keyfile /path/to/keyfile --output-dir /path/to/annotated/fastas --threads numberThreadstoRun

## Build VCF data
./phg create-ranges --reference-file Ref.fa --gff my.gff --boundary gene --pad 500 -o /path/to/bed/file.bed
./phg align-assemblies --gff anchors.gff --reference-file Ref.fa -a assembliesList.txt --total-threads 20 --in-parallel 4 -o /path/for/generatedFiles
./phg agc-compress --db-path /path/to/dbs --reference-file /my/ref.fasta --fasta-list /my/assemblyFastaList.txt
./phg create-ref-vcf --bed /my/bed/file.bed --reference-file /my/ref.fasta --reference-url https://url-for-ref --reference-name B73 --output-dir /path/to/vcfs
./phg create-maf-vcf --db-path /path/to/dbs --bed /my/bed/file.bed --reference-file /my/ref.fasta --maf-dir /my/maf/files -o /path/to/vcfs

## Load data into DBs
./phg load-vcf --vcf /my/vcf/dir --dbpath /path/to/dbs
```

```
## Index
./phg index-kmers --ancestor founder.h.vcf -o kmer_index.map // we need this

## Map
./phg map-kmers \
    --kmer-index kmer_index.map \
    --reads my_reads.fastq \ // possibly thousands of samples being inputted
    --output read_count_out.map \ // could we pipe this into impute method? // thousands of outputs
    // consider batch interface here ^^

## Impute
./phg impute \
    --hap-counts read_count_out.map \ // will users understand the di
    --diploid false \
    --ancestor founder.h.vcf \
    --max-anc-hap-num 20 \
    --max-anc-hap-prop 0.95 \
    --output-parent best_parents.txt \
    -o my_impute.h.vcf

## Load
./phg load-vcf --vcf my_impute.vcf --dbpath /my/db/uri
```

```
## Export from Tiledb
./phg export-vcf --db-path /my/db/uri --dataset-type hvcf --sample-Names LineA,LineB --output-dir /my/output/dir
```
````
The Conda build script (new file, `@@ -0,0 +1,13 @@`):

```bash
#!/usr/bin/env bash
mkdir -p $PREFIX/bin

PHG2_DIR=$PREFIX/share/phg2-$PKG_VERSION-$PKG_BUILDNUM/

mkdir -p $PHG2_DIR

mv $SRC_DIR/* $PHG2_DIR

# Soft link to "point" to the phg script, as a hard link
# leads to being unable to find the jars in lib/
# Helpful: https://stackoverflow.com/a/29786294
ln -s $PHG2_DIR/bin/phg $PREFIX/bin/phg2
```
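The soft-vs-hard-link point in that comment can be demonstrated in isolation. This is a sketch with a hypothetical `share/app` layout (not the actual phg launcher): a launcher invoked through a soft link can resolve `$0` back to its real install directory and find `lib/`, whereas a hard link would resolve to `bin/` itself, where there is no `lib/`.

```shell
root=$(mktemp -d)
mkdir -p "$root/share/app/bin" "$root/share/app/lib" "$root/bin"
touch "$root/share/app/lib/app.jar"

# Stand-in launcher that locates jars relative to its *resolved* path,
# the way a Gradle-style start script does.
cat > "$root/share/app/bin/run" <<'EOF'
#!/usr/bin/env bash
real=$(readlink -f "$0")          # follow the symlink back to the real file
ls "$(dirname "$real")/../lib"    # lib/ is found next to the real location
EOF
chmod +x "$root/share/app/bin/run"

# Soft link into bin/, mirroring the ln -s in the build script.
ln -s "$root/share/app/bin/run" "$root/bin/run"
out=$("$root/bin/run")
echo "$out"   # app.jar

# With a hard link, readlink -f "$0" would return "$root/bin/run" itself,
# so "../lib" would not exist; that is why the build script uses ln -s.
rm -rf "$root"
```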
The Conda recipe (new file, `@@ -0,0 +1,31 @@`):

```yaml
{% set name = "PHG2" %}

package:
  name: {{ name|lower }}
  version: "{{ PHG2_VERSION }}"

source:
  url: https://github.com/maize-genetics/phg_v2/releases/download/{{ PHG2_VERSION }}/PHGv2-v{{ PHG2_RELEASE }}.tar
  md5: "{{ PHG2_VERSION_MD5 }}"

build:
  noarch: generic

# NOTE: Dependencies like AnchorWave are not included here
# since they are used within the phgv2-conda environment.
# In the future, if it's desired to move away from that design
# decision, those dependencies can be added and the reliance
# on a bespoke environment for execution can be removed.
requirements:
  run:
    - openjdk >=17.0

test:
  commands:
    - 'phg2 | grep "Usage: phg"'

about:
  home: https://github.com/maize-genetics/phg_v2
  license: Apache-2.0
  summary: 'The Practical Haplotype Graph (PHG) is a powerful tool for representing pangenomes.'
  dev_url: https://github.com/maize-genetics/phg_v2
```
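The `{{ PHG2_VERSION }}`-style placeholders in the recipe are Jinja expressions that conda-build fills in at build time from the repository variables the release workflow sets. As a rough illustration only, with `sed` standing in for the Jinja rendering step and hypothetical version values:

```shell
PHG2_VERSION="2.0.18"   # hypothetical values, not a real release
PHG2_RELEASE="2.0.18"

template='https://github.com/maize-genetics/phg_v2/releases/download/{{ PHG2_VERSION }}/PHGv2-v{{ PHG2_RELEASE }}.tar'
rendered=$(printf '%s' "$template" \
  | sed -e "s|{{ PHG2_VERSION }}|$PHG2_VERSION|" \
        -e "s|{{ PHG2_RELEASE }}|$PHG2_RELEASE|")
echo "$rendered"
```

Whatever values the `set-variable` steps wrote after the most recent release are what the next Conda build will see.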
> **Review comment:** Will you update the README.md installation section with information on how to install this? Brandon is writing the detailed documentation, but individually we are updating the simple usage.
>
> **Reply:** Sure, can do!
>
> **Reply:** Let me know what you think. We can add more to the Quick Start in the future; I wasn't sure how much we wanted included since things are still in flux.
>
> **Reply:** Here is my concern: if they pull the apps into the base Conda environment, then the programs are seen via ProcessBuilder(). I think this is what you originally tested. I did not understand whether this worked when our Clikt commands precede the ProcessBuilder() commands with a Conda environment setting. Do all environments inherit what is in the base environment? If yes, then this was OK. But I think we agreed we don't want to encourage adding to the base environment. If users create a new Conda environment and install into that, this will not work with the ProcessBuilder() commands unless the environment is one that we know. So either the user should always create a Conda environment named `phgv2-conda`, or they will have to pass a new parameter with the Conda environment name to every class that has a ProcessBuilder call. Let me know if I'm missing something here.
>
> **Reply:** Please see here for a response. I wanted to make sure everybody was on the same page and understood the discussion from this morning.