33 changes: 33 additions & 0 deletions .github/workflows/build_upload_conda.yml
@@ -0,0 +1,33 @@
name: Build and Upload Conda Package When Release is Published

# Triggers when a release is made (in our case by the run_deploy_on_merge.yml action)
on:
  workflow_dispatch:
  release:
    types: [published]

env:
  ANACONDA_API_TOKEN: ${{ secrets.CONDA_TOKEN }}
  PHG2_VERSION: ${{ vars.PHG2_VERSION }}
  PHG2_RELEASE: ${{ vars.PHG2_RELEASE }}
  PHG2_VERSION_MD5: ${{ vars.PHG2_VERSION_MD5 }}

jobs:
  build-upload-conda:
Collaborator

Will you update the README.md installation section with information on how to install this? Brandon is writing the detailed documentation, but individually we are updating the simple usage.

Author

Sure can do!

Author

Let me know what you think.

We can add more to the Quick Start in future - I wasn't sure how much we wanted included since stuff is still in flux.

Collaborator

Here is my concern: if they pull the apps into the base conda environment, then the programs are seen via ProcessBuilder(). I think this is what you originally tested. I did not understand whether this worked when our Clikt commands precede the ProcessBuilder() commands with a conda environment setting. Do all environments inherit what is in the base environment? If yes, then this was OK.

But I think we agreed we don't want to encourage adding to the base environment.

If users create a new conda environment and install into that, this will not work with the ProcessBuilder() commands unless the environment is one that we know. So either the user should always create a conda environment named phgv2-conda, or they will have to pass a new parameter with the conda environment name to every class that has a ProcessBuilder call.

Let me know if I'm missing something here.

Author

Please see here for a response - I wanted to make sure everybody was on the same page and understood the discussion from this morning.
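
To make the concern in this thread concrete, here is a minimal shell sketch of what "wrapping" an external tool call in a named conda environment looks like. It only prints the command line instead of running it, so it can be tried without conda installed. The helper name `conda_wrap` is hypothetical and not part of PHG; `conda run -n` is the standard conda subcommand for executing a command inside a named environment.

```shell
#!/usr/bin/env bash
# Sketch only: prints the command line that would be executed, so the
# wrapping can be inspected without conda being installed.
# Usage: conda_wrap ENV_NAME COMMAND [ARGS...]
# (conda_wrap is a hypothetical helper, not part of PHG.)
conda_wrap() {
    local env_name="$1"
    shift
    # `conda run -n NAME CMD` executes CMD inside the environment called NAME.
    echo "conda run -n $env_name $*"
}

conda_wrap phgv2-conda bcftools --version
```

If the environment name is fixed (phgv2-conda), the first argument can be hard-coded; if users choose their own environment, every call site needs to know that name, which is the trade-off raised above.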

    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Setup Conda
        uses: conda-incubator/setup-miniconda@v3
        with:
          auto-update-conda: true
          python-version: 3.11
Collaborator

Is there a reason to use python 3.11 versus something newer?

Author

The newest version is 3.12 (3.13 is still prerelease) so we are pretty "bleeding edge" here. The Python version doesn't really matter all that much for this anyway, I simply included it so that it's explicit in case it breaks in the future.

Hugging Face uses 3.8 in their v2 usage. If you go to the setup-miniconda repo their examples include a variety of versions.

Collaborator

My mistake. I was thinking this was 2.11, not 3.11. Ignore my comment.

          channels: conda-forge
      - name: Install Conda build dependencies
        run: |
          conda install -c defaults anaconda-client conda-build
          conda config --set anaconda_upload yes
      - name: Build and upload package
        run: |
          conda build ./conda
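
The build step depends on the three repository variables exported in the `env:` block (`PHG2_VERSION`, `PHG2_RELEASE`, `PHG2_VERSION_MD5`). As a hedged illustration, not something this PR contains, a small guard could report any that are unset or empty before `conda build` runs. The function name `missing_vars` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: missing_vars is a hypothetical helper, not part of this PR.
# Usage: missing_vars NAME [NAME...]
# Prints the name of every listed variable that is unset or empty.
missing_vars() {
    local name value
    for name in "$@"; do
        # Indirect lookup: read the variable whose name is stored in $name.
        # Only pass trusted, literal variable names to this function.
        eval "value=\${$name:-}"
        [ -n "$value" ] || echo "$name"
    done
}
```

In a workflow, the output of `missing_vars PHG2_VERSION PHG2_RELEASE PHG2_VERSION_MD5` could be checked for emptiness and the job failed early with a clear message if it is not.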
20 changes: 20 additions & 0 deletions .github/workflows/run_deploy_on_merge.yml
@@ -152,3 +152,23 @@ jobs:
${{ steps.matching_release.outputs.body }}
${{ env.COMMIT_MSG }}

      # Update repository variables based on new release version and its MD5 hash
      # These are used by the Conda package build Action
      - uses: action-pack/set-variable@v1
        with:
          name: 'PHG2_VERSION'
          value: '${{ env.VERSION }}'
          token: ${{secrets.PHGV2CD}}
Collaborator

I realize "secrets.<>" is used in existing code. Where are the secret values stored (i.e., what file is accessed)? Do we have these created for maizegenetics.net, or is this per project?

Collaborator

Never mind - I see from your Slack post that this was generated for maize-genetics.

      - uses: action-pack/set-variable@v1
        with:
          name: 'PHG2_RELEASE'
          value: '${{ env.RELEASE }}'
          token: ${{secrets.PHGV2CD}}
      - name: Calculate MD5 of package
        run: |
          echo "PHG2_VERSION_MD5=$(md5sum PHGv2-v${{ env.RELEASE }}.tar | cut -d ' ' -f 1)" >> $GITHUB_ENV
      - uses: action-pack/set-variable@v1
        with:
          name: 'PHG2_VERSION_MD5'
          value: '${{ env.PHG2_VERSION_MD5 }}'
          token: ${{secrets.PHGV2CD}}
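
The MD5 step above pipes `md5sum` through `cut` to keep only the hex digest, because `md5sum` prints the digest followed by the file name. A small sketch of the same extraction, plus a check that a file matches an expected digest, is shown below. The function names `md5_of` and `verify_md5` are hypothetical, and the sketch assumes GNU `md5sum` is available, as it is on the Ubuntu runner.

```shell
#!/usr/bin/env bash
# Sketch only: md5_of and verify_md5 are hypothetical helpers.
# Assumes GNU coreutils `md5sum`.

# Usage: md5_of FILE
# Prints only the hex digest (md5sum prints "<digest>  <file>").
md5_of() {
    md5sum "$1" | cut -d ' ' -f 1
}

# Usage: verify_md5 FILE EXPECTED_DIGEST
# Succeeds (exit status 0) only if the file's digest equals EXPECTED_DIGEST.
verify_md5() {
    [ "$(md5_of "$1")" = "$2" ]
}
```

Running `verify_md5` on the downloaded release tarball against the stored `PHG2_VERSION_MD5` value would be one way to confirm locally that the repository variable and the uploaded tarball agree.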
84 changes: 29 additions & 55 deletions README.md
@@ -1,4 +1,7 @@
# PHG version 2
> [!TIP]
Collaborator

Put this tip under the badges.

> The quickest way to install PHG2 is from Conda: `conda install -c conda-forge -c maize-genetics phg2`

[![PHGv2 CI](https://github.com/maize-genetics/phg_v2/actions/workflows/phgv2_ci.yml/badge.svg)](https://github.com/maize-genetics/phg_v2/actions/workflows/phgv2_ci.yml) [![codecov](https://codecov.io/gh/maize-genetics/phg_v2/graph/badge.svg?token=4BVD2QXQ1A)](https://codecov.io/gh/maize-genetics/phg_v2) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

The Practical Haplotype Graph (PHG) is a powerful tool for representing pangenomes. The PHG is optimized for plant breeding and genetics, where genomic diversity can be high, phased haplotypes are common (e.g. inbred lines), and imputation with low density markers is essential for breeding efficiency. This is a powerful complement to excellent tools such as [BEAGLE](https://faculty.washington.edu/browning/beagle/beagle.html), which is used extensively in low diversity, unphased species with high density genotyping.
@@ -7,7 +10,31 @@ The PHG is a trellis graph based representation of genic and intergenic regions

[PHGv1](https://bitbucket.org/bucklerlab/practicalhaplotypegraph/wiki/Home) was [published in 2022](https://doi.org/10.1093/bioinformatics/btac410). It addressed many challenges related to aligning diverse genomes, efficient storage, and imputation across a pangenome. However, it depended on a custom relational database that necessitated unique formats, and database queries did not scale effectively with a large number of taxa and rare alleles. Moreover, after developing PHGs for six species, we identified significant opportunities to refine and streamline the platform for curation.

# PHGv2 design
## Quick Start
> [!IMPORTANT]
> PHG2 is still under active development; please check back regularly for updates!

You can download PHG2 from [the releases](https://github.com/maize-genetics/phg_v2/releases) page, but we recommend installing through Conda:

```
conda install --channel conda-forge --channel maize-genetics phg2
```

We recommend creating a dedicated Conda environment for each of your bioinformatics projects, as opposed to installing directly into the base environment; please refer to Conda's documentation [on managing environments](https://conda.io/projects/conda/en/latest/user-guide/getting-started.html#managing-environments). To create an environment and install PHG2 in a single command, do as follows:

```
conda create --name my_env --channel conda-forge --channel maize-genetics phg2
```

If you've already installed PHG2, you can update to the newest version via:

```
conda update phg2
```
Collaborator

Based on these changes, the documentation indicates the user can only execute phg from the conda setup (the original ./phg commands have been removed).

Did we decide Conda is an option, or did we decide conda is the only option for installing phg?

If the latter, the instructions in "PHGv2 - Building and Loading" have an inconsistency. Is our program named "phg" or "phg2"? (We need to make a collective decision.)

It looks like we'll have 2 environments - one that contains the phg(2) executable, and one that is created by phg(2) to run agc, tiledbvcf, etc. Is that correct?

Collaborator

BTW, I'm not opposed to loading phg only via conda. I just want to ensure our documentation has consistent examples and that we're all on the same page.

Author

> Did we decide Conda is an option, or did we decide conda is the only option for installing phg?

I don't think we decided anything; this is a question for @zrm22 or the group, same as the naming question.

> It looks like we'll have 2 environments - one that contains the phg(2) executable, and one that is created by phg(2) to run agc, tiledbvcf, etc. Is that correct?

There is only a single environment, phgv2-conda. The other environment containing the actual program (the Conda package) isn't really relevant, as I wrote in my long comment above.

Collaborator

I think for now I would prefer explaining both the conda install and the tarball installation instructions.

I agree that, long term, if the user installs through conda we should have only one environment (no more need for SetupEnvironment) with all the dependencies. But it gets tricky: every time we call anchorwave, tileDB, bgzip, or bcftools, we need to wrap the command that ProcessBuilder executes in a conda run command, where we need to specify the environment name.

The way we have this set up is consistent because we give the environment a name. If the user needs to create their own environment and then just add the PHG conda env to it, we no longer have this control.

To fix this, either ProcessBuilder first has to figure out which envs exist and try to determine which one is the PHG one (what happens if you have more than one?), or the user has to submit the name as a param (this would basically need to be in every command). We may be able to have the user create an environment variable that gets picked up by Clikt automatically, but it is hard to say.
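
The two options discussed here (discovering the environment versus having the user name it) can be sketched in shell. This is an illustration under stated assumptions, not a proposal for the actual implementation: `PHG_CONDA_ENV` is a hypothetical variable name, `resolve_env_name` and `env_exists` are hypothetical helpers, and `env_exists` assumes the usual `conda env list` layout in which the environment name is the first column.

```shell
#!/usr/bin/env bash
# Sketch only: PHG_CONDA_ENV, resolve_env_name and env_exists are
# hypothetical names used for illustration; none are part of PHG.

# Usage: resolve_env_name [EXPLICIT_NAME]
# Precedence: explicit argument, then $PHG_CONDA_ENV, then the default
# name phgv2-conda used in this discussion.
resolve_env_name() {
    if [ -n "${1:-}" ]; then
        echo "$1"
    elif [ -n "${PHG_CONDA_ENV:-}" ]; then
        echo "$PHG_CONDA_ENV"
    else
        echo "phgv2-conda"
    fi
}

# Usage: conda env list | env_exists NAME
# Reads an environment listing on stdin and succeeds if NAME appears as
# the first column of any line.
env_exists() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}
```

Reading the listing from stdin keeps the check testable without conda; in practice it would be fed by `conda env list`. Having an explicit precedence order means a per-command parameter is only needed when the user wants to override the environment variable or the default.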


Please refer to the [building and loading documentation](docs/build_and_load.md) for instructions on getting started with using PHG2.

## Design
The redesign leverages the powerful TileDB-VCF database, which is widely used in human genetics for extensive medical applications and is highly performant for rapid querying and storage of rare variants. The PHG is now backed by two TileDB-VCF databases: one for tracking haplotypes across all samples (h.vcf), and another for tracking variants relative to either the reference genomes or the closest haplotype (g.vcf). Our implementation of haplotype encoding in VCF heavily relies on the VCF ALT haplotype specification [v4.3](http://samtools.github.io/hts-specs/VCFv4.3.pdf).

* High-quality phased genome assemblies (or similar) are available to initialize the PHG.
@@ -17,7 +44,7 @@ The redesign leverages the powerful TileDB-VCF database, which is widely used in
* Genotyping with low-density markers is now done using a memory- and speed-efficient kmer approach, followed by pathfinding (imputation) with HMM, BWT, or our ML model.
* Rare allele discovery with short reads is based on the above path, involving short read alignment to the inferred haplotype path genome and the GATK haplotype caller.

# PHG terminology
## Terminology

Reference genome - the genome used for initial alignment and base coordinates
Reference range - a segment of the reference genome
@@ -27,56 +54,3 @@ The redesign leverages the powerful TileDB-VCF database, which is widely used in
Composite Reference Haplotypes

More information on terminology can be found [here](docs/terminology.md).

# Example usage
Collaborator

These need to be brought back in. We are making an effort so the user does not need to jump through 3-4 pages just to get basic information to run the software. Obviously this is not all the documentation, but having the 10-15 commands that need to be run at the main page of the Repo is a good idea for what we are trying to do here.

To populate that database
```
## Setup conda environment
./phg setup-environment

## Initialize DBs
./phg initdb --db-path /path/to/dbs

## Preprocessing data
./phg annotate-fastas --keyfile /path/to/keyfile --output-dir /path/to/annotated/fastas --threads numberThreadstoRun

## Build VCF data
./phg create-ranges --reference-file Ref.fa --gff my.gff --boundary gene --pad 500 -o /path/to/bed/file.bed
./phg align-assemblies --gff anchors.gff --reference-file Ref.fa -a assembliesList.txt --total-threads 20 --in-parallel 4 -o /path/for/generatedFiles
./phg agc-compress --db-path /path/to/dbs --reference-file /my/ref.fasta --fasta-list /my/assemblyFastaList.txt
./phg create-ref-vcf --bed /my/bed/file.bed --reference-file /my/ref.fasta --reference-url https://url-for-ref --reference-name B73 --output-dir /path/to/vcfs
./phg create-maf-vcf --db-path /path/to/dbs --bed /my/bed/file.bed --reference-file /my/ref.fasta --maf-dir /my/maf/files -o /path/to/vcfs

## Load data into DBs
./phg load-vcf --vcf /my/vcf/dir --dbpath /path/to/dbs
```

```
## Index
# Note: we need this
./phg index-kmers --ancestor founder.h.vcf -o kmer_index.map

## Map
# Note on --reads: possibly thousands of samples being inputted
# Note on --output: could we pipe this into impute method? thousands of outputs;
# consider batch interface here
./phg map-kmers \
    --kmer-index kmer_index.map \
    --reads my_reads.fastq \
    --output read_count_out.map

## Impute
# Note on --hap-counts: will users understand the di
./phg impute \
    --hap-counts read_count_out.map \
    --diploid false \
    --ancestor founder.h.vcf \
    --max-anc-hap-num 20 \
    --max-anc-hap-prop 0.95 \
    --output-parent best_parents.txt \
    -o my_impute.h.vcf

## Load
./phg load-vcf --vcf my_impute.vcf --dbpath /my/db/uri
```

```
## Export from Tiledb
./phg export-vcf --db-path /my/db/uri --dataset-type hvcf --sample-Names LineA,LineB --output-dir /my/output/dir
```

13 changes: 13 additions & 0 deletions conda/build.sh
@@ -0,0 +1,13 @@
#!/usr/bin/env bash
mkdir -p "$PREFIX/bin"

PHG2_DIR="$PREFIX/share/phg2-$PKG_VERSION-$PKG_BUILDNUM"

mkdir -p "$PHG2_DIR"

mv "$SRC_DIR"/* "$PHG2_DIR"

# Symbolic (soft) link that "points" to the phg script; a hard link
# leads to the script being unable to find the jars in lib/
# Helpful: https://stackoverflow.com/a/29786294
ln -s "$PHG2_DIR/bin/phg" "$PREFIX/bin/phg2"
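
The comment in build.sh about soft versus hard links can be reproduced with a throwaway directory layout. The sketch below assumes the launcher locates `lib/` relative to its own resolved path (here via `readlink -f`, which needs GNU coreutils or a compatible implementation); whether the real `phg` script resolves links in exactly this way is an assumption. The names `demo_symlink_resolution`, `tool`, and `app.jar` are hypothetical stand-ins.

```shell
#!/usr/bin/env bash
# Sketch only: builds a temporary layout mirroring $PHG2_DIR/{bin,lib}
# and $PREFIX/bin, then runs a stand-in launcher through both link types.
demo_symlink_resolution() {
    local root
    root="$(mktemp -d)"
    mkdir -p "$root/share/app/bin" "$root/share/app/lib" "$root/bin"
    touch "$root/share/app/lib/app.jar"

    # Stand-in launcher: resolves its real location, then looks for ../lib.
    cat > "$root/share/app/bin/tool" <<'EOF'
#!/bin/sh
self="$(readlink -f "$0")"
app_home="$(cd "$(dirname "$self")/.." && pwd -P)"
if [ -f "$app_home/lib/app.jar" ]; then
    echo "found lib"
else
    echo "missing lib"
fi
EOF

    ln -s "$root/share/app/bin/tool" "$root/bin/tool-soft"  # symbolic link
    ln "$root/share/app/bin/tool" "$root/bin/tool-hard"     # hard link

    # The symbolic link resolves back to share/app/bin, so lib/ is found.
    sh "$root/bin/tool-soft"
    # The hard link is an ordinary file in bin/, so ../lib is the wrong place.
    sh "$root/bin/tool-hard"

    rm -rf "$root"
}
```

Calling `demo_symlink_resolution` prints one line per link type. A hard link has no target path to resolve, so the launcher computes its home directory from wherever the link itself lives.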
31 changes: 31 additions & 0 deletions conda/meta.yaml
@@ -0,0 +1,31 @@
{% set name = "PHG2" %}

package:
  name: {{ name|lower }}
  version: "{{ PHG2_VERSION }}"

source:
  url: https://github.com/maize-genetics/phg_v2/releases/download/{{ PHG2_VERSION }}/PHGv2-v{{ PHG2_RELEASE }}.tar
  md5: "{{ PHG2_VERSION_MD5 }}"

build:
  noarch: generic

# NOTE: Dependencies like AnchorWave are not included here
# since they are used within the phgv2-conda environment.
# In future if it's desired to move away from that design
# decision those dependencies can be added and the reliance
# on a bespoke environment for execution can be removed.
requirements:
  run:
    - openjdk >=17.0

test:
  commands:
    - 'phg2 | grep "Usage: phg"'

about:
  home: https://github.com/maize-genetics/phg_v2
  license: Apache-2.0
  summary: 'The Practical Haplotype Graph (PHG) is a powerful tool for representing pangenomes.'
  dev_url: https://github.com/maize-genetics/phg_v2