diff --git a/README.md b/README.md
index 0229a3157..71ec59529 100644
--- a/README.md
+++ b/README.md
@@ -6,29 +6,52 @@ This branch contains test data to be used for automated testing with the [nf-cor

 ### `samplesheet/`

+This directory contains the sample sheets used to test different configurations of the `nf-core/viralrecon` pipeline.
+
 #### `samplesheet_test_nanopore.csv`

-Sample information sheet required to test the pipeline containing sample names and barcodes for MinION data hosted in this repository. For testing purposes, some barcodes have been appended to this samplesheet that may not necessarily have associated data. The raw data associated with this run can be found in [`nanopore/minion`](nanopore/minion), and has been sub-setted to include a maximum of 3 `fast5`/`fastq`files per barcode.
+Sample sheet for Nanopore test data.
+Includes SARS-CoV-2 sample names and barcodes for the MinION data hosted in this repository.
+For testing purposes, some barcodes are included without associated data.
+Raw data can be found in [`nanopore/minion`](nanopore/minion) and have been subsetted to include a maximum of three `fast5` or `fastq` files per barcode.

 #### `samplesheet_test_sra.csv`

-Sample information sheet required to test the pipeline containing sample information of one link to a original full FastQ files and two files that must be downloaded from SRA one single-end and one paired-end, respectively. This sample sheet corresponds to Illumina SISPA data.
+Sample sheet for SISPA-based Illumina data.
+Contains SARS-CoV-2 SISPA probe-enriched sample information, including one link to original full FastQ files and two datasets to be downloaded from SRA (one single-end and one paired-end).
+Used to test SRA-based inputs and mixed dataset handling.

 #### `samplesheet_test_illumina_sispa.csv`

-Sample information sheet required to test the pipeline containing sample information and links to FastQ files stored in this repository. This sample sheet corresponds to Illumina SISPA data.
+Sample sheet for small-scale SISPA test data.
+Contains SARS-CoV-2 SISPA probe-enriched sample information and links to corresponding FastQ files.

 #### `samplesheet_test_illumina_amplicon.csv`

-Sample information sheet required to test the pipeline containing sample information and links to FastQ files stored in this repository. This sample sheet corresponds to Illumina amplicon data.
+Sample sheet for small-scale amplicon test data.
+Includes SARS-CoV-2 amplicon primer-enriched sample information and links to FastQ files hosted in this repository.

 #### `samplesheet_full_illumina_sispa.csv`

-Sample information sheet required to test the pipeline containing sample information and links to original full FastQ files. This sample sheet corresponds to Illumina SISPA data.
+Sample sheet for full-scale SISPA test data.
+Contains SARS-CoV-2 SISPA probe-enriched sample information and links to corresponding FastQ files.

 #### `samplesheet_full_illumina_amplicon.csv`

-Sample information sheet required to test the pipeline containing sample information and links to original full FastQ files. This sample sheet corresponds to Illumina amplicon data.
+Sample sheet for full-scale amplicon test data.
+Includes SARS-CoV-2 amplicon primer-enriched sample information and links to FastQ files hosted in this repository.
+
+#### `samplesheet_full_illumina_fragmented.csv`
+
+Sample sheet for fragmented genome tests.
+Contains Crimean-Congo hemorrhagic fever virus sample information and links to FastQ files stored in this repository.
+Used to assess pipeline performance on non-contiguous viral genomes.
+
+#### `v3.0/samplesheet_test_hiv.csv`
+
+Sample sheet for HIV test data.
+Contains HIV Illumina amplicon primer-enriched sample information and links to FastQ files stored in this repository.
+Data originate from multiple SRA experiments and are treated as non-amplicon data for testing purposes.
 ### `genome/`

@@ -78,7 +101,35 @@ kraken2-build --db kraken2_hs22 --build
 - `GCA_014621585.1_ASM1462158v1_genomic..fna.gz`: Monkeypox genome fasta file downloaded directly via [NCBI FTP](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/621/585/GCA_014621585.1_ASM1462158v1/GCA_014621585.1_ASM1462158v1_genomic.fna.gz)
 - `GCA_014621585.1_ASM1462158v1_genomic..gff.gz`: Monkeypox genome GFF3 annotation file downloaded directly via [NCBI FTP](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/621/585/GCA_014621585.1_ASM1462158v1/GCA_014621585.1_ASM1462158v1_genomic.gff.gz)

-### `fastq/illumina_sispa/`
+
+#### `crimea_congo`
+
+Crimean-Congo hemorrhagic fever virus has a fragmented genome with three fragments, named S, M and L according to their size.
+
+- `crimea_congo.fasta.gz`: Crimean-Congo genome fasta file containing the S, M and L fragments: KY484036.1, KY484035.1, KY484034.1
+- `crimea_congo.gff.gz`: Crimean-Congo genome GFF3 annotation file containing annotation for the S, M and L fragments: KY484036.1, KY484035.1, KY484034.1
+
+#### `NC_001802.1`
+
+This reference was chosen based on the HIV reference dataset from [Nextclade](https://clades.nextstrain.org/dataset), which states:
+
+```
+This data set uses the NCBI reference sequence NC_001802 based on the HXB2 genome K03455. The primary reason for choosing it is to ensure amino acid substitutions in conserved proteins such as Pol are numbered consistently. Note that this sequence has a few problems, including a premature stop-codon in nef.
+```
+
+- `NC_001802.1.fasta`: Human immunodeficiency virus 1 genome fasta file downloaded from [NCBI](https://www.ncbi.nlm.nih.gov/nuccore/NC_001802.1/)
+- `NC_001802.1.gff`: Human immunodeficiency virus 1 genome GFF3 annotation file downloaded from [NCBI](https://www.ncbi.nlm.nih.gov/nuccore/NC_001802.1/)
+
+#### `codfreq`
+
+This reference was generated using the [HIV JSON profile](https://github.com/hivdb/codfreq/blob/main/profiles/HIV1.json) from the [codfreq](https://github.com/hivdb/codfreq) software.
+
+- `codfreq.fasta`: Generated from the `"refSequence"` key of the .json file.
+- `codfreq.gff`: Manually generated using the information from `"fragmentName"` and `"refRanges"` in `"fragmentConfig"`.
+
+This is the default reference for the nf-core/viralrecon HIV resistance detection protocol, so that the resulting codon frequencies and codon coverages are directly comparable to those produced by [**HIVdb**](https://hivdb.stanford.edu/hivdb/by-reads/) when interpreting resistance data.
+
+### `illumina/sispa/`

 | file | num_seqs | sum_len | min_len | avg_len | max_len | file_size | Sequencer | LibrarySource |
 | ----------------------- | -------- | --------- | ------- | ------- | ------- | --------- | ----------- | ------------- |
@@ -93,7 +144,7 @@ kraken2-build --db kraken2_hs22 --build

 > All FastQ files were sub-sampled to 0.02% of the original reads.

-### `fastq/illumina_amplicon/`
+### `illumina/amplicon/`

 | file | num_seqs | sum_len | min_len | avg_len | max_len | file_size | Sequencer | LibrarySource |
 | ------------------- | -------- | --------- | ------- | ------- | ------- | --------- | ----------- | ------------- |
@@ -104,8 +155,33 @@ kraken2-build --db kraken2_hs22 --build

 > All FastQ files were sub-sampled to 0.02% of the original reads.

+### `illumina/hiv/`
+
+These datasets were chosen because they are the example data for the [HIVdb Drug Resistance Database](https://hivdb.stanford.edu/hivdb/by-reads/):
+
+- DRR030302: Amplicon whole-genome sequencing
+- SRR4071760: Amplification of protease-RT genes
+- SRR6937100: Amplification of integrase genes
+
+| file | num_seqs | sum_len | min_len | avg_len | max_len | file_size | Sequencer | LibrarySource |
+| --------------------- | -------- | --------- | ------- | ------- | ------- | --------- | ----------- | ------------- |
+| DRR030302_1.fastq.gz | 10,512 | 2,545,102 | 40 | 242 | 251 | 1.3M | PE Illumina | Viral RNA |
+| DRR030302_2.fastq.gz | 10,512 | 2,545,205 | 40 | 242 | 251 | 1.8M | PE Illumina | Viral RNA |
+| SRR4071760_1.fastq.gz | 10,582 | 2,524,863 | 45 | 238 | 251 | 1M | PE Illumina | Synthetic |
+| SRR4071760_2.fastq.gz | 10,582 | 2,525,284 | 45 | 238 | 251 | 1.4M | PE Illumina | Synthetic |
+| SRR6937100_1.fastq.gz | 10,484 | 1,295,077 | 35 | 123 | 151 | 556K | PE Illumina | Genomic |
+| SRR6937100_2.fastq.gz | 10,484 | 1,289,631 | 33 | 123 | 151 | 612K | PE Illumina | Genomic |
+
+> Original FastQ files were sub-sampled as explained in [Sampling procedure](#sampling-procedure).
+
+### `illumina/fragmented/`
+
+TBD
+
 ## Sampling procedure

+### SARS-CoV-2
+
 Prepare a file `list.txt` with the following SRA accession numbers:

 ```
@@ -129,6 +205,21 @@ parallel 'seqkit sample -p 0.02 -s 2020 {} | pigz > {.}.fastq.gz' ::: SRR*

 The above tools are available on bioconda.

+
+### HIV
+
+After removing human genome reads, the data were downsampled using a different proportion for each run:
+
+- DRR030302: 0.025
+- SRR4071760: 0.11
+- SRR6937100: 0.55
+
+We used the following command:
+
+```bash
+seqtk sample -s100
+```
+
 ## Expected output

 TBD.
diff --git a/genome/codfreq/codfreq.fasta b/genome/codfreq/codfreq.fasta new file mode 100644 index 000000000..0415c368d --- /dev/null +++ b/genome/codfreq/codfreq.fasta @@ -0,0 +1,123 @@ +>HXB2_x_ConsensusB | generated from codfreq using this json file https://github.com/hivdb/codfreq/blob/main/profiles/HIV1.json +TGGAAGGGCTAATTCACTCCCAACGAAGACAAGATATCCTTGATCTGTGGATCTACCACACACAAGGCTACTTCCCTGAT +TGGCAGAACTACACACCAGGGCCAGGGATCAGATATCCACTGACCTTTGGATGGTGCTACAAGCTAGTACCAGTTGAGCC +AGAGAAGTTAGAAGAAGCCAACAAAGGAGAGAACACCAGCTTGTTACACCCTGTGAGCCTGCATGGAATGGATGACCCGG +AGAGAGAAGTGTTAGAGTGGAGGTTTGACAGCCGCCTAGCATTTCATCACATGGCCCGAGAGCTGCATCCGGAGTACTTC +AAGAACTGCTGACATCGAGCTTGCTACAAGGGACTTTCCGCTGGGGACTTTCCAGGGAGGCGTGGCCTGGGCGGGACTGG +GGAGTGGCGAGCCCTCAGATCCTGCATATAAGCAGCTGCTTTTTGCCTGTACTGGGTCTCTCTGGTTAGACCAGATCTGA +GCCTGGGAGCTCTCTGGCTAACTAGGGAACCCACTGCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTTCAAGTAGTGTG +TGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCG +CCCGAACAGGGACCTGAAAGCGAAAGGGAAACCAGAGGAGCTCTCTCGACGCAGGACTCGGCTTGCTGAAGCGCGCACGG +CAAGAGGCGAGGGGCGGCGACTGGTGAGTACGCCAAAAATTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCGAG +AGCGTCAGTATTAAGCGGGGGAGAATTAGATAGATGGGAAAAAATTCGGTTAAGGCCAGGGGGAAAGAAAAAATATAAAT +TAAAACATATAGTATGGGCAAGCAGGGAGCTAGAACGATTCGCAGTTAATCCTGGCCTGTTAGAAACATCAGAAGGCTGT +AGACAAATACTGGGACAGCTACAACCATCCCTTCAGACAGGATCAGAAGAACTTAGATCATTATATAATACAGTAGCAAC +CCTCTATTGTGTGCATCAAAGGATAGAGGTAAAAGACACCAAGGAAGCTTTAGAGAAGATAGAGGAAGAGCAAAACAAAA +GTAAGAAAAAAGCACAGCAAGCAGCAGCTGACACAGGAAACAGCAGCCAGGTCAGCCAAAATTACCCTATAGTGCAGAAC +CTCCAGGGGCAAATGGTACATCAGGCCATATCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTT +CAGCCCAGAAGTAATACCCATGTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAG +TGGGGGGACATCAAGCAGCCATGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCA +GTGCATGCAGGGCCTATTGCACCAGGCCAGATGAGAGAACCAAGGGGAAGTGACATAGCAGGAACTACTAGTACCCTTCA +GGAACAAATAGGATGGATGACAAATAATCCACCTATCCCAGTAGGAGAAATCTATAAAAGATGGATAATCCTGGGATTAA 
+ATAAAATAGTAAGAATGTATAGCCCTACCAGCATTCTGGACATAAGACAAGGACCAAAGGAACCCTTTAGAGACTATGTA +GACCGGTTCTATAAAACTCTAAGAGCCGAGCAAGCTTCACAGGAGGTAAAAAATTGGATGACAGAAACCTTGTTGGTCCA +AAATGCGAACCCAGATTGTAAGACTATTTTAAAAGCATTGGGACCAGCAGCTACACTAGAAGAAATGATGACAGCATGTC +AGGGAGTGGGAGGACCCGGCCATAAAGCAAGAGTTTTGGCTGAAGCAATGAGCCAAGTAACAAATTCAGCTACCATAATG +ATGCAGAGAGGCAATTTTAGGAACCAAAGAAAGACTGTTAAGTGTTTCAATTGTGGCAAAGAAGGGCACATAGCCAAAAA +TTGCAGGGCCCCTAGGAAAAAGGGCTGTTGGAAATGTGGAAAGGAAGGACACCAAATGAAAGATTGTACTGAGAGACAGG +CTAATTTTTTAGGGAAGATCTGGCCTTCCCACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGACCAGAGCCAACAGCC +CCACCAGAAGAGAGCTTCAGGTTTGGGGAAGAGACAACAACTCCCTCTCAGAAGCAGGAGCCGATAGACAAGGAACTGTA +TCCTTTAGCTTCCCTCAGATCACTCTTTGGCAACGACCCCTCGTCACAATAAAGATAGGGGGGCAACTAAAGGAAGCTCT +ATTAGATACAGGAGCAGATGATACAGTATTAGAAGAAATGAATTTGCCAGGAAGATGGAAACCAAAAATGATAGGGGGAA +TTGGAGGTTTTATCAAAGTAAGACAGTATGATCAGATACTCATAGAAATCTGTGGACATAAAGCTATAGGTACAGTATTA +GTAGGACCTACACCTGTCAACATAATTGGAAGAAATCTGTTGACTCAGATTGGTTGCACTTTAAATTTTCCCATTAGTCC +TATTGAAACTGTACCAGTAAAATTAAAGCCAGGAATGGATGGCCCAAAAGTTAAACAATGGCCATTGACAGAAGAAAAAA +TAAAAGCATTAGTAGAAATTTGTACAGAAATGGAAAAGGAAGGGAAAATTTCAAAAATTGGGCCTGAAAATCCATACAAT +ACTCCAGTATTTGCCATAAAGAAAAAAGACAGTACTAAATGGAGAAAATTAGTAGATTTCAGAGAACTTAATAAGAGAAC +TCAAGACTTCTGGGAAGTTCAATTAGGAATACCACATCCCGCAGGGTTAAAAAAGAAAAAATCAGTAACAGTACTGGATG +TGGGTGATGCATATTTTTCAGTTCCCTTAGATAAAGACTTCAGGAAGTATACTGCATTTACCATACCTAGTATAAACAAT +GAGACACCAGGGATTAGATATCAGTACAATGTGCTTCCACAGGGATGGAAAGGATCACCAGCAATATTCCAAAGTAGCAT +GACAAAAATCTTAGAGCCTTTTAGAAAACAAAATCCAGACATAGTTATCTATCAATACATGGATGATTTGTATGTAGGAT +CTGACTTAGAAATAGGGCAGCATAGAACAAAAATAGAGGAACTGAGACAACATCTGTTGAGGTGGGGATTTACCACACCA +GACAAAAAACATCAGAAAGAACCTCCATTCCTTTGGATGGGTTATGAACTCCATCCTGATAAATGGACAGTACAGCCTAT +AGTGCTGCCAGAAAAAGACAGCTGGACTGTCAATGACATACAGAAGTTAGTGGGAAAATTGAATTGGGCAAGTCAGATTT +ATGCAGGGATTAAAGTAAAGCAATTATGTAAACTCCTTAGGGGAACCAAAGCACTAACAGAAGTAATACCACTAACAGAA +GAAGCAGAGCTAGAACTGGCAGAAAACAGGGAGATTCTAAAAGAACCAGTACATGGAGTGTATTATGACCCATCAAAAGA 
+CTTAATAGCAGAAATACAGAAGCAGGGGCAAGGCCAATGGACATATCAAATTTATCAAGAGCCATTTAAAAATCTGAAAA +CAGGAAAGTATGCAAGAATGAGGGGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAAAAATAGCCACA +GAAAGCATAGTAATATGGGGAAAGACTCCTAAATTTAAACTACCCATACAAAAAGAAACATGGGAAGCATGGTGGACAGA +GTATTGGCAAGCCACCTGGATTCCTGAGTGGGAGTTTGTCAATACCCCTCCCTTAGTGAAATTATGGTACCAGTTAGAGA +AAGAACCCATAGTAGGAGCAGAAACTTTCTATGTAGATGGGGCAGCTAATAGGGAGACTAAATTAGGAAAAGCAGGATAT +GTTACTGACAGAGGAAGACAAAAAGTTGTCTCCCTAACTGACACAACAAATCAGAAGACTGAGTTACAAGCAATTCATCT +AGCTTTGCAGGATTCGGGATTAGAAGTAAACATAGTAACAGACTCACAATATGCATTAGGAATCATTCAAGCACAACCAG +ATAAAAGTGAATCAGAGTTAGTCAGTCAAATAATAGAGCAGTTAATAAAAAAGGAAAAGGTCTACCTGGCATGGGTACCA +GCACACAAAGGAATTGGAGGAAATGAACAAGTAGATAAATTAGTCAGTGCTGGAATCAGGAAAGTACTATTTTTAGATGG +AATAGATAAGGCCCAAGAAGAACATGAGAAATATCACAGTAATTGGAGAGCAATGGCTAGTGATTTTAACCTGCCACCTG +TAGTAGCAAAAGAAATAGTAGCCAGCTGTGATAAATGTCAGCTAAAAGGAGAAGCCATGCATGGACAAGTAGACTGTAGT +CCAGGAATATGGCAACTAGATTGTACACATTTAGAAGGAAAAATTATCCTGGTAGCAGTTCATGTAGCCAGTGGATATAT +AGAAGCAGAAGTTATTCCAGCAGAGACAGGGCAGGAAACAGCATACTTTCTCTTAAAATTAGCAGGAAGATGGCCAGTAA +AAACAATACATACAGACAATGGCAGCAATTTCACCAGTACTACGGTTAAGGCCGCCTGTTGGTGGGCAGGGATCAAGCAG +GAATTTGGCATTCCCTACAATCCCCAAAGTCAAGGAGTAGTAGAATCTATGAATAAAGAATTAAAGAAAATTATAGGACA +GGTAAGAGATCAGGCTGAACATCTTAAGACAGCAGTACAAATGGCAGTATTCATCCACAATTTTAAAAGAAAAGGGGGGA +TTGGGGGGTACAGTGCAGGGGAAAGAATAGTAGACATAATAGCAACAGACATACAAACTAAAGAATTACAAAAACAAATT +ACAAAAATTCAAAATTTTCGGGTTTATTACAGGGACAGCAGAGATCCACTTTGGAAAGGACCAGCAAAGCTTCTCTGGAA +AGGTGAAGGGGCAGTAGTAATACAAGATAATAGTGACATAAAAGTAGTGCCAAGAAGAAAAGCAAAGATCATTAGGGATT +ATGGAAAACAGATGGCAGGTGATGATTGTGTGGCAAGTAGACAGGATGAGGATTAGAACATGGAAAAGTTTAGTAAAACA +CCATATGTATGTTTCAGGGAAAGCTAGGGGATGGTTTTATAGACATCACTATGAAAGCCCTCATCCAAGAATAAGTTCAG +AAGTACACATCCCACTAGGGGATGCTAGATTGGTAATAACAACATATTGGGGTCTGCATACAGGAGAAAGAGACTGGCAT +TTGGGTCAGGGAGTCTCCATAGAATGGAGGAAAAAGAGATATAGCACACAAGTAGACCCTGAACTAGCAGACCAACTAAT +TCATCTGTATTACTTTGACTGTTTTTCAGACTCTGCTATAAGAAAGGCCTTATTAGGACACATAGTTAGCCCTAGGTGTG 
+AATATCAAGCAGGACATAACAAGGTAGGATCTCTACAATACTTGGCACTAGCAGCATTAATAACACCAAAAAAGATAAAG +CCACCTTTGCCTAGTGTTACGAAACTGACAGAGGATAGATGGAACAAGCCCCAGAAGACCAAGGGCCACAGAGGGAGCCA +CACAATGAATGGACACTAGAGCTTTTAGAGGAGCTTAAGAATGAAGCTGTTAGACATTTTCCTAGGATTTGGCTCCATGG +CTTAGGGCAACATATCTATGAAACTTATGGGGATACTTGGGCAGGAGTGGAAGCCATAATAAGAATTCTGCAACAACTGC +TGTTTATCCATTTTCAGAATTGGGTGTCGACATAGCAGAATAGGCGTTACTCGACAGAGGAGAGCAAGAAATGGAGCCAG +TAGATCCTAGACTAGAGCCCTGGAAGCATCCAGGAAGTCAGCCTAAAACTGCTTGTACCAATTGCTATTGTAAAAAGTGT +TGCTTTCATTGCCAAGTTTGTTTCATAACAAAAGCCTTAGGCATCTCCTATGGCAGGAAGAAGCGGAGACAGCGACGAAG +AGCTCATCAGAACAGTCAGACTCATCAAGCTTCTCTATCAAAGCAGTAAGTAGTACATGTAACGCAACCTATACCAATAG +TAGCAATAGTAGCATTAGTAGTAGCAATAATAATAGCAATAGTTGTGTGGTCCATAGTAATCATAGAATATAGGAAAATA +TTAAGACAAAGAAAAATAGACAGGTTAATTGATAGACTAATAGAAAGAGCAGAAGACAGTGGCAATGAGAGTGAAGGAGA +AATATCAGCACTTGTGGAGATGGGGGTGGAGATGGGGCACCATGCTCCTTGGGATGTTGATGATCTGTAGTGCTACAGAA +AAATTGTGGGTCACAGTCTATTATGGGGTACCTGTGTGGAAGGAAGCAACCACCACTCTATTTTGTGCATCAGATGCTAA +AGCATATGATACAGAGGTACATAATGTTTGGGCCACACATGCCTGTGTACCCACAGACCCCAACCCACAAGAAGTAGTAT +TGGTAAATGTGACAGAAAATTTTAACATGTGGAAAAATGACATGGTAGAACAGATGCATGAGGATATAATCAGTTTATGG +GATCAAAGCCTAAAGCCATGTGTAAAATTAACCCCACTCTGTGTTAGTTTAAAGTGCACTGATTTGAAGAATGATACTAA +TACCAATAGTAGTAGCGGGAGAATGATAATGGAGAAAGGAGAGATAAAAAACTGCTCTTTCAATATCAGCACAAGCATAA +GAGGTAAGGTGCAGAAAGAATATGCATTTTTTTATAAACTTGATATAATACCAATAGATAATGATACTACCAGCTATAAG +TTGACAAGTTGTAACACCTCAGTCATTACACAGGCCTGTCCAAAGGTATCCTTTGAGCCAATTCCCATACATTATTGTGC +CCCGGCTGGTTTTGCGATTCTAAAATGTAATAATAAGACGTTCAATGGAACAGGACCATGTACAAATGTCAGCACAGTAC +AATGTACACATGGAATTAGGCCAGTAGTATCAACTCAACTGCTGTTAAATGGCAGTCTAGCAGAAGAAGAGGTAGTAATT +AGATCTGTCAATTTCACGGACAATGCTAAAACCATAATAGTACAGCTGAACACATCTGTAGAAATTAATTGTACAAGACC +CAACAACAATACAAGAAAAAGAATCCGTATCCAGAGAGGACCAGGGAGAGCATTTGTTACAATAGGAAAAATAGGAAATA +TGAGACAAGCACATTGTAACATTAGTAGAGCAAAATGGAATAACACTTTAAAACAGATAGCTAGCAAATTAAGAGAACAA +TTTGGAAATAATAAAACAATAATCTTTAAGCAATCCTCAGGAGGGGACCCAGAAATTGTAACGCACAGTTTTAATTGTGG 
+AGGGGAATTTTTCTACTGTAATTCAACACAACTGTTTAATAGTACTTGGTTTAATAGTACTTGGAGTACTGAAGGGTCAA +ATAACACTGAAGGAAGTGACACAATCACCCTCCCATGCAGAATAAAACAAATTATAAACATGTGGCAGAAAGTAGGAAAA +GCAATGTATGCCCCTCCCATCAGTGGACAAATTAGATGTTCATCAAATATTACAGGGCTGCTATTAACAAGAGATGGTGG +TAATAGCAACAATGAGTCCGAGATCTTCAGACCTGGAGGAGGAGATATGAGGGACAATTGGAGAAGTGAATTATATAAAT +ATAAAGTAGTAAAAATTGAACCATTAGGAGTAGCACCCACCAAGGCAAAGAGAAGAGTGGTGCAGAGAGAAAAAAGAGCA +GTGGGAATAGGAGCTTTGTTCCTTGGGTTCTTGGGAGCAGCAGGAAGCACTATGGGCGCAGCCTCAATGACGCTGACGGT +ACAGGCCAGACAATTATTGTCTGGTATAGTGCAGCAGCAGAACAATTTGCTGAGGGCTATTGAGGCGCAACAGCATCTGT +TGCAACTCACAGTCTGGGGCATCAAGCAGCTCCAGGCAAGAATCCTGGCTGTGGAAAGATACCTAAAGGATCAACAGCTC +CTGGGGATTTGGGGTTGCTCTGGAAAACTCATTTGCACCACTGCTGTGCCTTGGAATGCTAGTTGGAGTAATAAATCTCT +GGAACAGATTTGGAATCACACGACCTGGATGGAGTGGGACAGAGAAATTAACAATTACACAAGCTTAATACACTCCTTAA +TTGAAGAATCGCAAAACCAGCAAGAAAAGAATGAACAAGAATTATTGGAATTAGATAAATGGGCAAGTTTGTGGAATTGG +TTTAACATAACAAATTGGCTGTGGTATATAAAATTATTCATAATGATAGTAGGAGGCTTGGTAGGTTTAAGAATAGTTTT +TGCTGTACTTTCTATAGTGAATAGAGTTAGGCAGGGATATTCACCATTATCGTTTCAGACCCACCTCCCAACCCCGAGGG +GACCCGACAGGCCCGAAGGAATCGAAGAAGAAGGTGGAGAGAGAGACAGAGACAGATCCATTCGATTAGTGAACGGATCC +TTGGCACTTATCTGGGACGATCTGCGGAGCCTGTGCCTCTTCAGCTACCACCGCTTGAGAGACTTACTCTTGATTGTAAC +GAGGATTGTGGAACTTCTGGGACGCAGGGGGTGGGAAGCCCTCAAATATTGGTGGAATCTCCTACAGTATTGGAGTCAGG +AACTAAAGAATAGTGCTGTTAGCTTGCTCAATGCCACAGCCATAGCAGTAGCTGAGGGGACAGATAGGGTTATAGAAGTA +GTACAAGGAGCTTGTAGAGCTATTCGCCACATACCTAGAAGAATAAGACAGGGCTTGGAAAGGATTTTGCTATAAGATGG +GTGGCAAGTGGTCAAAAAGTAGTGTGATTGGATGGCCTACTGTAAGGGAAAGAATGAGACGAGCTGAGCCAGCAGCAGAT +AGGGTGGGAGCAGCATCTCGAGACCTGGAAAAACATGGAGCAATCACAAGTAGCAATACAGCAGCTACCAATGCTGCTTG +TGCCTGGCTAGAAGCACAAGAGGAGGAGGAGGTGGGTTTTCCAGTCACACCTCAGGTACCTTTAAGACCAATGACTTACA +AGGCAGCTGTAGATCTTAGCCACTTTTTAAAAGAAAAGGGGGGACTGGAAGGGCTAATTCACTCCCAAAGAAGACAAGAT +ATCCTTGATCTGTGGATCTACCACACACAAGGCTACTTCCCTGATTGGCAGAACTACACACCAGGGCCAGGGGTCAGATA +TCCACTGACCTTTGGATGGTGCTACAAGCTAGTACCAGTTGAGCCAGATAAGATAGAAGAGGCCAATAAAGGAGAGAACA 
+CCAGCTTGTTACACCCTGTGAGCCTGCATGGGATGGATGACCCGGAGAGAGAAGTGTTAGAGTGGAGGTTTGACAGCCGC +CTAGCATTTCATCACGTGGCCCGAGAGCTGCATCCGGAGTACTTCAAGAACTGCTGACATCGAGCTTGCTACAAGGGACT +TTCCGCTGGGGACTTTCCAGGGAGGCGTGGCCTGGGCGGGACTGGGGAGTGGCGAGCCCTCAGATCCTGCATATAAGCAG +CTGCTTTTTGCCTGTACTGGGTCTCTCTGGTTAGACCAGATCTGAGCCTGGGAGCTCTCTGGCTAACTAGGGAACCCACT +GCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGAT +CCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTAGCA diff --git a/genome/codfreq/codfreq.gff b/genome/codfreq/codfreq.gff new file mode 100644 index 000000000..b5d2f225b --- /dev/null +++ b/genome/codfreq/codfreq.gff @@ -0,0 +1,25 @@ +##gff-version 3 +# Liftoff v1.6.3 +# /usr/local/bin/liftoff -g codfreq.gff -p 4 -o codfreq.gff -u codfreq.unmapped.txt target.fasta reference.fasta +HXB2_x_ConsensusB Liftoff gene 790 2289 . + . ID=gene-gag;Name=gag;gbkey=Gene;gene=gag;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=False;missing_stop_codon=True;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-gag_0 +HXB2_x_ConsensusB Liftoff CDS 790 2289 . + . ID=cds-gag;Parent=gene-gag;gbkey=CDS;gene=gag;product=gag protein;transl_table=11;extra_copy_number=0 +HXB2_x_ConsensusB Liftoff gene 1186 1878 . + . ID=gene-CA;Name=CA;gbkey=Gene;gene=CA;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=False;missing_start_codon=True;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-CA_0 +HXB2_x_ConsensusB Liftoff CDS 1186 1878 . + . ID=cds-CA;Parent=gene-CA;gbkey=CDS;gene=CA;product=CA protein;transl_table=11;extra_copy_number=0 +HXB2_x_ConsensusB Liftoff gene 2088 5093 . + . ID=gene-pol;Name=pol;gbkey=Gene;gene=pol;partial=true;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=False;missing_start_codon=True;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-pol_0 +HXB2_x_ConsensusB Liftoff CDS 2088 5093 . + . 
ID=cds-pol;Parent=gene-pol;gbkey=CDS;gene=pol;partial=true;product=pol protein;transl_table=11;extra_copy_number=0 +HXB2_x_ConsensusB Liftoff gene 2253 2549 . + . ID=gene-PR;Name=PR;gbkey=Gene;gene=PR;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=False;missing_start_codon=True;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-PR_0 +HXB2_x_ConsensusB Liftoff CDS 2253 2549 . + . ID=cds-PR;Parent=gene-PR;Name=PR;gbkey=CDS;gene=PR;product=protease;transl_table=11;extra_copy_number=0 +HXB2_x_ConsensusB Liftoff gene 2550 4229 . + . ID=gene-RT;Name=RT;gbkey=Gene;gene=RT;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=False;missing_start_codon=True;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-RT_0 +HXB2_x_ConsensusB Liftoff CDS 2550 4229 . + . ID=cds-RT;Parent=gene-RT;Name=RT;gbkey=CDS;gene=RT;product=reverse transcriptase;transl_table=11;extra_copy_number=0 +HXB2_x_ConsensusB Liftoff gene 4230 5093 . + . ID=gene-IN;Name=IN;gbkey=Gene;gene=IN;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=False;missing_start_codon=True;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-IN_0 +HXB2_x_ConsensusB Liftoff CDS 4230 5093 . + . ID=cds-IN;Parent=gene-IN;Name=IN;gbkey=CDS;gene=IN;product=integrase;transl_table=11;extra_copy_number=0 +HXB2_x_ConsensusB Liftoff gene 5041 5616 . + . ID=gene-vif;Name=vif;gbkey=Gene;gene=vif;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=False;missing_stop_codon=True;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-vif_0 +HXB2_x_ConsensusB Liftoff CDS 5041 5616 . + . ID=cds-vif;Parent=gene-vif;gbkey=CDS;gene=vif;product=vif protein;transl_table=11;extra_copy_number=0 +HXB2_x_ConsensusB Liftoff gene 5559 5847 . + . ID=gene-vpr;Name=vpr;gbkey=Gene;gene=vpr;coverage=1.0;sequence_ID=1.0;extra_copy_number=0;copy_num_ID=gene-vpr_0 +HXB2_x_ConsensusB Liftoff gene 5831 8466 . + . 
ID=gene-tat;Name=tat;gbkey=Gene;gene=tat;coverage=1.0;sequence_ID=1.0;extra_copy_number=0;copy_num_ID=gene-tat_0 +HXB2_x_ConsensusB Liftoff gene 5970 8650 . + . ID=gene-rev;Name=rev;gbkey=Gene;gene=rev;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=False;missing_stop_codon=True;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-rev_0 +HXB2_x_ConsensusB Liftoff CDS 5970 8650 . + . ID=cds-rev;Parent=gene-rev;gbkey=CDS;gene=rev;product=rev protein;transl_table=11;extra_copy_number=0 +HXB2_x_ConsensusB Liftoff gene 6062 6307 . + . ID=gene-vpu;Name=vpu;gbkey=Gene;gene=vpu;coverage=1.0;sequence_ID=1.0;extra_copy_number=0;copy_num_ID=gene-vpu_0 +HXB2_x_ConsensusB Liftoff gene 6225 8792 . + . ID=gene-env;Name=env;gbkey=Gene;gene=env;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=False;missing_stop_codon=True;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-env_0 +HXB2_x_ConsensusB Liftoff CDS 6225 8792 . + . ID=cds-env;Parent=gene-env;gbkey=CDS;gene=env;product=envelope glycoprotein;transl_table=11;extra_copy_number=0 +HXB2_x_ConsensusB Liftoff gene 8797 9414 . + . ID=gene-nef;Name=nef;gbkey=Gene;gene=nef;coverage=1.0;sequence_ID=1.0;extra_copy_number=0;copy_num_ID=gene-nef_0
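
Review note: since the README describes `codfreq.gff` as manually generated, a quick coordinate check against `codfreq.fasta` may be worth running before merging. The sketch below is only an illustration, not part of the patch. The helper names `parse_gff3` and `check_features` are hypothetical. It assumes the committed GFF is tab-separated as GFF3 requires, and it takes the reference length as 9,719 bases, inferred from the diff as 121 lines of 80 bases plus a final 39-base line. It flags features outside the sequence and single-interval CDS records whose length is not a multiple of three.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class Feature(NamedTuple):
    seqid: str
    ftype: str
    start: int  # 1-based, inclusive (GFF3 convention)
    end: int    # 1-based, inclusive


def parse_gff3(text: str) -> Iterator[Feature]:
    """Yield features from GFF3 text, skipping comment and blank lines."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 tab-separated columns: {line!r}")
        yield Feature(cols[0], cols[2], int(cols[3]), int(cols[4]))


def check_features(features: Iterable[Feature],
                   seq_lengths: Dict[str, int]) -> List[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    for f in features:
        length = seq_lengths.get(f.seqid)
        if length is None:
            problems.append(f"{f.seqid}: not present in the FASTA")
        elif not (1 <= f.start <= f.end <= length):
            problems.append(f"{f.ftype} {f.start}-{f.end}: outside 1..{length}")
        elif f.ftype == "CDS" and (f.end - f.start + 1) % 3 != 0:
            problems.append(f"CDS {f.start}-{f.end}: length not a multiple of 3")
    return problems


if __name__ == "__main__":
    # Two CDS rows copied from the diff, with attributes shortened for brevity.
    example = (
        "##gff-version 3\n"
        "HXB2_x_ConsensusB\tLiftoff\tCDS\t790\t2289\t.\t+\t.\tID=cds-gag\n"
        "HXB2_x_ConsensusB\tLiftoff\tCDS\t5970\t8650\t.\t+\t.\tID=cds-rev\n"
    )
    # 9719 is an assumption derived from the line counts in the FASTA hunk.
    for problem in check_features(parse_gff3(example), {"HXB2_x_ConsensusB": 9719}):
        print(problem)
    # prints: CDS 5970-8650: length not a multiple of 3
```

The `rev` record is flagged because it is annotated as a single interval, while `rev` is encoded by two exons in HIV-1. This may be acceptable for codon-frequency reporting on `PR`, `RT` and `IN`, but it is worth confirming that downstream annotation tools tolerate it.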