From 0aa36e648cabf7089c97b8adebe4c06b75cefb44 Mon Sep 17 00:00:00 2001 From: svarona Date: Tue, 30 Sep 2025 14:51:32 +0200 Subject: [PATCH 1/8] updated documentation --- README.md | 96 +++++++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 89 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index 0229a3157..f325edd70 100644 --- a/README.md +++ b/README.md @@ -12,23 +12,45 @@ Sample information sheet required to test the pipeline containing sample names a #### `samplesheet_test_sra.csv` -Sample information sheet required to test the pipeline containing sample information of one link to a original full FastQ files and two files that must be downloaded from SRA one single-end and one paired-end, respectively. This sample sheet corresponds to Illumina SISPA data. +Sample information sheet required to test the pipeline containing sample information of one link to a original full FastQ files and two files that must be downloaded from SRA one single-end and one paired-end, respectively. + +This sample sheet corresponds to SARS-CoV-2 Illumina SISPA data. #### `samplesheet_test_illumina_sispa.csv` -Sample information sheet required to test the pipeline containing sample information and links to FastQ files stored in this repository. This sample sheet corresponds to Illumina SISPA data. +Sample information sheet required to test the pipeline containing sample information and links to FastQ files stored in this repository. + +This sample sheet corresponds to SARS-CoV-2 Illumina SISPA data. #### `samplesheet_test_illumina_amplicon.csv` -Sample information sheet required to test the pipeline containing sample information and links to FastQ files stored in this repository. This sample sheet corresponds to Illumina amplicon data. +Sample information sheet required to test the pipeline containing sample information and links to FastQ files stored in this repository. 
+ +This sample sheet corresponds to SARS-CoV-2 Illumina amplicon primer enrichment data. #### `samplesheet_full_illumina_sispa.csv` -Sample information sheet required to test the pipeline containing sample information and links to original full FastQ files. This sample sheet corresponds to Illumina SISPA data. +Sample information sheet required to test the pipeline containing sample information and links to original full FastQ files. + +This sample sheet corresponds to SARS-CoV-2 Illumina SISPA data. #### `samplesheet_full_illumina_amplicon.csv` -Sample information sheet required to test the pipeline containing sample information and links to original full FastQ files. This sample sheet corresponds to Illumina amplicon data. +Sample information sheet required to test the pipeline containing sample information and links to original full FastQ files. + +This sample sheet corresponds to SARS-CoV-2 Illumina amplicon primer enrichment data. + +#### `samplesheet_full_illumina_fragmented.csv` + +Sample information sheet required to test the pipeline containing sample information and links to FastQ files stored in this repository. + +This sample sheet corresponds to Crimea Congo data. + +#### `v3.0/samplesheet_test_hiv.csv` + +Sample information sheet required to test the pipeline containing sample information and links to FastQ files stored in this repository. + +This sample sheet corresponds to HIV Illumina amplicon primer enrichment data from different SRA experiments. For test purposes these will be trated as non amplicon data. 
### `genome/` @@ -78,7 +100,27 @@ kraken2-build --db kraken2_hs22 --build - `GCA_014621585.1_ASM1462158v1_genomic..fna.gz`: Monkeypox genome fasta file downloaded directly via [NCBI FTP](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/621/585/GCA_014621585.1_ASM1462158v1/GCA_014621585.1_ASM1462158v1_genomic.fna.gz) - `GCA_014621585.1_ASM1462158v1_genomic..gff.gz`: Monkeypox genome GFF3 annotation file downloaded directly via [NCBI FTP](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/014/621/585/GCA_014621585.1_ASM1462158v1/GCA_014621585.1_ASM1462158v1_genomic.gff.gz) -### `fastq/illumina_sispa/` + +#### `crimea_congo` + +Crimea Congo is a fragmented genome with three fragments. S, M and L based on the fragment size. + +- `crimea_congo.fasta.gz`: Crimea Congo fasta genome containing S, M and L fragments: KY484036.1, KY484035.1, KY484034.1 +- `crimea_congo.gff.gz`: Crimea congo genome GFF3 annotation file containing annotation for S, M and L fragments: KY484036.1, KY484035.1, KY484034.1 + +#### `NC_001802.1` + +This reference was chosen based on [Nextclade's](https://clades.nextstrain.org/dataset) HIV reference which states: + +``` +This data set uses the NCBI reference sequence NC_001802 based on the HXB2 genome K03455. The primary reason for choosing it is to ensure amino acid substitutions in conserved proteins such as Pol are numbered consistently. Note that this sequence as a few problems, including a premature stop-codon in nef. 
+``` + +- `NC_001802.1.fasta`: Human immunodeficiency virus 1 genome fasta file downloaded from [NCBI](https://www.ncbi.nlm.nih.gov/nuccore/NC_001802.1/) +- `NC_001802.1.gff`: Human immunodeficiency virus 1 genome GFF3 annotation file downloaded from [NCBI](https://www.ncbi.nlm.nih.gov/nuccore/NC_001802.1/) + + +### `illumina/sispa/` | file | num_seqs | sum_len | min_len | avg_len | max_len | file_size | Sequencer | LibrarySource | | ----------------------- | -------- | --------- | ------- | ------- | ------- | --------- | ----------- | ------------- | @@ -93,7 +135,7 @@ kraken2-build --db kraken2_hs22 --build > All FastQ files were sub-sampled to 0.02% of the original reads. -### `fastq/illumina_amplicon/` +### `illumina/amplicon` | file | num_seqs | sum_len | min_len | avg_len | max_len | file_size | Sequencer | LibrarySource | | ------------------- | -------- | --------- | ------- | ------- | ------- | --------- | ----------- | ------------- | @@ -104,8 +146,33 @@ kraken2-build --db kraken2_hs22 --build > All FastQ files were sub-sampled to 0.02% of the original reads. 
+### `illumina/hiv/` + +This dasatet was chosen because it is the example data for [HIVdb Drug Resistance Database](https://hivdb.stanford.edu/hivdb/by-reads/): + +- DRR030302: Amplicon Whole Genome sequencing +- SRR4071760: Amplification of protease-RT genes +- SRR6937100: Amplification integrase gene + +| file | num_seqs | sum_len | min_len | avg_len | max_len | file_size | Sequencer | LibrarySource | +| --------------------- | -------- | --------- | ------- | ------- | ------- | --------- | ----------- | ------------- | +| DRR030302_1.fastq.gz | 10,512 | 2,545,102 | 40 | 242 | 251 | 1.3M | PE Illumina | Viral RNA | +| DRR030302_2.fastq.gz | 10,512 | 2,545,205 | 40 | 242 | 251 | 1.8M | PE Illumina | Viral RNA | +| SRR4071760_1.fastq.gz | 10,582 | 2,524,863 | 45 | 238 | 251 | 1M | PE Illumina | Synthetic | +| SRR4071760_2.fastq.gz | 10,582 | 2,525,284 | 45 | 238 | 251 | 1.4M | PE Illumina | Synthetic | +| SRR6937100_1.fastq.gz | 10,484 | 1,295,077 | 35 | 123 | 151 | 556K | PE Illumina | Genomic | +| SRR6937100_2.fastq.gz | 10,484 | 1,289,631 | 33 | 123 | 151 | 612K | PE Illumina | Genomic | + +> Original FastQ files were sub-sampled as explained in [Sampling procedure](#sampling-procedure) + +### `illumina/fragmented/` + +TBD + ## Sampling procedure +### SARS-CoV-2 + Prepare a file `list.txt` with the following SRA accession numbers: ``` @@ -129,6 +196,21 @@ parallel 'seqkit sample -p 0.02 -s 2020 {} | pigz > {.}.fastq.gz' ::: SRR* The above tools are available on bioconda. + +### HIV + +The data was downsampled after Human Genome reads removal using different proportions: + +- DRR030302: 0.025 +- SRR4071760: 0.11 +- SRR6937100: 0.55 + +We used the following commands: + +```bash +seqtk sample -s100 +``` + ## Expected output TBD.
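Note on the `### HIV` sampling step added in PATCH 1/8: the README records only `seqtk sample -s100`, without input files or fractions. The sketch below shows one way the listed proportions could be combined with that seed. It is an illustration, not the commands the authors ran: the `*.filtered.fastq.gz` input names (standing in for the reads left after human read removal), the `gzip` step, and the `RUN` switch are all assumptions. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch only: input file names, the gzip step and the RUN switch are
# assumptions for illustration, not taken from the README.
set -euo pipefail

# Sampling proportion per accession, as listed in the README.
fraction_for() {
  case "$1" in
    DRR030302)  echo 0.025 ;;
    SRR4071760) echo 0.11 ;;
    SRR6937100) echo 0.55 ;;
    *) echo "unknown accession: $1" >&2; return 1 ;;
  esac
}

# Subsample one mate (1 or 2) of one accession.
# The same seed (-s100) is used for both mates so that read pairs stay in sync.
subsample() {
  local acc="$1" mate="$2" frac
  frac="$(fraction_for "$acc")"
  local input="${acc}_${mate}.filtered.fastq.gz"   # hypothetical input name
  local output="${acc}_${mate}.fastq.gz"
  if [ "${RUN:-0}" = "1" ]; then
    seqtk sample -s100 "$input" "$frac" | gzip > "$output"
  else
    # Dry run: show the command instead of executing it.
    echo "seqtk sample -s100 $input $frac | gzip > $output"
  fi
}

for acc in DRR030302 SRR4071760 SRR6937100; do
  for mate in 1 2; do
    subsample "$acc" "$mate"
  done
done
```

Setting `RUN=1` in the environment would execute the commands, which requires `seqtk` on the `PATH` and the input files to exist under the assumed names.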
From 419794301a3f9ed9655a26f0fb48e55f300f80db Mon Sep 17 00:00:00 2001 From: svarona Date: Tue, 4 Nov 2025 15:43:32 +0100 Subject: [PATCH 2/8] added codfreq references --- genome/codfreq/codfreq.fasta | 123 +++++++++++++++++++++++++++++++++++ genome/codfreq/codfreq.gff | 25 +++++++ 2 files changed, 148 insertions(+) create mode 100644 genome/codfreq/codfreq.fasta create mode 100644 genome/codfreq/codfreq.gff diff --git a/genome/codfreq/codfreq.fasta b/genome/codfreq/codfreq.fasta new file mode 100644 index 000000000..0415c368d --- /dev/null +++ b/genome/codfreq/codfreq.fasta @@ -0,0 +1,123 @@ +>HXB2_x_ConsensusB | generated from codfreq using this json file https://github.com/hivdb/codfreq/blob/main/profiles/HIV1.json +TGGAAGGGCTAATTCACTCCCAACGAAGACAAGATATCCTTGATCTGTGGATCTACCACACACAAGGCTACTTCCCTGAT +TGGCAGAACTACACACCAGGGCCAGGGATCAGATATCCACTGACCTTTGGATGGTGCTACAAGCTAGTACCAGTTGAGCC +AGAGAAGTTAGAAGAAGCCAACAAAGGAGAGAACACCAGCTTGTTACACCCTGTGAGCCTGCATGGAATGGATGACCCGG +AGAGAGAAGTGTTAGAGTGGAGGTTTGACAGCCGCCTAGCATTTCATCACATGGCCCGAGAGCTGCATCCGGAGTACTTC +AAGAACTGCTGACATCGAGCTTGCTACAAGGGACTTTCCGCTGGGGACTTTCCAGGGAGGCGTGGCCTGGGCGGGACTGG +GGAGTGGCGAGCCCTCAGATCCTGCATATAAGCAGCTGCTTTTTGCCTGTACTGGGTCTCTCTGGTTAGACCAGATCTGA +GCCTGGGAGCTCTCTGGCTAACTAGGGAACCCACTGCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTTCAAGTAGTGTG +TGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCG +CCCGAACAGGGACCTGAAAGCGAAAGGGAAACCAGAGGAGCTCTCTCGACGCAGGACTCGGCTTGCTGAAGCGCGCACGG +CAAGAGGCGAGGGGCGGCGACTGGTGAGTACGCCAAAAATTTTGACTAGCGGAGGCTAGAAGGAGAGAGATGGGTGCGAG +AGCGTCAGTATTAAGCGGGGGAGAATTAGATAGATGGGAAAAAATTCGGTTAAGGCCAGGGGGAAAGAAAAAATATAAAT +TAAAACATATAGTATGGGCAAGCAGGGAGCTAGAACGATTCGCAGTTAATCCTGGCCTGTTAGAAACATCAGAAGGCTGT +AGACAAATACTGGGACAGCTACAACCATCCCTTCAGACAGGATCAGAAGAACTTAGATCATTATATAATACAGTAGCAAC +CCTCTATTGTGTGCATCAAAGGATAGAGGTAAAAGACACCAAGGAAGCTTTAGAGAAGATAGAGGAAGAGCAAAACAAAA +GTAAGAAAAAAGCACAGCAAGCAGCAGCTGACACAGGAAACAGCAGCCAGGTCAGCCAAAATTACCCTATAGTGCAGAAC 
+CTCCAGGGGCAAATGGTACATCAGGCCATATCACCTAGAACTTTAAATGCATGGGTAAAAGTAGTAGAAGAGAAGGCTTT +CAGCCCAGAAGTAATACCCATGTTTTCAGCATTATCAGAAGGAGCCACCCCACAAGATTTAAACACCATGCTAAACACAG +TGGGGGGACATCAAGCAGCCATGCAAATGTTAAAAGAGACCATCAATGAGGAAGCTGCAGAATGGGATAGATTGCATCCA +GTGCATGCAGGGCCTATTGCACCAGGCCAGATGAGAGAACCAAGGGGAAGTGACATAGCAGGAACTACTAGTACCCTTCA +GGAACAAATAGGATGGATGACAAATAATCCACCTATCCCAGTAGGAGAAATCTATAAAAGATGGATAATCCTGGGATTAA +ATAAAATAGTAAGAATGTATAGCCCTACCAGCATTCTGGACATAAGACAAGGACCAAAGGAACCCTTTAGAGACTATGTA +GACCGGTTCTATAAAACTCTAAGAGCCGAGCAAGCTTCACAGGAGGTAAAAAATTGGATGACAGAAACCTTGTTGGTCCA +AAATGCGAACCCAGATTGTAAGACTATTTTAAAAGCATTGGGACCAGCAGCTACACTAGAAGAAATGATGACAGCATGTC +AGGGAGTGGGAGGACCCGGCCATAAAGCAAGAGTTTTGGCTGAAGCAATGAGCCAAGTAACAAATTCAGCTACCATAATG +ATGCAGAGAGGCAATTTTAGGAACCAAAGAAAGACTGTTAAGTGTTTCAATTGTGGCAAAGAAGGGCACATAGCCAAAAA +TTGCAGGGCCCCTAGGAAAAAGGGCTGTTGGAAATGTGGAAAGGAAGGACACCAAATGAAAGATTGTACTGAGAGACAGG +CTAATTTTTTAGGGAAGATCTGGCCTTCCCACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGACCAGAGCCAACAGCC +CCACCAGAAGAGAGCTTCAGGTTTGGGGAAGAGACAACAACTCCCTCTCAGAAGCAGGAGCCGATAGACAAGGAACTGTA +TCCTTTAGCTTCCCTCAGATCACTCTTTGGCAACGACCCCTCGTCACAATAAAGATAGGGGGGCAACTAAAGGAAGCTCT +ATTAGATACAGGAGCAGATGATACAGTATTAGAAGAAATGAATTTGCCAGGAAGATGGAAACCAAAAATGATAGGGGGAA +TTGGAGGTTTTATCAAAGTAAGACAGTATGATCAGATACTCATAGAAATCTGTGGACATAAAGCTATAGGTACAGTATTA +GTAGGACCTACACCTGTCAACATAATTGGAAGAAATCTGTTGACTCAGATTGGTTGCACTTTAAATTTTCCCATTAGTCC +TATTGAAACTGTACCAGTAAAATTAAAGCCAGGAATGGATGGCCCAAAAGTTAAACAATGGCCATTGACAGAAGAAAAAA +TAAAAGCATTAGTAGAAATTTGTACAGAAATGGAAAAGGAAGGGAAAATTTCAAAAATTGGGCCTGAAAATCCATACAAT +ACTCCAGTATTTGCCATAAAGAAAAAAGACAGTACTAAATGGAGAAAATTAGTAGATTTCAGAGAACTTAATAAGAGAAC +TCAAGACTTCTGGGAAGTTCAATTAGGAATACCACATCCCGCAGGGTTAAAAAAGAAAAAATCAGTAACAGTACTGGATG +TGGGTGATGCATATTTTTCAGTTCCCTTAGATAAAGACTTCAGGAAGTATACTGCATTTACCATACCTAGTATAAACAAT +GAGACACCAGGGATTAGATATCAGTACAATGTGCTTCCACAGGGATGGAAAGGATCACCAGCAATATTCCAAAGTAGCAT +GACAAAAATCTTAGAGCCTTTTAGAAAACAAAATCCAGACATAGTTATCTATCAATACATGGATGATTTGTATGTAGGAT 
+CTGACTTAGAAATAGGGCAGCATAGAACAAAAATAGAGGAACTGAGACAACATCTGTTGAGGTGGGGATTTACCACACCA +GACAAAAAACATCAGAAAGAACCTCCATTCCTTTGGATGGGTTATGAACTCCATCCTGATAAATGGACAGTACAGCCTAT +AGTGCTGCCAGAAAAAGACAGCTGGACTGTCAATGACATACAGAAGTTAGTGGGAAAATTGAATTGGGCAAGTCAGATTT +ATGCAGGGATTAAAGTAAAGCAATTATGTAAACTCCTTAGGGGAACCAAAGCACTAACAGAAGTAATACCACTAACAGAA +GAAGCAGAGCTAGAACTGGCAGAAAACAGGGAGATTCTAAAAGAACCAGTACATGGAGTGTATTATGACCCATCAAAAGA +CTTAATAGCAGAAATACAGAAGCAGGGGCAAGGCCAATGGACATATCAAATTTATCAAGAGCCATTTAAAAATCTGAAAA +CAGGAAAGTATGCAAGAATGAGGGGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAAAAATAGCCACA +GAAAGCATAGTAATATGGGGAAAGACTCCTAAATTTAAACTACCCATACAAAAAGAAACATGGGAAGCATGGTGGACAGA +GTATTGGCAAGCCACCTGGATTCCTGAGTGGGAGTTTGTCAATACCCCTCCCTTAGTGAAATTATGGTACCAGTTAGAGA +AAGAACCCATAGTAGGAGCAGAAACTTTCTATGTAGATGGGGCAGCTAATAGGGAGACTAAATTAGGAAAAGCAGGATAT +GTTACTGACAGAGGAAGACAAAAAGTTGTCTCCCTAACTGACACAACAAATCAGAAGACTGAGTTACAAGCAATTCATCT +AGCTTTGCAGGATTCGGGATTAGAAGTAAACATAGTAACAGACTCACAATATGCATTAGGAATCATTCAAGCACAACCAG +ATAAAAGTGAATCAGAGTTAGTCAGTCAAATAATAGAGCAGTTAATAAAAAAGGAAAAGGTCTACCTGGCATGGGTACCA +GCACACAAAGGAATTGGAGGAAATGAACAAGTAGATAAATTAGTCAGTGCTGGAATCAGGAAAGTACTATTTTTAGATGG +AATAGATAAGGCCCAAGAAGAACATGAGAAATATCACAGTAATTGGAGAGCAATGGCTAGTGATTTTAACCTGCCACCTG +TAGTAGCAAAAGAAATAGTAGCCAGCTGTGATAAATGTCAGCTAAAAGGAGAAGCCATGCATGGACAAGTAGACTGTAGT +CCAGGAATATGGCAACTAGATTGTACACATTTAGAAGGAAAAATTATCCTGGTAGCAGTTCATGTAGCCAGTGGATATAT +AGAAGCAGAAGTTATTCCAGCAGAGACAGGGCAGGAAACAGCATACTTTCTCTTAAAATTAGCAGGAAGATGGCCAGTAA +AAACAATACATACAGACAATGGCAGCAATTTCACCAGTACTACGGTTAAGGCCGCCTGTTGGTGGGCAGGGATCAAGCAG +GAATTTGGCATTCCCTACAATCCCCAAAGTCAAGGAGTAGTAGAATCTATGAATAAAGAATTAAAGAAAATTATAGGACA +GGTAAGAGATCAGGCTGAACATCTTAAGACAGCAGTACAAATGGCAGTATTCATCCACAATTTTAAAAGAAAAGGGGGGA +TTGGGGGGTACAGTGCAGGGGAAAGAATAGTAGACATAATAGCAACAGACATACAAACTAAAGAATTACAAAAACAAATT +ACAAAAATTCAAAATTTTCGGGTTTATTACAGGGACAGCAGAGATCCACTTTGGAAAGGACCAGCAAAGCTTCTCTGGAA +AGGTGAAGGGGCAGTAGTAATACAAGATAATAGTGACATAAAAGTAGTGCCAAGAAGAAAAGCAAAGATCATTAGGGATT 
+ATGGAAAACAGATGGCAGGTGATGATTGTGTGGCAAGTAGACAGGATGAGGATTAGAACATGGAAAAGTTTAGTAAAACA +CCATATGTATGTTTCAGGGAAAGCTAGGGGATGGTTTTATAGACATCACTATGAAAGCCCTCATCCAAGAATAAGTTCAG +AAGTACACATCCCACTAGGGGATGCTAGATTGGTAATAACAACATATTGGGGTCTGCATACAGGAGAAAGAGACTGGCAT +TTGGGTCAGGGAGTCTCCATAGAATGGAGGAAAAAGAGATATAGCACACAAGTAGACCCTGAACTAGCAGACCAACTAAT +TCATCTGTATTACTTTGACTGTTTTTCAGACTCTGCTATAAGAAAGGCCTTATTAGGACACATAGTTAGCCCTAGGTGTG +AATATCAAGCAGGACATAACAAGGTAGGATCTCTACAATACTTGGCACTAGCAGCATTAATAACACCAAAAAAGATAAAG +CCACCTTTGCCTAGTGTTACGAAACTGACAGAGGATAGATGGAACAAGCCCCAGAAGACCAAGGGCCACAGAGGGAGCCA +CACAATGAATGGACACTAGAGCTTTTAGAGGAGCTTAAGAATGAAGCTGTTAGACATTTTCCTAGGATTTGGCTCCATGG +CTTAGGGCAACATATCTATGAAACTTATGGGGATACTTGGGCAGGAGTGGAAGCCATAATAAGAATTCTGCAACAACTGC +TGTTTATCCATTTTCAGAATTGGGTGTCGACATAGCAGAATAGGCGTTACTCGACAGAGGAGAGCAAGAAATGGAGCCAG +TAGATCCTAGACTAGAGCCCTGGAAGCATCCAGGAAGTCAGCCTAAAACTGCTTGTACCAATTGCTATTGTAAAAAGTGT +TGCTTTCATTGCCAAGTTTGTTTCATAACAAAAGCCTTAGGCATCTCCTATGGCAGGAAGAAGCGGAGACAGCGACGAAG +AGCTCATCAGAACAGTCAGACTCATCAAGCTTCTCTATCAAAGCAGTAAGTAGTACATGTAACGCAACCTATACCAATAG +TAGCAATAGTAGCATTAGTAGTAGCAATAATAATAGCAATAGTTGTGTGGTCCATAGTAATCATAGAATATAGGAAAATA +TTAAGACAAAGAAAAATAGACAGGTTAATTGATAGACTAATAGAAAGAGCAGAAGACAGTGGCAATGAGAGTGAAGGAGA +AATATCAGCACTTGTGGAGATGGGGGTGGAGATGGGGCACCATGCTCCTTGGGATGTTGATGATCTGTAGTGCTACAGAA +AAATTGTGGGTCACAGTCTATTATGGGGTACCTGTGTGGAAGGAAGCAACCACCACTCTATTTTGTGCATCAGATGCTAA +AGCATATGATACAGAGGTACATAATGTTTGGGCCACACATGCCTGTGTACCCACAGACCCCAACCCACAAGAAGTAGTAT +TGGTAAATGTGACAGAAAATTTTAACATGTGGAAAAATGACATGGTAGAACAGATGCATGAGGATATAATCAGTTTATGG +GATCAAAGCCTAAAGCCATGTGTAAAATTAACCCCACTCTGTGTTAGTTTAAAGTGCACTGATTTGAAGAATGATACTAA +TACCAATAGTAGTAGCGGGAGAATGATAATGGAGAAAGGAGAGATAAAAAACTGCTCTTTCAATATCAGCACAAGCATAA +GAGGTAAGGTGCAGAAAGAATATGCATTTTTTTATAAACTTGATATAATACCAATAGATAATGATACTACCAGCTATAAG +TTGACAAGTTGTAACACCTCAGTCATTACACAGGCCTGTCCAAAGGTATCCTTTGAGCCAATTCCCATACATTATTGTGC +CCCGGCTGGTTTTGCGATTCTAAAATGTAATAATAAGACGTTCAATGGAACAGGACCATGTACAAATGTCAGCACAGTAC 
+AATGTACACATGGAATTAGGCCAGTAGTATCAACTCAACTGCTGTTAAATGGCAGTCTAGCAGAAGAAGAGGTAGTAATT +AGATCTGTCAATTTCACGGACAATGCTAAAACCATAATAGTACAGCTGAACACATCTGTAGAAATTAATTGTACAAGACC +CAACAACAATACAAGAAAAAGAATCCGTATCCAGAGAGGACCAGGGAGAGCATTTGTTACAATAGGAAAAATAGGAAATA +TGAGACAAGCACATTGTAACATTAGTAGAGCAAAATGGAATAACACTTTAAAACAGATAGCTAGCAAATTAAGAGAACAA +TTTGGAAATAATAAAACAATAATCTTTAAGCAATCCTCAGGAGGGGACCCAGAAATTGTAACGCACAGTTTTAATTGTGG +AGGGGAATTTTTCTACTGTAATTCAACACAACTGTTTAATAGTACTTGGTTTAATAGTACTTGGAGTACTGAAGGGTCAA +ATAACACTGAAGGAAGTGACACAATCACCCTCCCATGCAGAATAAAACAAATTATAAACATGTGGCAGAAAGTAGGAAAA +GCAATGTATGCCCCTCCCATCAGTGGACAAATTAGATGTTCATCAAATATTACAGGGCTGCTATTAACAAGAGATGGTGG +TAATAGCAACAATGAGTCCGAGATCTTCAGACCTGGAGGAGGAGATATGAGGGACAATTGGAGAAGTGAATTATATAAAT +ATAAAGTAGTAAAAATTGAACCATTAGGAGTAGCACCCACCAAGGCAAAGAGAAGAGTGGTGCAGAGAGAAAAAAGAGCA +GTGGGAATAGGAGCTTTGTTCCTTGGGTTCTTGGGAGCAGCAGGAAGCACTATGGGCGCAGCCTCAATGACGCTGACGGT +ACAGGCCAGACAATTATTGTCTGGTATAGTGCAGCAGCAGAACAATTTGCTGAGGGCTATTGAGGCGCAACAGCATCTGT +TGCAACTCACAGTCTGGGGCATCAAGCAGCTCCAGGCAAGAATCCTGGCTGTGGAAAGATACCTAAAGGATCAACAGCTC +CTGGGGATTTGGGGTTGCTCTGGAAAACTCATTTGCACCACTGCTGTGCCTTGGAATGCTAGTTGGAGTAATAAATCTCT +GGAACAGATTTGGAATCACACGACCTGGATGGAGTGGGACAGAGAAATTAACAATTACACAAGCTTAATACACTCCTTAA +TTGAAGAATCGCAAAACCAGCAAGAAAAGAATGAACAAGAATTATTGGAATTAGATAAATGGGCAAGTTTGTGGAATTGG +TTTAACATAACAAATTGGCTGTGGTATATAAAATTATTCATAATGATAGTAGGAGGCTTGGTAGGTTTAAGAATAGTTTT +TGCTGTACTTTCTATAGTGAATAGAGTTAGGCAGGGATATTCACCATTATCGTTTCAGACCCACCTCCCAACCCCGAGGG +GACCCGACAGGCCCGAAGGAATCGAAGAAGAAGGTGGAGAGAGAGACAGAGACAGATCCATTCGATTAGTGAACGGATCC +TTGGCACTTATCTGGGACGATCTGCGGAGCCTGTGCCTCTTCAGCTACCACCGCTTGAGAGACTTACTCTTGATTGTAAC +GAGGATTGTGGAACTTCTGGGACGCAGGGGGTGGGAAGCCCTCAAATATTGGTGGAATCTCCTACAGTATTGGAGTCAGG +AACTAAAGAATAGTGCTGTTAGCTTGCTCAATGCCACAGCCATAGCAGTAGCTGAGGGGACAGATAGGGTTATAGAAGTA +GTACAAGGAGCTTGTAGAGCTATTCGCCACATACCTAGAAGAATAAGACAGGGCTTGGAAAGGATTTTGCTATAAGATGG +GTGGCAAGTGGTCAAAAAGTAGTGTGATTGGATGGCCTACTGTAAGGGAAAGAATGAGACGAGCTGAGCCAGCAGCAGAT 
+AGGGTGGGAGCAGCATCTCGAGACCTGGAAAAACATGGAGCAATCACAAGTAGCAATACAGCAGCTACCAATGCTGCTTG +TGCCTGGCTAGAAGCACAAGAGGAGGAGGAGGTGGGTTTTCCAGTCACACCTCAGGTACCTTTAAGACCAATGACTTACA +AGGCAGCTGTAGATCTTAGCCACTTTTTAAAAGAAAAGGGGGGACTGGAAGGGCTAATTCACTCCCAAAGAAGACAAGAT +ATCCTTGATCTGTGGATCTACCACACACAAGGCTACTTCCCTGATTGGCAGAACTACACACCAGGGCCAGGGGTCAGATA +TCCACTGACCTTTGGATGGTGCTACAAGCTAGTACCAGTTGAGCCAGATAAGATAGAAGAGGCCAATAAAGGAGAGAACA +CCAGCTTGTTACACCCTGTGAGCCTGCATGGGATGGATGACCCGGAGAGAGAAGTGTTAGAGTGGAGGTTTGACAGCCGC +CTAGCATTTCATCACGTGGCCCGAGAGCTGCATCCGGAGTACTTCAAGAACTGCTGACATCGAGCTTGCTACAAGGGACT +TTCCGCTGGGGACTTTCCAGGGAGGCGTGGCCTGGGCGGGACTGGGGAGTGGCGAGCCCTCAGATCCTGCATATAAGCAG +CTGCTTTTTGCCTGTACTGGGTCTCTCTGGTTAGACCAGATCTGAGCCTGGGAGCTCTCTGGCTAACTAGGGAACCCACT +GCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGAT +CCCTCAGACCCTTTTAGTCAGTGTGGAAAATCTCTAGCA diff --git a/genome/codfreq/codfreq.gff b/genome/codfreq/codfreq.gff new file mode 100644 index 000000000..b5d2f225b --- /dev/null +++ b/genome/codfreq/codfreq.gff @@ -0,0 +1,25 @@ +##gff-version 3 +# Liftoff v1.6.3 +# /usr/local/bin/liftoff -g codfreq.gff -p 4 -o codfreq.gff -u codfreq.unmapped.txt target.fasta reference.fasta +HXB2_x_ConsensusB Liftoff gene 790 2289 . + . ID=gene-gag;Name=gag;gbkey=Gene;gene=gag;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=False;missing_stop_codon=True;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-gag_0 +HXB2_x_ConsensusB Liftoff CDS 790 2289 . + . ID=cds-gag;Parent=gene-gag;gbkey=CDS;gene=gag;product=gag protein;transl_table=11;extra_copy_number=0 +HXB2_x_ConsensusB Liftoff gene 1186 1878 . + . ID=gene-CA;Name=CA;gbkey=Gene;gene=CA;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=False;missing_start_codon=True;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-CA_0 +HXB2_x_ConsensusB Liftoff CDS 1186 1878 . + . 
ID=cds-CA;Parent=gene-CA;gbkey=CDS;gene=CA;product=CA protein;transl_table=11;extra_copy_number=0 +HXB2_x_ConsensusB Liftoff gene 2088 5093 . + . ID=gene-pol;Name=pol;gbkey=Gene;gene=pol;partial=true;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=False;missing_start_codon=True;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-pol_0 +HXB2_x_ConsensusB Liftoff CDS 2088 5093 . + . ID=cds-pol;Parent=gene-pol;gbkey=CDS;gene=pol;partial=true;product=pol protein;transl_table=11;extra_copy_number=0 +HXB2_x_ConsensusB Liftoff gene 2253 2549 . + . ID=gene-PR;Name=PR;gbkey=Gene;gene=PR;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=False;missing_start_codon=True;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-PR_0 +HXB2_x_ConsensusB Liftoff CDS 2253 2549 . + . ID=cds-PR;Parent=gene-PR;Name=PR;gbkey=CDS;gene=PR;product=protease;transl_table=11;extra_copy_number=0 +HXB2_x_ConsensusB Liftoff gene 2550 4229 . + . ID=gene-RT;Name=RT;gbkey=Gene;gene=RT;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=False;missing_start_codon=True;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-RT_0 +HXB2_x_ConsensusB Liftoff CDS 2550 4229 . + . ID=cds-RT;Parent=gene-RT;Name=RT;gbkey=CDS;gene=RT;product=reverse transcriptase;transl_table=11;extra_copy_number=0 +HXB2_x_ConsensusB Liftoff gene 4230 5093 . + . ID=gene-IN;Name=IN;gbkey=Gene;gene=IN;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=False;missing_start_codon=True;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-IN_0 +HXB2_x_ConsensusB Liftoff CDS 4230 5093 . + . ID=cds-IN;Parent=gene-IN;Name=IN;gbkey=CDS;gene=IN;product=integrase;transl_table=11;extra_copy_number=0 +HXB2_x_ConsensusB Liftoff gene 5041 5616 . + . ID=gene-vif;Name=vif;gbkey=Gene;gene=vif;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=False;missing_stop_codon=True;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-vif_0 +HXB2_x_ConsensusB Liftoff CDS 5041 5616 . + . 
ID=cds-vif;Parent=gene-vif;gbkey=CDS;gene=vif;product=vif protein;transl_table=11;extra_copy_number=0 +HXB2_x_ConsensusB Liftoff gene 5559 5847 . + . ID=gene-vpr;Name=vpr;gbkey=Gene;gene=vpr;coverage=1.0;sequence_ID=1.0;extra_copy_number=0;copy_num_ID=gene-vpr_0 +HXB2_x_ConsensusB Liftoff gene 5831 8466 . + . ID=gene-tat;Name=tat;gbkey=Gene;gene=tat;coverage=1.0;sequence_ID=1.0;extra_copy_number=0;copy_num_ID=gene-tat_0 +HXB2_x_ConsensusB Liftoff gene 5970 8650 . + . ID=gene-rev;Name=rev;gbkey=Gene;gene=rev;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=False;missing_stop_codon=True;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-rev_0 +HXB2_x_ConsensusB Liftoff CDS 5970 8650 . + . ID=cds-rev;Parent=gene-rev;gbkey=CDS;gene=rev;product=rev protein;transl_table=11;extra_copy_number=0 +HXB2_x_ConsensusB Liftoff gene 6062 6307 . + . ID=gene-vpu;Name=vpu;gbkey=Gene;gene=vpu;coverage=1.0;sequence_ID=1.0;extra_copy_number=0;copy_num_ID=gene-vpu_0 +HXB2_x_ConsensusB Liftoff gene 6225 8792 . + . ID=gene-env;Name=env;gbkey=Gene;gene=env;coverage=1.0;sequence_ID=1.0;matches_ref_protein=True;valid_ORF=False;missing_stop_codon=True;valid_ORFs=0;extra_copy_number=0;copy_num_ID=gene-env_0 +HXB2_x_ConsensusB Liftoff CDS 6225 8792 . + . ID=cds-env;Parent=gene-env;gbkey=CDS;gene=env;product=envelope glycoprotein;transl_table=11;extra_copy_number=0 +HXB2_x_ConsensusB Liftoff gene 8797 9414 . + . 
ID=gene-nef;Name=nef;gbkey=Gene;gene=nef;coverage=1.0;sequence_ID=1.0;extra_copy_number=0;copy_num_ID=gene-nef_0 From 541e0b1f61b6583bf8573e8914ddc1376158891f Mon Sep 17 00:00:00 2001 From: svarona Date: Tue, 4 Nov 2025 15:43:40 +0100 Subject: [PATCH 3/8] added codfreq documentation --- README.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/README.md b/README.md index f325edd70..b0ca4b6a1 100644 --- a/README.md +++ b/README.md @@ -119,6 +119,14 @@ This data set uses the NCBI reference sequence NC_001802 based on the HXB2 genom - `NC_001802.1.fasta`: Human immunodeficiency virus 1 genome fasta file downloaded from [NCBI](https://www.ncbi.nlm.nih.gov/nuccore/NC_001802.1/) - `NC_001802.1.gff`: Human immunodeficiency virus 1 genome GFF3 annotation file downloaded from [NCBI](https://www.ncbi.nlm.nih.gov/nuccore/NC_001802.1/) +#### `codfreq` + +This reference was generated using the [HIV JSON profile](https://github.com/hivdb/codfreq/blob/main/profiles/HIV1.json) from codfreq software. + +- `codfreq.fasta`: Was generated from the `"refSequence"` key of the .json file. +- `codfreq.gff`: Was manually generated using the information from `"fragmentName"` and `"refRanges"` from `"fragmentConfig"`. + +This is the default reference used in the nf-core/viralrecon HIV resistance detection protocol for the resulting codon frequencies and codon coverages to be directly comparable to those produced by [**HIVdb**](https://hivdb.stanford.edu/hivdb/by-reads/), ensuring accurate interpretation of resistance data.
### `illumina/sispa/` From 7d61db7def14095febce59e5eacd63c3cba6dabf Mon Sep 17 00:00:00 2001 From: Sarai Varona Date: Wed, 5 Nov 2025 09:57:08 +0100 Subject: [PATCH 4/8] fixed typo Co-authored-by: Lili Andersson-Li <64467552+LilyAnderssonLee@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index b0ca4b6a1..e72ee9b72 100644 --- a/README.md +++ b/README.md @@ -113,7 +113,7 @@ Crimea Congo is a fragmented genome with three fragments. S, M and L based on th This reference was chosen based on [Nextclade's](https://clades.nextstrain.org/dataset) HIV reference which states: ``` -This data set uses the NCBI reference sequence NC_001802 based on the HXB2 genome K03455. The primary reason for choosing it is to ensure amino acid substitutions in conserved proteins such as Pol are numbered consistently. Note that this sequence as a few problems, including a premature stop-codon in nef. +This data set uses the NCBI reference sequence NC_001802 based on the HXB2 genome K03455. The primary reason for choosing it is to ensure amino acid substitutions in conserved proteins such as Pol are numbered consistently. Note that this sequence has a few problems, including a premature stop-codon in nef. 
``` - `NC_001802.1.fasta`: Human immunodeficiency virus 1 genome fasta file downloaded from [NCBI](https://www.ncbi.nlm.nih.gov/nuccore/NC_001802.1/) From a5bc6af8c6bf6d0a67a084a45543b58908c3012b Mon Sep 17 00:00:00 2001 From: Sarai Varona Date: Wed, 5 Nov 2025 09:59:36 +0100 Subject: [PATCH 5/8] fixed typo Co-authored-by: Lili Andersson-Li <64467552+LilyAnderssonLee@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index e72ee9b72..fd24fb5e9 100644 --- a/README.md +++ b/README.md @@ -156,7 +156,7 @@ This is the default reference used in the nf-core/viralrecon HIV resistance dete ### `illumina/hiv/` -This dasatet was chosen because it is the example data for [HIVdb Drug Resistance Database](https://hivdb.stanford.edu/hivdb/by-reads/): +This dataset was chosen because it is the example data for [HIVdb Drug Resistance Database](https://hivdb.stanford.edu/hivdb/by-reads/): - DRR030302: Amplicon Whole Genome sequencing - SRR4071760: Amplification of protease-RT genes From e04e77e9ba4b44ebbd6c755683a61b7b92e47865 Mon Sep 17 00:00:00 2001 From: Sarai Varona Date: Wed, 5 Nov 2025 09:59:58 +0100 Subject: [PATCH 6/8] fixed description Co-authored-by: Lili Andersson-Li <64467552+LilyAnderssonLee@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index fd24fb5e9..9a48c8c00 100644 --- a/README.md +++ b/README.md @@ -160,7 +160,7 @@ This dataset was chosen because it is the example data for [HIVdb Drug Resistanc - DRR030302: Amplicon Whole Genome sequencing - SRR4071760: Amplification of protease-RT genes -- SRR6937100: Amplification integrase gene +- SRR6937100: Amplification of integrase genes | file | num_seqs | sum_len | min_len | avg_len | max_len | file_size | Sequencer | LibrarySource | | --------------------- | -------- | --------- | ------- | ------- | ------- | --------- | ----------- | ------------- | From 
da401c103787422bcf46ce26ed7334d65ba57a40 Mon Sep 17 00:00:00 2001 From: svarona Date: Wed, 5 Nov 2025 09:59:11 +0100 Subject: [PATCH 7/8] updated samplesheet information --- README.md | 45 +++++++++++++++++++++++---------------------- 1 file changed, 23 insertions(+), 22 deletions(-) diff --git a/README.md b/README.md index 9a48c8c00..c71045157 100644 --- a/README.md +++ b/README.md @@ -6,51 +6,52 @@ This branch contains test data to be used for automated testing with the [nf-cor ### `samplesheet/` +This directory contains the sample sheets used to test different test configurations of the `nf-core/viralrecon` pipeline. + #### `samplesheet_test_nanopore.csv` -Sample information sheet required to test the pipeline containing sample names and barcodes for MinION data hosted in this repository. For testing purposes, some barcodes have been appended to this samplesheet that may not necessarily have associated data. The raw data associated with this run can be found in [`nanopore/minion`](nanopore/minion), and has been sub-setted to include a maximum of 3 `fast5`/`fastq`files per barcode. +Sample sheet for Nanopore test data. +Includes SARS-CoV-2 sample names and MinION barcodes hosted in this repository. +For testing purposes, some barcodes are included without associated data. +Raw data can be found in [`nanopore/minion`](nanopore/minion) and have been subsetted to include a maximum of three `fast5` or `fastq` files per barcode. #### `samplesheet_test_sra.csv` -Sample information sheet required to test the pipeline containing sample information of one link to a original full FastQ files and two files that must be downloaded from SRA one single-end and one paired-end, respectively. - -This sample sheet corresponds to SARS-CoV-2 Illumina SISPA data. +Sample sheet for SISPA-based Illumina data. 
+Contains SARS-CoV-2 SISPA probe enriched sample information, including one link to original full FastQ files and two datasets to be downloaded from SRA (one single-end and one paired-end). +Used to test SRA-based inputs and mixed dataset handling. #### `samplesheet_test_illumina_sispa.csv` -Sample information sheet required to test the pipeline containing sample information and links to FastQ files stored in this repository. - -This sample sheet corresponds to SARS-CoV-2 Illumina SISPA data. +Sample sheet for small-scale SISPA test data. +Contains SARS-CoV-2 SISPA probe enriched sample information and links to corresponding FastQ files. #### `samplesheet_test_illumina_amplicon.csv` -Sample information sheet required to test the pipeline containing sample information and links to FastQ files stored in this repository. - -This sample sheet corresponds to SARS-CoV-2 Illumina amplicon primer enrichment data. +Sample sheet for small-scale amplicon test data. +Includes SARS-CoV-2 amplicon primer enriched sample information and links to FastQ files hosted in this repository. #### `samplesheet_full_illumina_sispa.csv` -Sample information sheet required to test the pipeline containing sample information and links to original full FastQ files. - -This sample sheet corresponds to SARS-CoV-2 Illumina SISPA data. +Sample sheet for full-scale SISPA test data. +Contains SARS-CoV-2 SISPA probe enriched sample information and links to corresponding FastQ files. #### `samplesheet_full_illumina_amplicon.csv` -Sample information sheet required to test the pipeline containing sample information and links to original full FastQ files. - -This sample sheet corresponds to SARS-CoV-2 Illumina amplicon primer enrichment data. +Sample sheet for full-scale amplicon test data. +Includes SARS-CoV-2 amplicon primer enriched sample information and links to FastQ files hosted in this repository.
#### `samplesheet_full_illumina_fragmented.csv` -Sample information sheet required to test the pipeline containing sample information and links to FastQ files stored in this repository. - -This sample sheet corresponds to Crimea Congo data. +Sample sheet for fragmented genome tests. +Contains Crimean-Congo hemorrhagic fever virus sample information and links to FastQ files stored in this repository. +Used to assess pipeline performance on non-contiguous viral genomes. #### `v3.0/samplesheet_test_hiv.csv` -Sample information sheet required to test the pipeline containing sample information and links to FastQ files stored in this repository. - -This sample sheet corresponds to HIV Illumina amplicon primer enrichment data from different SRA experiments. For test purposes these will be trated as non amplicon data. +Sample sheet for HIV test data. +Contains HIV Illumina amplicon primer enriched sample information and links to FastQ files stored in this repository. +Data originate from multiple SRA experiments and are treated as non-amplicon data for testing purposes. ### `genome/` From 212400da50e0b93764c3dd81116bc3db57944a22 Mon Sep 17 00:00:00 2001 From: svarona Date: Wed, 5 Nov 2025 10:01:39 +0100 Subject: [PATCH 8/8] added codfreq github link --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index c71045157..71ec59529 100644 --- a/README.md +++ b/README.md @@ -122,7 +122,7 @@ This data set uses the NCBI reference sequence NC_001802 based on the HXB2 genom #### `codfreq` -This reference was generated using the [HIV JSON profile](https://github.com/hivdb/codfreq/blob/main/profiles/HIV1.json) from codfreq software. +This reference was generated using the [HIV JSON profile](https://github.com/hivdb/codfreq/blob/main/profiles/HIV1.json) from [codfreq](https://github.com/hivdb/codfreq) software. - `codfreq.fasta`: Was generated from the `"refSequence"` key of the .json file. 
- `codfreq.gff`: Was manually generated using the information from `"fragmentName"` and `"refRanges"` from `"fragmentConfig"`.
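
Note on the codfreq GFF description: the README says the annotation was assembled by hand from fragment names and reference ranges. As a rough illustration of that idea, the sketch below turns a small name/start/end table into GFF3 gene features. The three ranges (PR, RT, IN) and the sequence ID are copied from the GFF added in PATCH 2/8 rather than parsed from the JSON profile, whose exact layout is not shown in this series. The `manual` source column and the attribute format are assumptions, so the output is not expected to match the committed file byte for byte.

```shell
#!/usr/bin/env bash
# Sketch only: the "manual" source label and attribute layout are assumptions.
set -euo pipefail

SEQID="HXB2_x_ConsensusB"   # sequence ID used in codfreq.fasta

# name<TAB>start<TAB>end (1-based, inclusive), copied from the GFF in PATCH 2/8.
ranges() {
  printf '%s\t%s\t%s\n' PR 2253 2549 RT 2550 4229 IN 4230 5093
}

# Read name/start/end rows on stdin and emit one GFF3 gene feature per row.
# GFF3 rows have nine tab-separated columns:
# seqid, source, type, start, end, score, strand, phase, attributes.
ranges_to_gff() {
  echo "##gff-version 3"
  awk -F'\t' -v OFS='\t' -v seqid="$SEQID" '{
    print seqid, "manual", "gene", $2, $3, ".", "+", ".", "ID=gene-" $1 ";Name=" $1
  }'
}

ranges | ranges_to_gff
```

Keeping the ranges in a plain table makes it easy to compare them against the `"refRanges"` values in the profile before regenerating the annotation.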