|
| 1 | +# MMSeqs2 |
| 2 | + |
| 3 | +[MMSeqs2](https://github.com/soedinglab/MMseqs2) is a fast sequence search algorithm that we use in place of [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi) or [DIAMOND](https://github.com/bbuchfink/diamond). |
| 4 | + |
| 5 | +MMSeqs2 has a lot of different options, and we have not yet included them all here, but you can find full details about MMSeqs2 in their [detailed manual](https://mmseqs.com/latest/userguide.pdf). |
| 6 | + |
| 7 | +# MMSeqs2 databases |
| 8 | + |
| 9 | +Like many tools, MMSeqs2 has precomputed databases that you can download. |
| 10 | + |
| 11 | +There is a [complete list in the MMSeqs2 wiki](https://github.com/soedinglab/MMseqs2/wiki#downloading-databases). |
| 12 | + |
| 13 | +You can download a database with the `databases` command. For example, to download the [UniRef50](https://www.uniprot.org/help/uniref) database: |
| 14 | + |
| 15 | +```bash |
| 16 | +mkdir -p UniRef50 |
| 17 | +mmseqs databases --threads 8 UniRef50 UniRef50/UniRef50 /tmp |
| 18 | +``` |
| 19 | + |
| 20 | +Some of the databases include taxonomy information, which lets you use `mmseqs easy-taxonomy` to explore the taxonomy of a metagenome. |
| 21 | + |
| 22 | +# Easy Taxonomy |
| 23 | + |
| 24 | +We use `mmseqs easy-taxonomy` a _lot_ for analysing metagenomes, especially by comparing them to the [UniRef50](https://www.uniprot.org/help/uniref) database. |
| 25 | + |
| 26 | +First, `mmseqs easy-taxonomy` _requires_ `fasta` files and does not work with `fastq` files. We have a [fast way to convert fastq to fasta](https://edwards.flinders.edu.au/fastq-to-fasta/) or you can find some tools online. |
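If you only need a quick conversion, a minimal sketch with `awk` is shown below. It assumes every fastq record is exactly four lines (no wrapped sequences); the function name `fastq_to_fasta` and the file names are hypothetical, and this is not the tool linked above.

```bash
# Convert a FASTQ file with four lines per record to FASTA on stdout.
# Assumption: sequences are not wrapped across multiple lines.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }  # header line: replace leading @ with >
         NR % 4 == 2 { print }                    # sequence line: copy unchanged
        ' "$1"
}

# Hypothetical usage:
# fastq_to_fasta sequence.fastq > sequence.fasta
```

The quality lines are simply dropped, which is all that `mmseqs easy-taxonomy` needs.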
| 27 | + |
| 28 | +We also take advantage of the `mmseqs` [sensitivity sweep](https://github.com/soedinglab/MMseqs2/wiki#set-sensitivity--s-parameter) (the `--start-sens`, `--sens-steps`, and `-s` options), but you should weigh [sensitivity against resources](https://github.com/soedinglab/MMseqs2/wiki#optimizing-sensitivity-and-consumption-of-resources) for your own data. There is a lot of discussion on the [MMSeqs2 wiki](https://github.com/soedinglab/MMseqs2/wiki) about setting sensitivity. |
| 29 | + |
| 30 | + |
| 31 | +We typically use this command to run the easy taxonomy: |
| 32 | + |
| 33 | + |
| 34 | +```bash |
| 35 | +mkdir -p easy-taxonomy |
| 36 | +mmseqs easy-taxonomy sequence.fasta UniRef50/UniRef50 easy-taxonomy/sequence_taxonomy /tmp --start-sens 1 --sens-steps 3 -s 7 --threads 32 |
| 37 | +``` |
| 38 | + |
| 39 | +The results will be in a series of files in the `easy-taxonomy` directory, whose names start with `sequence_taxonomy`: |
| 40 | + |
| 41 | +`sequence_taxonomy_lca.tsv.gz`, `sequence_taxonomy_report.gz`, `sequence_taxonomy_tophit_aln.gz`, and `sequence_taxonomy_tophit_report.gz` |
| 42 | + |
| 43 | +- `sequence_taxonomy_lca.tsv.gz`: The lowest common ancestor of the sequences in tab separated text. |
| 44 | + |
| 45 | +Example output: |
| 46 | + |
| 47 | +``` |
| 48 | +R100400180029:20220829140225:V350082744:2:1145432:5:58/2/2 310915 species Pangasianodon hypophthalmus 2 2 1 0.540 |
| 49 | +``` |
| 50 | + |
| 51 | +Columns are: |
| 52 | +1. The sequencing read |
| 53 | +2. The taxonomy ID from [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/datasets/taxonomy/tree). For example, this is [310915](https://www.ncbi.nlm.nih.gov/datasets/taxonomy/310915/) |
| 54 | +3. The taxonomic rank, `species` in this example |
| 55 | +4. The organism name, `Pangasianodon hypophthalmus` |
| 56 | +5. |
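To get a quick overview of this file, the sketch below counts how many reads were assigned to each taxon. It assumes the file is gzip-compressed and tab-separated, with the first four columns as listed above; the function name `count_reads_per_taxon` is hypothetical.

```bash
# Count reads per assigned taxon in an easy-taxonomy LCA file.
# Assumption: tab-separated columns, with taxonomy ID, rank and name in columns 2-4.
count_reads_per_taxon() {
    gzip -cd "$1" |
        awk -F'\t' '{ n[$2 "\t" $3 "\t" $4]++ }
                    END { for (k in n) print n[k] "\t" k }' |
        sort -t "$(printf '\t')" -k1,1nr -k2,2n   # most reads first, ties by taxonomy ID
}

# Hypothetical usage:
# count_reads_per_taxon easy-taxonomy/sequence_taxonomy_lca.tsv.gz | head
```

Each output line is the read count followed by the taxonomy ID, rank, and name.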
| 57 | + |
| 58 | + |
| 59 | +- `sequence_taxonomy_report.gz`: A Kraken2-style output report. |
| 60 | + |
| 61 | +Example output: |
| 62 | + |
| 63 | +``` |
| 64 | +0.8561 9653 9653 species 310915 Pangasianodon hypophthalmus |
| 65 | +``` |
| 66 | + |
| 67 | +- `sequence_taxonomy_tophit_aln.gz`: The top-hit alignments in BLAST `m8` tabular format. |
| 68 | + |
| 69 | +Example output: |
| 70 | + |
| 71 | +``` |
| 72 | +R100400180029:20220829140225:V350082744:2:1145432:5:58/2/2 UniRef50_UPI00147C5152 0.382 163 30 0 0 50 0 163 1.796E-26 108 |
| 73 | +``` |
| 74 | + |
| 75 | +The columns are: |
| 76 | + |
| 77 | +1. Sequence read |
| 78 | +2. Matching database ID. In this case from [UniRef50](https://www.uniprot.org/) we have sequence [UniRef50_UPI00147C5152](https://www.uniprot.org/uniref/UniRef50_UPI00147C5152) |
| 79 | +3. Fraction of identical residues (0.382, i.e. 38.2% identity) |
| 80 | +4. Alignment length (163) |
| 81 | +5. Mismatches (30) |
| 82 | +6. Gap openings (0) |
| 83 | +7. Start on the sequence read (0) |
| 84 | +8. End on the sequence read (50) |
| 85 | +9. Start on the database sequence (0) |
| 86 | +10. End on the database sequence (163) |
| 87 | +11. E-value (1.796E-26) |
| 88 | +12. Bit score (108) |
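Because this is plain tabular text, it is easy to filter. The sketch below keeps only hits at or above a minimum identity and at or below a maximum E-value. It assumes the file is gzip-compressed and tab-separated with the identity in column 3 and the E-value in column 11, as listed above; the function name `filter_hits` and the thresholds in the usage line are hypothetical, not recommended values.

```bash
# Keep alignments with identity >= $2 and E-value <= $3.
# Assumption: 12 tab-separated m8 columns (identity in column 3, E-value in column 11).
filter_hits() {
    gzip -cd "$1" |
        awk -F'\t' -v min_id="$2" -v max_e="$3" \
            '($3 + 0) >= (min_id + 0) && ($11 + 0) <= (max_e + 0)'
}

# Hypothetical usage:
# filter_hits easy-taxonomy/sequence_taxonomy_tophit_aln.gz 0.3 1e-5 > filtered.m8
```

Adding `0` to each value forces `awk` to compare the columns as numbers, so E-values written in scientific notation are handled correctly.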
| 89 | + |
| 90 | + |
| 91 | +- `sequence_taxonomy_tophit_report.gz`: The taxonomy and matches to all of the proteins. |
| 92 | + |
| 93 | +Example output: |
| 94 | + |
| 95 | +``` |
| 96 | +UniRef50_UPI00147C5152 6970 0.312 1849.374 0.367 310915 species Pangasianodon hypophthalmus |
| 97 | +``` |
| 98 | + |
| 99 | +The columns are: |
| 100 | + |
| 101 | +1. Database ID. In this case from the [UniRef50](https://www.uniprot.org/) we have sequence [UniRef50_UPI00147C5152](https://www.uniprot.org/uniref/UniRef50_UPI00147C5152) |
| 102 | +2. Number of sequences aligning to target |
| 103 | +3. Unique coverage of the target (`uniqueAlignedResidues / targetLength`) |
| 104 | +4. Target coverage (`alignedResidues / targetLength`) |
| 105 | +5. Average sequence identity |
| 106 | +6. [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/datasets/taxonomy/tree) identifier |
| 107 | +7. Taxonomic level `species` |
| 108 | +8. Taxonomic name, in this case `Pangasianodon hypophthalmus` |
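As a sketch of how this report can be used, the function below lists the database targets with the most aligned sequences. It assumes the file is gzip-compressed and tab-separated, with the count in column 2 as listed above; the function name `top_targets` is hypothetical.

```bash
# Print the N targets with the most aligned sequences (default N = 10).
# Assumption: tab-separated columns, number of aligned sequences in column 2.
top_targets() {
    gzip -cd "$1" | sort -t "$(printf '\t')" -k2,2nr | head -n "${2:-10}"
}

# Hypothetical usage:
# top_targets easy-taxonomy/sequence_taxonomy_tophit_report.gz 20
```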
| 109 | + |
| 110 | + |
| 111 | + |
| 112 | + |
| 113 | + |
| 114 | + |