Commit 6400ca0

committed
adding mmseqs2
1 parent 8dbbf0d commit 6400ca0

File tree

2 files changed

+134
-1
lines changed

MMSeqs2/README.md

Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
1+
# MMSeqs2
2+
3+
[MMSeqs2](https://github.com/soedinglab/MMseqs2) is a fast sequence searching algorithm that we use in place of [blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi) or [diamond](https://github.com/bbuchfink/diamond).
4+
5+
MMSeqs2 has a lot of different options, and we have not yet included them all here, but you can find full details about MMSeqs2 in their [detailed manual](https://mmseqs.com/latest/userguide.pdf).
6+
7+
# MMSeqs2 databases
8+
9+
Like many tools, MMSeqs2 has precomputed databases that you can download.
10+
11+
There is a [complete list on their website](https://github.com/soedinglab/MMseqs2/wiki#downloading-databases).
12+
13+
You can download a database with the `databases` command. For example, to download the [UniRef50](https://www.uniprot.org/help/uniref) database:
14+
15+
```bash
16+
mkdir -p UniRef50
17+
mmseqs databases --threads 8 UniRef50 UniRef50/UniRef50 /tmp
18+
```
19+
20+
Some of the databases have taxonomy included with them, and that enables you to use `mmseqs easy-taxonomy` to explore the metagenome.
21+
22+
# Easy Taxonomy
23+
24+
We use the MMSeqs2 easy taxonomy a _lot_ for analysing metagenomes, especially by comparing to the [UniRef50](https://www.uniprot.org/help/uniref) database.
25+
26+
First, `mmseqs easy-taxonomy` _requires_ `fasta` files and does not work with `fastq` files. We have a [fast way to convert fastq to fasta](https://edwards.flinders.edu.au/fastq-to-fasta/) or you can find some tools online.
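
If you only need a quick conversion, the following is a minimal sketch using `awk`. It is not the tool linked above, and it assumes every fastq record is exactly four lines (no wrapped sequences). The function name `fastq_to_fasta` and the file names in the usage comment are ours, chosen for illustration.

```bash
# Convert fastq on stdin to fasta on stdout.
# Assumption: each record is exactly four lines (header, sequence, "+", qualities).
# Line 1 of each record: replace the leading "@" with ">".
# Line 2 of each record: print the sequence unchanged.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }
         NR % 4 == 2 { print }'
}

# Hypothetical usage:
# gzip -dc sequence.fastq.gz | fastq_to_fasta > sequence.fasta
```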
27+
28+
We also take advantage of `mmseqs` [sensitivity sweep](https://github.com/soedinglab/MMseqs2/wiki#set-sensitivity--s-parameter) but you should consider comparing [sensitivity and resources](https://github.com/soedinglab/MMseqs2/wiki#optimizing-sensitivity-and-consumption-of-resources). There is a lot of discussion on the [MMSeqs2 wiki](https://github.com/soedinglab/MMseqs2/wiki) about setting sensitivity.
29+
30+
31+
We typically use this command to run the easy taxonomy:
32+
33+
34+
```bash
35+
mkdir easy-taxonomy
36+
mmseqs easy-taxonomy sequence.fasta UniRef50/UniRef50 easy-taxonomy/sequence_taxonomy /tmp --start-sens 1 --sens-steps 3 -s 7 --threads 32
37+
```
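
As we understand the MMSeqs2 documentation, `--start-sens`, `-s`, and `--sens-steps` define a sweep of evenly spaced sensitivities, and only the sequences that found no match at one step are searched again at the next. The sketch below only reproduces that arithmetic so you can see which sensitivities a given combination would visit. The function name `sens_sweep` is ours and is not part of MMSeqs2.

```bash
# Print the sensitivities visited by a sweep with evenly spaced steps.
# Arguments: start sensitivity, final sensitivity (-s), number of steps.
# Assumption: the steps are spaced linearly between start and final value.
sens_sweep() {
    awk -v start="$1" -v end="$2" -v steps="$3" 'BEGIN {
        if (steps < 2) { print end + 0; exit }
        size = (end - start) / (steps - 1)
        for (i = 0; i < steps; i++) print start + i * size
    }'
}

# The values used in the command above:
sens_sweep 1 7 3
```

For `--start-sens 1 --sens-steps 3 -s 7` this prints 1, 4, and 7, one per line.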
38+
39+
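The results will be a series of files in the `easy-taxonomy` directory whose names start with the output prefix, which is `sequence_taxonomy` for the command above. The listing below comes from a run that used the prefix `SAGCFN_22_00809_S34`: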
40+
41+
SAGCFN_22_00809_S34_lca.tsv.gz SAGCFN_22_00809_S34_report.gz SAGCFN_22_00809_S34_tophit_aln.gz SAGCFN_22_00809_S34_tophit_report.gz
42+
43+
- `sequence_taxonomy_lca.tsv.gz`: The lowest common ancestor of the sequences in tab separated text.
44+
45+
Example output:
46+
47+
```
48+
R100400180029:20220829140225:V350082744:2:1145432:5:58/2/2 310915 species Pangasianodon hypophthalmus 2 2 1 0.540
49+
```
50+
51+
Columns are:
52+
1. the sequencing read
53+
2. the taxonomy ID from [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/datasets/taxonomy/tree). For example, this is [310915](https://www.ncbi.nlm.nih.gov/datasets/taxonomy/310915/)
54+
3. the taxonomic rank, `species` in this example
55+
4. The organism name. `Pangasianodon hypophthalmus`
56+
5.
57+
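
A quick way to summarise the LCA file is to count how many reads were assigned to each taxon. This is a minimal sketch that relies only on the first four columns described above and assumes the file is tab separated. The function name `count_lca_taxa` is ours.

```bash
# Count reads per assigned taxon from LCA lines on stdin.
# Output columns: read count, taxonomy ID, rank, name (most frequent first).
count_lca_taxa() {
    awk -F '\t' '{ n[$2 "\t" $3 "\t" $4]++ }
                 END { for (k in n) print n[k] "\t" k }' |
        sort -t "$(printf '\t')" -k1,1nr -k2,2n
}

# Usage with the output of the command above:
# gzip -dc easy-taxonomy/sequence_taxonomy_lca.tsv.gz | count_lca_taxa
```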
58+
59+
- `sequence_taxonomy_report.gz`: a Kraken2-style output report
60+
61+
Example output:
62+
63+
```
64+
0.8561 9653 9653 species 310915 Pangasianodon hypophthalmus
65+
```
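
To pull out the species with a reasonable number of reads, you can filter the report on its rank and read-count columns. This sketch assumes the usual Kraken2 report layout (percentage, reads in the clade, reads assigned directly to the taxon, rank, taxonomy ID, name) and tab-separated fields. The function name `species_with_min_reads` is ours.

```bash
# Print species-level rows whose clade read count is at least a threshold.
# Argument: minimum number of reads (column 2). Input: report lines on stdin.
species_with_min_reads() {
    awk -F '\t' -v min="$1" '$4 == "species" && ($2 + 0) >= (min + 0)'
}

# Usage with the output of the command above:
# gzip -dc easy-taxonomy/sequence_taxonomy_report.gz | species_with_min_reads 100
```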
66+
67+
- `sequence_taxonomy_tophit_aln.gz`: the top hit for each read in `blast m8` format
68+
69+
Example output:
70+
71+
```
72+
R100400180029:20220829140225:V350082744:2:1145432:5:58/2/2 UniRef50_UPI00147C5152 0.382 163 30 0 0 50 0 163 1.796E-26 108
73+
```
74+
75+
The columns are:
76+
77+
1. Sequence Read
78+
2. Match database ID. In this case from the [UniRef50](https://www.uniprot.org/) we have sequence [UniRef50_UPI00147C5152](https://www.uniprot.org/uniref/UniRef50_UPI00147C5152)
79+
3. Fraction of identical residues (0.382, i.e. 38.2% identity)
80+
4. Alignment length (163 positions)
81+
5. Number of mismatches (30)
82+
6. Number of gap openings (0)
83+
7. Start on the sequence read (0)
84+
8. End on the sequence read (50)
85+
9. Start on the database sequence (0)
86+
10. End on the database sequence (163)
87+
11. E-value (1.796E-26)
88+
12. Bit score
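
Because this is plain tab-separated text, you can filter the hits with `awk`. The sketch below keeps alignments above a minimum identity (column 3) and below a maximum E-value (column 11). The function name `filter_hits` and the thresholds in the usage comment are ours and are only examples, not recommended cut-offs.

```bash
# Keep alignments with identity >= min_ident and E-value <= max_evalue.
# Arguments: minimum fraction identity, maximum E-value. Input: m8 lines on stdin.
filter_hits() {
    awk -F '\t' -v min_ident="$1" -v max_evalue="$2" \
        '($3 + 0) >= (min_ident + 0) && ($11 + 0) <= (max_evalue + 0)'
}

# Usage with the output of the command above (example thresholds):
# gzip -dc easy-taxonomy/sequence_taxonomy_tophit_aln.gz | filter_hits 0.3 1e-10
```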
89+
90+
91+
- `sequence_taxonomy_tophit_report.gz`: the taxonomy and matches to all of the proteins
92+
93+
Example output
94+
95+
```
96+
UniRef50_UPI00147C5152 6970 0.312 1849.374 0.367 310915 species Pangasianodon hypophthalmus
97+
```
98+
99+
The columns are:
100+
101+
1. Database ID. In this case from the [UniRef50](https://www.uniprot.org/) we have sequence [UniRef50_UPI00147C5152](https://www.uniprot.org/uniref/UniRef50_UPI00147C5152)
102+
2. Number of sequences aligning to target
103+
3. Unique coverage of target (`uniqueAlignedResidues / targetLength`)
104+
4. Target coverage (`alignedResidues / targetLength`)
105+
5. Average sequence identity
106+
6. [NCBI taxonomy](https://www.ncbi.nlm.nih.gov/datasets/taxonomy/tree) identifier
107+
7. Taxonomic level `species`
108+
8. Taxonomic name, in this case `Pangasianodon hypophthalmus`
109+
110+
111+
112+
113+
114+

Workshops/INRB2023.md

Lines changed: 20 additions & 1 deletion
@@ -181,7 +181,26 @@ Then, we can assemble the filtered bacterial sequences using spades:
181181
spades.py --meta -1 not_human/788707_20180129_S_R1.fastq.gz -2 not_human/788707_20180129_S_R2.fastq.gz -o not_human_assembly
182182
```
183183

184-
Once you have assembled the sequences, the first step is to visualise the assembly with [bandage](https://rrwick.github.io/Bandage/).
184+
Once you have assembled the sequences, the first step is to visualise the assembly with [bandage](https://rrwick.github.io/Bandage/). Hopefully you will see an image like this:
185+
186+
![Bandage plot](images/bandage.png)
187+
188+
189+
# Hecatomb
190+
191+
You can [find the Hecatomb tutorial on the readthedocs website](https://hecatomb.readthedocs.io/).
192+
193+
# Kraken annotations
194+
195+
You can download the [kraken2 databases](https://benlangmead.github.io/aws-indexes/k2) and [run Kraken2](../Kraken2/) on your samples.
196+
197+
# Functional annotations
198+
199+
Just as with taxonomy, there are two broad approaches to figuring out the functions that are present.
200+
201+
There is a suite of tools that use [heuristics](https://en.wikipedia.org/wiki/Heuristic) (en Français: heuristique). For example, [superfocus](../SUPER-FOCUS) uses _k_-mers, short sequences, to find the functions that are present.
202+
203+
Another way to find the functions is to use [MMSeqs2](https://github.com/soedinglab/MMseqs2) easy-taxonomy. You can find more details about the easy-taxonomy in the [MMSeqs2 manual](https://mmseqs.com/latest/userguide.pdf).
185204

186205

187206
