-
Notifications
You must be signed in to change notification settings - Fork 79
Decoding haplotypes without trees #3298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
75e1002 to
af7f033
Compare
|
It does looks like a neat idea, but I think performance benefits are not compelling enough (on use-cases that we currently have) to put it on the critical path? |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3298 +/- ##
===========================================
- Coverage 89.79% 48.70% -41.09%
===========================================
Files 29 18 -11
Lines 30962 8479 -22483
Branches 5664 1422 -4242
===========================================
- Hits 27803 4130 -23673
- Misses 1775 4176 +2401
+ Partials 1384 173 -1211
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
|
Yep, not something I want to hold up a release for or put any serious work into. |
I started to have a look at #1896 where one of the possible solutions is decoding haplotypes in C. (The current methods all do this by iterating over sites then returning the full samples*sites matrix).
Recent work on
parent_index_arrayin the JIT code gave me an idea, I then couldn't sleep until I had tried it. The code in this PR is NSFJK (Not Safe For Jerome) but I thought I would share it now as I think it is interesting and I have to draw a line! Iteresting bits are hereThe outline is:
resolvedover sites and aresultstring of ancestral site states.resultof the mutations site to the derived allele, marking these sites asresolvedresultandresolved. Push any parent edges that have unresolved sites on the stackresultshould now be your node's haplotypeHere is the perf on a tree sequence which is a simplified subset of 100k samples of chr 21 Quebecois sim:

The summary being "if you are extracting less than 1000 genotypes the haplotype way is faster" and "if you're grabbing one node it can be 100 times faster"
I have done a little perf work, but I think there are probably big wins to be had by clever caching of haplotypes of nodes higher up the tree or something like that, and not calculating the full parent edge index when you're grabbing a slice of the genome.