Skip to content

Conversation

@benjeffery
Copy link
Member

@benjeffery benjeffery commented Oct 16, 2025

I started to have a look at #1896 where one of the possible solutions is decoding haplotypes in C. (The current methods all do this by iterating over sites then returning the full samples*sites matrix).

Recent work on parent_index_array in the JIT code gave me an idea, I then couldn't sleep until I had tried it. The code in this PR is NSFJK (Not Safe For Jerome) but I thought I would share it now as I think it is interesting and I have to draw a line! Iteresting bits are here

The outline is:

  • Build parent edge index (using the fancy tricks in the JIT code)
  • Then for every node you are interested in:
    • Create a bitset resolved over sites and a result string of ancestral site states.
    • Iterate through the mutations above your chosen node, setting result of the mutations site to the derived allele, marking these sites as resolved
    • Iterate over the parent edges of your node, if they overlap any unresolved sites push them onto a stack
    • While there are entries in the stack:
      • Pop an edge, process mutations you see, marking in result and resolved. Push any parent edges that have unresolved sites on the stack
    • result should now be your node's haplotype
    • Clear the bitset and result, do the next node

Here is the perf on a tree sequence which is a simplified subset of 100k samples of chr 21 Quebecois sim:
variant_vs_haplotype
The summary being "if you are extracting less than 1000 genotypes the haplotype way is faster" and "if you're grabbing one node it can be 100 times faster"

I have done a little perf work, but I think there are probably big wins to be had by clever caching of haplotypes of nodes higher up the tree or something like that, and not calculating the full parent edge index when you're grabbing a slice of the genome.

@benjeffery benjeffery marked this pull request as draft October 16, 2025 12:36
@jeromekelleher
Copy link
Member

It does looks like a neat idea, but I think performance benefits are not compelling enough (on use-cases that we currently have) to put it on the critical path?

@codecov
Copy link

codecov bot commented Oct 16, 2025

Codecov Report

❌ Patch coverage is 10.30928% with 261 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.70%. Comparing base (20d630b) to head (dfd3055).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
python/tskit/jit/numba.py 0.00% 211 Missing ⚠️
python/tskit/trees.py 37.50% 37 Missing and 13 partials ⚠️

❗ There is a different number of reports uploaded between BASE (20d630b) and HEAD (dfd3055). Click for more details.

HEAD has 15 uploads less than BASE
Flag BASE (20d630b) HEAD (dfd3055)
c-tests 1 0
python-tests-no-jit 6 0
lwt-tests 1 0
python-c-tests 1 0
python-tests 6 0
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3298       +/-   ##
===========================================
- Coverage   89.79%   48.70%   -41.09%     
===========================================
  Files          29       18       -11     
  Lines       30962     8479    -22483     
  Branches     5664     1422     -4242     
===========================================
- Hits        27803     4130    -23673     
- Misses       1775     4176     +2401     
+ Partials     1384      173     -1211     
Flag Coverage Δ
c-tests ?
lwt-tests ?
python-c-tests ?
python-tests ?
python-tests-no-jit ?
python-tests-numpy1 48.70% <10.30%> (-1.36%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
python/tskit/trees.py 66.55% <37.50%> (-32.30%) ⬇️
python/tskit/jit/numba.py 0.00% <0.00%> (-97.91%) ⬇️

... and 23 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@benjeffery
Copy link
Member Author

Yep, not something I want to hold up a release for or put any serious work into.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants