Decoding haplotypes without trees #3298

benjeffery · 2025-10-16T12:36:17Z

I started to have a look at #1896, where one of the possible solutions is decoding haplotypes in C. (The current methods all do this by iterating over sites and then returning the full samples × sites matrix.)

Recent work on parent_index_array in the JIT code gave me an idea, and I couldn't sleep until I had tried it. The code in this PR is NSFJK (Not Safe For Jerome), but I thought I would share it now as I think it is interesting and I have to draw a line! Interesting bits are here

The outline is:

- Build a parent edge index (using the fancy tricks in the JIT code).
- Then, for every node you are interested in:
  - Create a bitset `resolved` over sites and a `result` string of ancestral site states.
  - Iterate through the mutations above your chosen node, setting `result` at each mutation's site to the derived allele and marking these sites as resolved.
  - Iterate over the parent edges of your node; if they overlap any unresolved sites, push them onto a stack.
  - While there are entries in the stack:
    - Pop an edge and process the mutations you see, marking them in `result` and `resolved`. Push any parent edges that overlap unresolved sites onto the stack.
  - `result` should now be your node's haplotype.
  - Clear the bitset and `result`, then do the next node.
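The outline above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: the function names (`build_parent_index`, `group_mutations`, `decode_haplotype`), the tuple layouts, and the toy data are all hypothetical, a Python int stands in for the bitset, and each stack entry carries the genomic interval over which that node is still an ancestor of the chosen node. It also assumes mutations are listed oldest first within a site, so the most recent mutation on a node is seen first when iterating in reverse.

```python
from bisect import bisect_left
from collections import defaultdict


def build_parent_index(edges):
    """Map each child node to the edges (left, right, parent) above it."""
    index = defaultdict(list)
    for left, right, parent, child in edges:
        index[child].append((left, right, parent))
    return index


def group_mutations(mutations):
    """Map each node to its mutations (site_index, derived_state), in table order."""
    by_node = defaultdict(list)
    for site, node, derived in mutations:
        by_node[node].append((site, derived))
    return by_node


def decode_haplotype(node, seq_len, parent_index, site_pos, ancestral, node_muts):
    """Walk upwards from `node`, resolving sites with the nearest mutation."""
    result = list(ancestral)  # start from the ancestral states
    resolved = 0              # bit i set means site i has been resolved
    stack = [(node, 0, seq_len)]
    while stack:
        u, lo, hi = stack.pop()
        first = bisect_left(site_pos, lo)
        last = bisect_left(site_pos, hi)
        # Newest mutation first, so it wins over older ones on the same node.
        for site, derived in reversed(node_muts.get(u, ())):
            if first <= site < last and not (resolved >> site) & 1:
                result[site] = derived
                resolved |= 1 << site
        for left, right, parent in parent_index.get(u, ()):
            e_lo, e_hi = max(lo, left), min(hi, right)
            if e_lo >= e_hi:
                continue  # edge does not overlap the interval we carry
            a = bisect_left(site_pos, e_lo)
            b = bisect_left(site_pos, e_hi)
            mask = ((1 << (b - a)) - 1) << a
            if (resolved & mask) != mask:  # some site here is still unresolved
                stack.append((parent, e_lo, e_hi))
    return "".join(result)


# Toy data (hypothetical): samples 0 and 1, internal nodes 2 and 3, root 4.
edges = [(0, 5, 2, 0), (5, 10, 3, 0), (0, 10, 2, 1), (5, 10, 4, 2), (5, 10, 4, 3)]
site_pos = [1, 3, 6, 8]
ancestral = "ACGT"
mutations = [(0, 2, "T"), (1, 0, "G"), (2, 2, "A"), (3, 4, "C"), (3, 3, "A")]

index = build_parent_index(edges)
muts = group_mutations(mutations)
print(decode_haplotype(0, 10, index, site_pos, ancestral, muts))  # TGGA
print(decode_haplotype(1, 10, index, site_pos, ancestral, muts))  # TCAC
```

Because each position has at most one parent edge above a node, the nodes on the path for any given site are visited in child-to-ancestor order regardless of how the stack interleaves edges, which is why the first mutation seen at a site can be taken as final.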

Here is the perf on a tree sequence that is a simplified subset of 100k samples from the chr 21 Quebecois sim:

The summary being "if you are extracting fewer than 1000 genotypes, the haplotype way is faster" and "if you're grabbing one node, it can be 100 times faster".

I have done a little perf work, but I think there are probably big wins to be had by clever caching of haplotypes of nodes higher up the tree or something like that, and not calculating the full parent edge index when you're grabbing a slice of the genome.

jeromekelleher · 2025-10-16T12:40:47Z

It does look like a neat idea, but I think the performance benefits are not compelling enough (on use-cases that we currently have) to put it on the critical path?

codecov · 2025-10-16T12:42:22Z

Codecov Report

❌ Patch coverage is 10.30928% with 261 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.70%. Comparing base (20d630b) to head (dfd3055).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
python/tskit/jit/numba.py	0.00%	211 Missing ⚠️
python/tskit/trees.py	37.50%	37 Missing and 13 partials ⚠️

❗ There is a different number of reports uploaded between BASE (20d630b) and HEAD (dfd3055).

HEAD has 15 uploads less than BASE

Flag	BASE (20d630b)	HEAD (dfd3055)
c-tests	1	0
python-tests-no-jit	6	0
lwt-tests	1	0
python-c-tests	1	0
python-tests	6	0

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #3298       +/-   ##
===========================================
- Coverage   89.79%   48.70%   -41.09%     
===========================================
  Files          29       18       -11     
  Lines       30962     8479    -22483     
  Branches     5664     1422     -4242     
===========================================
- Hits        27803     4130    -23673     
- Misses       1775     4176     +2401     
+ Partials     1384      173     -1211

Flag	Coverage Δ
c-tests	`?`
lwt-tests	`?`
python-c-tests	`?`
python-tests	`?`
python-tests-no-jit	`?`
python-tests-numpy1	`48.70% <10.30%> (-1.36%)`	⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
python/tskit/trees.py	`66.55% <37.50%> (-32.30%)`	⬇️
python/tskit/jit/numba.py	`0.00% <0.00%> (-97.91%)`	⬇️

... and 23 files with indirect coverage changes

benjeffery · 2025-10-16T12:54:01Z

Yep, not something I want to hold up a release for or put any serious work into.

benjeffery added 9 commits October 15, 2025 19:52

- Numba only (0c1eeb4)
- First rough (5ec04ef)
- Tests pass (1330b43)
- Remove dead code (dab5732)
- Rename to parent (6f1fea4)
- Remove qsort (cbc6512)
- Fix warnings (24cf27a)
- Perf (d3029a6)
- Perf - Shift coverage handling to whole-word bit math. (22ae99a)

benjeffery marked this pull request as draft October 16, 2025 12:36

- Comments (af7f033)

benjeffery force-pushed the alignments branch from 75e1002 to af7f033 October 16, 2025 12:39

- Fix jit (dfd3055)
