Add json+binary codec #3306
bae6b68 to 6c62d45
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff            @@
##             main    #3306    +/-   ##
========================================
+ Coverage   89.80%   89.82%   +0.02%
========================================
  Files          29       29
  Lines       31026    31097     +71
  Branches     5679     5686      +7
========================================
+ Hits        27863    27934     +71
  Misses       1777     1777
  Partials     1386     1386
This is the right way.
Hi @benjeffery! This looks like it might provide a reasonable solution, sure. You prefer it to the proposal you made before, I take it? Because at least it's just more top-level metadata, rather than an entirely new thing? Anyhow, I guess my main question is how this will look from the C API perspective. I see you've got a special header:
which you parse in the new codec to split out the JSON and binary parts. Do you intend to provide some kind of wrapper for this in C, so that clients don't need to hard-code the structure of this header, at write and at read? (Or at least make the "magic" and the codec version be public in a C header, so that those aspects of it don't need to be hard-coded in SLiM.)

@petrelharp will this work for you on the pyslim end of things? Ah, I see your comment above now. :->
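To make the question concrete, here is a minimal Python sketch of what validating such a header might involve. The layout used here (a 4-byte magic, a 1-byte version, and two little-endian 64-bit lengths) and the names `MAGIC`, `VERSION`, `HEADER`, and `read_header` are assumptions made for illustration; they are not taken from the PR, whose actual header is not reproduced in this thread.

```python
import struct

# Hypothetical layout, for illustration only (not the layout used in the PR):
# 4-byte magic, 1-byte version, then two little-endian uint64 lengths.
MAGIC = b"DEMO"
VERSION = 1
HEADER = struct.Struct("<4sBQQ")  # magic, version, json_len, blob_len


def read_header(buf):
    """Validate the header and return (json_len, blob_len)."""
    if len(buf) < HEADER.size:
        raise ValueError("buffer is shorter than the header")
    magic, version, json_len, blob_len = HEADER.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError("bad magic bytes")
    if version != VERSION:
        raise ValueError(f"unsupported version {version}")
    return json_len, blob_len
```

Every client that reads this format would otherwise need its own copy of these constants and checks, which is the duplication a shared wrapper or public header would avoid.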
6c62d45 to 6f871a5
Hi @benjeffery, just a ping on this; wondering about the C API plan for this PR. Apart from that concern, this proposal looks great to me; once it's merged @petrelharp and I can update the tskit copy in SLiM to this and test it out. :-> Thanks!
Sorry, school holidays here so had some leave. I'm not sure about this; it's been a long-standing principle that the C code doesn't touch the contents of metadata or have anything to do with codecs. I can see a small helper that just extracts the binary buffer being helpful though. I've added that in the last commit.
8625ace to 47d86c9
:->
This looks good to me, thanks. I can see the desire for the C code not to get into metadata/codec stuff, but this seems like a good exception to that principle to carve out; otherwise every client of this codec would have to reinvent this (potentially bug-prone) wheel. I think this is the right design. As far as I can tell there is no dependence here on the JSON string being null-terminated; is that correct? Looks like after calling And since the header length and magic bytes and such are in I'm happy with this, thanks!
Yes, that's right as we have fixed lengths. Makes me think the helper should return pointers/lengths to both parts so that API users can grab and decode the JSON?
That would perhaps be an improvement, yes. It's easy enough for the client to do the math as it is now; but it does seem better for the helper to do it, so that the client is not building in assumptions about the layout of the metadata. Actually, now that I think about it more, the client would also have to build in an assumption about the number of bytes in the codec's header, wouldn't they? If that's true, then the helper should definitely return pointers/lengths for both parts; we don't want the client's code knowing that sort of thing.
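The idea of a helper that hands back both parts by position and length can be sketched in Python as follows. This is an illustration only: the header layout, `MAGIC`, and the function name `split_parts` are hypothetical and do not describe the C helper added in the PR. The point it shows is that both parts are located purely from the lengths stored in the header, so nothing depends on the JSON being null-terminated.

```python
import json
import struct

# Hypothetical layout for illustration; the real header may differ.
HEADER = struct.Struct("<4sBQQ")  # magic, version, json_len, blob_len
MAGIC = b"DEMO"


def split_parts(buf):
    """Return (json_view, blob_view) as zero-copy slices of buf."""
    view = memoryview(buf)
    magic, _version, json_len, blob_len = HEADER.unpack_from(view, 0)
    if magic != MAGIC:
        raise ValueError("bad magic bytes")
    json_start = HEADER.size
    blob_start = json_start + json_len
    if blob_start + blob_len > len(view):
        raise ValueError("declared lengths exceed the buffer")
    return view[json_start:blob_start], view[blob_start:blob_start + blob_len]


# The blob deliberately contains a zero byte; explicit lengths make that harmless.
payload = HEADER.pack(MAGIC, 1, 7, 3) + b'{"a":1}' + b"\x00\x01\x02"
json_view, blob_view = split_parts(payload)
decoded = json.loads(bytes(json_view))
```

Because the caller receives views rather than offsets it computed itself, it never needs to know the header size or the ordering of the two parts.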
Added returning of the JSON along with the blob.
This generally looks good to me. One question is should we provide a "setter" function in C also if we're providing a getter?
Good point, I wasn't thinking about that end of things! It does seem like a good idea to not have the client hard-coding the structure of the header and such.
facepalm! Of course.
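A "setter" in this sense would assemble the header and both parts on the caller's behalf. The Python sketch below illustrates the idea under an assumed layout; `HEADER`, `MAGIC`, `VERSION`, and `pack_parts` are hypothetical names and do not correspond to any function in the PR.

```python
import json
import struct

# Hypothetical layout for illustration; the real header may differ.
HEADER = struct.Struct("<4sBQQ")  # magic, version, json_len, blob_len
MAGIC = b"DEMO"
VERSION = 1


def pack_parts(json_obj, blob):
    """Build header + JSON + blob so callers never hand-assemble the header."""
    json_bytes = json.dumps(json_obj, separators=(",", ":")).encode("utf-8")
    header = HEADER.pack(MAGIC, VERSION, len(json_bytes), len(blob))
    return header + json_bytes + bytes(blob)


packed = pack_parts({"generation": 10}, b"\xde\xad")
magic, version, json_len, blob_len = HEADER.unpack_from(packed, 0)
```

Pairing a writer like this with the reader keeps all knowledge of the layout in one place, so a later change to the header would not require changes in client code.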
@bhaller Talking this over with JK just now, we realised that it is a problem that the bytes in the binary blob are not easily interpretable by tskit users (e.g. pyslim), nor described in the schema. To fix this we propose a metadata codec that allows specification of JSON and binary in one schema. It would look like:

It would be an error to have identical keys in the two schemas, as the returned metadata object merges the two key lists. This allows minimal changes to existing code, as opposed to returning with a top-level split as in the schema.

This makes no difference at all to the C side, except that the binary should be encoded in a

This is quite a bit more work, so I will have to return to it after tskit 1.0 is out the door, but as it needs no C changes you can go ahead and start developing assuming the current binary structure. Hopefully that all makes sense!
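The merging rule described in this comment (keys from the JSON part and the binary part end up in one metadata object, and a key present in both is an error) can be sketched as below. This is only an illustration of the rule: the field names, the `(name, struct format)` description of the binary part, and the functions `check_disjoint` and `decode_merged` are all hypothetical, since the proposed schema format is not shown in this thread.

```python
import json
import struct

# Hypothetical field descriptions; the proposed schema format is not shown here.
JSON_KEYS = {"model", "tick"}
BINARY_FIELDS = [("mutation_count", "<I"), ("fitness", "<d")]


def check_disjoint(json_keys, binary_fields):
    """Reject schemas that define the same key in both parts."""
    clash = set(json_keys) & {name for name, _ in binary_fields}
    if clash:
        raise ValueError(f"keys defined in both schemas: {sorted(clash)}")


def decode_merged(json_bytes, blob):
    """Decode both parts and return a single merged dictionary."""
    check_disjoint(JSON_KEYS, BINARY_FIELDS)
    merged = json.loads(json_bytes)
    offset = 0
    for name, fmt in BINARY_FIELDS:
        (merged[name],) = struct.unpack_from(fmt, blob, offset)
        offset += struct.calcsize(fmt)
    return merged


blob = struct.pack("<I", 5) + struct.pack("<d", 0.5)
metadata = decode_merged(b'{"model":"WF","tick":3}', blob)
```

With a flat result like this, code that already reads keys from a metadata dictionary would not need to know which part a given key came from.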
Hmm, I see. I'm fine with this if you guys want to get into it. I think I can see how having a schema for the binary might make life easier for pyslim, since then I guess all the binary metadata would automatically be parsed and made accessible by tskit on the Python side?

If the current PR's design doesn't have a schema for the JSON metadata, then I agree that that is clearly a problem; having that schema is essential. If it is just that the current PR's design doesn't have a schema for the binary metadata, but does have one for the JSON (which I assumed was the case, but hadn't really thought about as much as I should have!), then I'm not sure how much of a shortcoming that really is in practice, but I can see the broad motivations for attaching a schema to everything: completeness, consistency, transparency, etc. @petrelharp any thoughts here on how useful this would be for pyslim?

It's unfortunate that this won't make it into tskit 1.0, just from the perspective of wanting tskit 1.0 to be kind of a completed vision with no loose ends; but of course that's an impossible dream, and this new proposal sounds like more work for sure, so no worries, push it out. :-O I will end up needing it in the next couple of months, though, for the next release of SLiM, so I hope we're not talking about pushing it too terribly far out; what timeframe would you estimate?

Purely out of curiosity (I have no need/desire to do this in practice at present), would this new metadata codec be usable elsewhere in tskit? I.e., could you declare that the metadata column on any given tskit table uses this new codec? Or would its use be limited to the top-level metadata? I don't really have a clear idea of how all the metadata bits and pieces fit together architecturally in tskit. :->

Thanks for taking this on! As always, things turn out to be more complicated than originally envisioned. There must be a Somebody's Law about that. I greatly appreciate your time and effort.
tskit 1.0 is a forward-compatibility boundary, not a statement of completeness. This work involves no breaking changes, so it doesn't matter which side of that boundary it sits on. I'm hoping to have 1.0 out in the order of 1-2 weeks, with this work to follow after. That certainly fits in the timescale you have above.
That sounds great, @benjeffery.
Add a metadata codec that supports JSON and a binary blob.