Clarify encoding of binary metadata for binary input/output
Two things here. First, I want to check whether arbitrary bytes are allowed in the `metadata`, `derived_state`, and `ancestral_state` columns, for tskit to deal with them. (I believe they are now, but want to check everyone is on board with this.)
Second, I run into issues when trying to actually do this. We are storing in `derived_state` a sequence of ints that indicate a set of SLiM's mutations, just packed into the `char *`, as done here. Naively, I was hoping that I could `table_collection_dump()` these tables, then read them into Python and do things. But when I try to do many things with the tree sequence in Python, I get UTF-8 errors, like:
>>> ts = msprime.load("test_output/test_output.treeseq")
>>> vv = ts.variants()
>>> v = next(vv)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/peter/.local/lib/python3.6/site-packages/msprime/trees.py", line 1995, in variants
    for site_id, genotypes, alleles in iterator:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 0: invalid continuation byte
I have not yet tracked down where this assumption comes in, but I don’t understand the big picture. Could you clarify, @jeromekelleher?
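For illustration, the same kind of failure can be reproduced without msprime. This is only a minimal sketch: the helper names and the little-endian 32-bit layout are assumptions made for the example, not the actual SLiM or tskit format. It packs a couple of integers into raw bytes, shows that decoding them as UTF-8 fails, and shows that the bytes still round-trip if they are left undecoded.

```python
import struct


def pack_mutation_ids(ids):
    """Pack ints as raw little-endian 32-bit values (hypothetical layout)."""
    return struct.pack("<%di" % len(ids), *ids)


def unpack_mutation_ids(raw):
    """Inverse of pack_mutation_ids: recover the list of ints from bytes."""
    count = len(raw) // 4
    return list(struct.unpack("<%di" % count, raw))


# 220 == 0xdc, which UTF-8 treats as the lead byte of a two-byte sequence.
raw = pack_mutation_ids([220, 7])

decode_error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    # The byte after 0xdc is 0x00, which is not a valid continuation byte.
    decode_error = err
    print("decode failed:", err)

# Treated as plain bytes, the payload is recovered intact.
print(unpack_mutation_ids(raw))
```

The point of the sketch is that the packed payload is fine as bytes; the error only appears once something assumes the column holds text.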
omg emoji mutations! @bhaller, I have a new proposal for how to display derived states…
I think we will want to use them under the hood as arbitrary bytes; converting to unicode before output is no big deal. We’ll still have display issues since they won’t be single characters, but we could write an extension to do the emoji-displaying…
Keeping arbitrary bytes in the actual table from C is no problem. We can add an option to bypass the Unicode encoding/decoding step in Python by setting `encoding=None` or something similar. I just put the encoding option in there as a placeholder really, as I didn't actually test anything other than utf8. Anyway, we can definitely accommodate this use case without too much trouble, I think.
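A rough sketch of what such a bypass could look like follows. The function `decode_state` and its signature are hypothetical, written only to illustrate the idea; this is not the msprime or tskit API.

```python
def decode_state(raw, encoding="utf8"):
    """Return a state value as text, or as raw bytes when encoding is None.

    Hypothetical helper: `encoding=None` means "skip the decoding step".
    """
    if encoding is None:
        return raw
    return raw.decode(encoding)


# Text alleles decode as before.
print(decode_state(b"A"))

# Packed binary states are passed through untouched.
print(decode_state(b"\xdc\x00\x00\x00", encoding=None))
```

With a switch like this, callers that store text alleles keep getting strings by default, while callers that pack binary data opt out of decoding explicitly.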