Clarify encoding of binary metadata for binary input/output

Two things here. First, I want to check whether arbitrary bytes are allowed in the metadata, derived_state, and ancestral_state columns, and whether tskit is expected to handle them. (I believe they are now, but I want to check that everyone is on board with this.)

Second, I run into issues when trying to actually do this. We are storing in derived_state a sequence of ints that indicate a set of SLiM’s mutations, just packed into the char *, as done here. Naively, I was hoping that I could table_collection_dump() these tables, then read them into Python and do things. But when I try to do many things with the tree sequence in Python, I get UTF-8 errors, like:

>>> ts = msprime.load("test_output/test_output.treeseq")
>>> vv = ts.variants()
>>> v = next(vv)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/peter/.local/lib/python3.6/site-packages/msprime/trees.py", line 1995, in variants
    for site_id, genotypes, alleles in iterator:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdc in position 0: invalid continuation byte

I have not yet tracked down where this assumption comes in, but I don’t understand the big picture. Could you clarify, @jeromekelleher?
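
For illustration, here is a minimal sketch of why packed integers tend to trip a UTF-8 decoder. The mutation IDs and the little-endian 32-bit layout are assumptions made up for this example, not SLiM's actual encoding; the point is only that raw integer bytes are generally not valid UTF-8.

```python
import struct

# Hypothetical mutation IDs, chosen only for illustration.
mutation_ids = [220, 1021, 70000]

# Pack them as little-endian 32-bit ints, roughly as one might
# copy an int array into a char * buffer in C (layout is an assumption).
derived_state = struct.pack("<%di" % len(mutation_ids), *mutation_ids)

# Treating the buffer as text fails: the first byte (0xdc) announces a
# two-byte UTF-8 sequence, but the next byte is not a continuation byte.
try:
    derived_state.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as err:
    decoded_ok = False
    print("decode failed:", err.reason)

# Treating the buffer as bytes round-trips without loss.
count = len(derived_state) // 4
recovered = list(struct.unpack("<%di" % count, derived_state))
```

Keeping the column as bytes and unpacking it explicitly avoids the decoder entirely, which is what the question above is really asking permission to do.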

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments:6 (6 by maintainers)

Top GitHub Comments

petrelharp commented, Apr 17, 2018 (2 reactions)

omg emoji mutations! @bhaller, I have a new proposal for how to display derived states…

I think we will want to use them under the hood as arbitrary bytes; converting to unicode before output is no big deal. We’ll still have display issues since they won’t be single characters, but we could write an extension to do the emoji-displaying…

jeromekelleher commented, Apr 17, 2018 (1 reaction)

> I think we will want to use them under the hood as arbitrary bytes; converting to unicode before output is no big deal. We’ll still have display issues since they won’t be single characters, but we could write an extension to do the emoji-displaying…

Keeping arbitrary bytes in the actual table from C is no problem. We can add an option to bypass the unicode encoding/decoding step in Python by setting encoding=None or something similar. I just put the encoding option in there as a placeholder really, as I didn’t actually test anything other than utf8. Anyway, we can definitely accommodate this use case without too much trouble, I think.
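
A rough sketch of what such a bypass could look like follows. The helper name decode_states and its signature are hypothetical and are not part of msprime or tskit; it only illustrates the idea of skipping the decode step when encoding is None.

```python
def decode_states(raw_states, encoding="utf8"):
    """Hypothetical helper: convert raw allele bytes for presentation.

    With encoding=None the bytes are returned untouched, so binary
    derived states never pass through a text decoder.
    """
    if encoding is None:
        return list(raw_states)
    return [raw.decode(encoding) for raw in raw_states]


# Ordinary nucleotide alleles decode to strings by default.
text_alleles = decode_states([b"A", b"T"])

# Binary alleles are passed through as bytes when decoding is disabled.
binary_alleles = decode_states([b"A", b"\xdc\x00\x00\x00"], encoding=None)
```

Under this design the default behaviour stays the same for text alleles, and callers storing packed binary data opt out explicitly.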
