Metadata schemas and automatic decoding

The metadata API allows you to store any byte string along with nodes, sites and mutations. This allows us to store pickled Python objects, for example, which will work fine. However, this isn’t great for portability:

  1. It won’t work in any language other than Python.
  2. Pickling/unpickling requires the class definition to be importable when the data is loaded, which can get tricky. In practice, this would probably end up causing compatibility headaches across code versions.
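
A minimal sketch of the second point, using only the standard library. It assumes nothing about msprime; `fractions.Fraction` is just a convenient stand-in for a metadata class. The pickled payload names the module and class rather than carrying the definition, so the reader must be able to import that exact class, whereas JSON bytes carry only plain data.

```python
import json
import pickle
from fractions import Fraction

# Pickle an instance of an ordinary Python class (a stand-in for custom metadata).
pickled = pickle.dumps(Fraction(1, 8))

# The payload refers to the class by module and name; the definition is not stored.
names_module = b"fractions" in pickled
names_class = b"Fraction" in pickled

# Unpickling works here only because the class can be imported in this process.
restored = pickle.loads(pickled)

# JSON stores plain data, so any language with a JSON parser can read the bytes.
encoded = json.dumps({"one": "node1", "two": "node2"}).encode()
decoded = json.loads(encoded.decode())
```

If the class is renamed or moved between code versions, `pickle.loads` can no longer resolve the stored name, which is the compatibility headache described above.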

It would be better if we could use some third-party approach for encoding metadata so that we can automatically decode the stored information into a Python object using a given schema. I’m thinking of things like JSON-schema and protobuf. If the schema is also stored in the HDF5 file, then the schema goes along with the data, making it much more likely the data can be interpreted correctly later.
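
To illustrate why keeping the schema next to the data helps, here is a rough standard-library sketch. The `stored` dict is a hypothetical stand-in for the file, and `check_flat_object` is a deliberately tiny helper written for this example; it is not a JSON Schema validator and handles only `required` keys and string-typed properties.

```python
import json

SCHEMA = {
    "title": "Example Metadata",
    "type": "object",
    "properties": {"one": {"type": "string"}, "two": {"type": "string"}},
    "required": ["one", "two"],
}


def check_flat_object(instance, schema):
    """Return a list of problems; covers only required keys and string types."""
    errors = []
    for key in schema.get("required", []):
        if key not in instance:
            errors.append(f"missing required property: {key}")
    for key, rule in schema.get("properties", {}).items():
        if key in instance and rule.get("type") == "string":
            if not isinstance(instance[key], str):
                errors.append(f"property {key} is not a string")
    return errors


# Hypothetical stand-in for the file: the schema travels with the encoded rows.
stored = {
    "node_metadata_schema": json.dumps(SCHEMA),
    "node_metadata": [json.dumps({"one": "node1", "two": "node2"}).encode()],
}

# A consumer needs only what is in the bundle to interpret and check the rows.
schema = json.loads(stored["node_metadata_schema"])
rows = [json.loads(raw.decode()) for raw in stored["node_metadata"]]
problems = [check_flat_object(row, schema) for row in rows]

# A malformed row is reported instead of being silently misread.
bad = check_flat_object({"one": 1}, schema)
```

In real use, a full validator for JSON-schema would replace the hand-written check.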

I’ve done a little initial experimentation with this in #330, which shows how we can use JSON-schema, where the user is responsible for managing the schemas and so on. Here is a sketch of how it might work in the future if we put in some infrastructure for automatically decoding metadata into Python objects via a few pluggable hooks:

import json

import msprime
import python_jsonschema_objects as pjs

# Your schema for custom metadata defined using json-schema
schema = """{
    "title": "Example Metadata",
    "type": "object",
    "properties": {
        "one": {"type": "string"},
        "two": {"type": "string"}
    },
    "required": ["one", "two"]
}"""

# Load this schema into an object builder, so that we can make metadata objects
# using https://github.com/cwacek/python-jsonschema-objects
builder = pjs.ObjectBuilder(json.loads(schema))
ns = builder.build_classes()
# Make a new metadata object, and assign it to a row in the nodes table.
# (nodes and edges are assumed to be existing node and edge tables.)
metadata = ns.ExampleMetadata(one="node1", two="node2")
encoded = json.dumps(metadata.as_dict()).encode()
nodes.add_row(time=0.125, metadata=encoded)
# Load this table into a tree sequence, and store the schema so it can be retrieved later.
ts = msprime.load_tables(nodes=nodes, edges=edges, sequence_length=1)
ts.node_metadata_schema = schema
ts.dump("metadata-example.hdf5")

# Later, on another system, etc, we can load up the metadata and work with it.
ts = msprime.load("metadata-example.hdf5")
schema = json.loads(ts.node_metadata_schema)
# The schema has already been parsed, so pass the dict straight to the builder.
builder = pjs.ObjectBuilder(schema)
ns = builder.build_classes()
def decoder(encoded):
    return ns.ExampleMetadata.from_json(encoded.decode())

# If a decoder function is set, accesses to ``node.metadata`` are intercepted and the decoder
# function is used to translate the raw bytes into a Python object.
ts.node_metadata_decoder = decoder
node = ts.node(0)
node.metadata.one  # == "node1"
node.metadata.two  # == "node2"
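
The interception described in the last comment could be sketched roughly as follows. This is an assumption about one possible design, not msprime code: the `Node` class, its `decoder` argument, and `json_decoder` are hypothetical names invented for this illustration.

```python
import json
from types import SimpleNamespace


class Node:
    """Hypothetical row wrapper used only for this sketch."""

    def __init__(self, raw_metadata, decoder=None):
        self._raw_metadata = raw_metadata
        self._decoder = decoder

    @property
    def metadata(self):
        # Without a decoder the caller gets the raw bytes, as it does today.
        if self._decoder is None:
            return self._raw_metadata
        # With a decoder set, every access is translated into a Python object.
        return self._decoder(self._raw_metadata)


def json_decoder(encoded):
    # Expose the decoded JSON keys as attributes.
    return SimpleNamespace(**json.loads(encoded.decode()))


raw = json.dumps({"one": "node1", "two": "node2"}).encode()
plain = Node(raw)
decoded = Node(raw, decoder=json_decoder)
```

Because the hook is just a callable from bytes to an object, the same mechanism would work for a protobuf decoder or any other pluggable format.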

This is all a bit vague, but I thought I’d open an issue to start discussion on the topic. This is definitely not something we should tackle before 0.5.0 is released; none of this is backwards incompatible with what we have now.

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

jeromekelleher commented, Mar 18, 2020

In my mind the consumers vastly out-number creators, and they will often be people with limited programming experience. Our job is to make their lives as easy as possible, and to make it easy for them to do the right thing.

jeromekelleher commented, Mar 13, 2020

Do we care about the C API providing encode/decode? I don’t yet have a good feel for where the high-level / low-level boundary is.

This is definitely a non-goal. It’s just too complicated I think, and people can build what they need themselves on top of the raw bytes we’re providing.
