Metadata schemas and automatic decoding
The metadata API allows you to store an arbitrary byte string along with nodes, sites and mutations. This lets us store pickled Python objects, for example, which will work fine. However, this isn't great for portability:
- Obviously it won't work in any language other than Python.
- Pickling/unpickling requires the class definition to be in the current namespace, which can get tricky. In practice, this would probably end up causing compatibility headaches across code versions.
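To make the portability point concrete, here is a minimal standard-library comparison (the `record` dict is just a stand-in for real metadata, not an API from msprime):

```python
import json
import pickle

record = {"one": "node1", "two": "node2"}

# Pickle produces Python-specific bytes: only Python, with the right class
# definitions importable, can decode them later.
pickled = pickle.dumps(record)

# JSON produces a plain UTF-8 byte string that any language can parse.
encoded = json.dumps(record).encode()
decoded = json.loads(encoded.decode())
assert decoded == record
```

For plain dicts both round-trip fine; the difference only bites once custom classes or other languages are involved.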
It would be better if we could use some third-party approach for encoding metadata so that we can automatically decode the stored information into a Python object using a given schema. I’m thinking of things like JSON-schema and protobuf. If the schema is also stored in the HDF5 file, then the schema goes along with the data, making it much more likely the data can be interpreted correctly later.
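To sketch what "decoding against a stored schema" buys us, here is a toy validator covering just the required-string-properties subset of JSON-schema. This is purely illustrative: real code would use a library such as jsonschema rather than this hand-rolled `validate` function.

```python
import json

# Toy validator: checks top-level type, required keys, and property types.
# Only the tiny subset of JSON-schema used in this example is supported.
def validate(instance, schema):
    type_map = {"object": dict, "string": str}
    if not isinstance(instance, type_map[schema["type"]]):
        raise ValueError("instance has the wrong type")
    for key in schema.get("required", []):
        if key not in instance:
            raise ValueError(f"missing required property: {key}")
    for key, prop in schema.get("properties", {}).items():
        if key in instance and not isinstance(instance[key], type_map[prop["type"]]):
            raise ValueError(f"property {key} has the wrong type")

schema = json.loads("""{
    "type": "object",
    "properties": {"one": {"type": "string"}, "two": {"type": "string"}},
    "required": ["one", "two"]
}""")
validate({"one": "a", "two": "b"}, schema)  # passes silently
```

Because the schema itself is plain JSON, it can be stored in the file alongside the metadata and re-read by any consumer.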
I’ve done a little initial experimentation with this in #330, which shows how we can use JSON-schema, where the user is responsible for managing the schemas and so on. Here is a sketch of how it might work in the future if we put in some infrastructure for automatically decoding metadata into Python objects via a few pluggable hooks:
```python
import json

import msprime
import python_jsonschema_objects as pjs

# Your schema for custom metadata defined using json-schema
schema = """{
    "title": "Example Metadata",
    "type": "object",
    "properties": {
        "one": {"type": "string"},
        "two": {"type": "string"}
    },
    "required": ["one", "two"]
}"""
# Load this schema into an object builder, so that we can make metadata objects.
# Using https://github.com/cwacek/python-jsonschema-objects
builder = pjs.ObjectBuilder(json.loads(schema))
ns = builder.build_classes()
# Make a new metadata object, and assign it to a row in the nodes table.
metadata = ns.ExampleMetadata(one="node1", two="node2")
encoded = json.dumps(metadata.as_dict()).encode()
nodes.add_row(time=0.125, metadata=encoded)
# Load this table into a tree sequence, and store the schema so it can be
# retrieved later.
ts = msprime.load_tables(nodes=nodes, edges=edges, sequence_length=1)
ts.node_metadata_schema = schema
ts.dump("metadata-example.hdf5")
# Later, on another system, etc., we can load up the metadata and work with it.
ts = msprime.load("metadata-example.hdf5")
schema = json.loads(ts.node_metadata_schema)
builder = pjs.ObjectBuilder(schema)
ns = builder.build_classes()

def decoder(encoded):
    return ns.ExampleMetadata.from_json(encoded.decode())

# If a decoder function is set, accesses to ``node.metadata`` are intercepted
# and the decoder function is used to translate the raw bytes into a Python
# object.
ts.node_metadata_decoder = decoder
node = ts.node(0)
node.metadata.one  # == "node1"
node.metadata.two  # == "node2"
```
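Since the `node_metadata_schema` and `node_metadata_decoder` hooks above don't exist yet, the core encode/decode cycle they rely on can be shown with the standard library alone. The `raw_metadata` bytes and the `SimpleNamespace` result are stand-ins for a real table row and generated class:

```python
import json
from types import SimpleNamespace

# Metadata as it would sit in a table row: an opaque byte string.
raw_metadata = json.dumps({"one": "node1", "two": "node2"}).encode()

# A decoder hook turns the stored bytes back into a convenient Python object
# with attribute access, mimicking what the generated schema class would give.
def decoder(encoded: bytes) -> SimpleNamespace:
    return SimpleNamespace(**json.loads(encoded.decode()))

metadata = decoder(raw_metadata)
assert metadata.one == "node1"
assert metadata.two == "node2"
```

The point is that the decoder is a plain function of the raw bytes, so any schema-driven codec can be plugged in without changing the storage layer.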
This is all a bit vague, but I thought I'd open an issue to start discussion on the topic. This is definitely not something we should tackle before 0.5.0 is released; none of it is backwards incompatible with what we have now.
Issue Analytics
- Created 6 years ago
- Comments: 10 (10 by maintainers)
In my mind the consumers vastly outnumber the creators, and they will often be people with limited programming experience. Our job is to make their lives as easy as possible, and to make it easy for them to do the right thing.
This is definitely a non-goal. It’s just too complicated I think, and people can build what they need themselves on top of the raw bytes we’re providing.