Metadata schemas and automatic decoding
The metadata API allows you to store an arbitrary byte string along with nodes, sites and mutations. This lets us store pickled Python objects, for example, which will work fine. However, this isn't great for portability:
- Obviously it won't work in any language other than Python.
- Pickling/unpickling requires the class definition to be in the current namespace, which can get tricky. In practice, this would probably end up causing compatibility headaches across code versions.
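To make the portability point concrete, here is a minimal standard-library comparison (the `record` dict is just a stand-in for real metadata, not an API from msprime):

```python
import json
import pickle

record = {"one": "node1", "two": "node2"}

# Pickle produces Python-specific bytes: only Python, with the right class
# definitions importable, can decode them later.
pickled = pickle.dumps(record)

# JSON produces a plain UTF-8 byte string that any language can parse.
encoded = json.dumps(record).encode()
decoded = json.loads(encoded.decode())
assert decoded == record
```

For plain dicts both round-trip fine; the difference only bites once custom classes or other languages are involved.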
It would be better if we could use some third-party approach for encoding metadata so that we can automatically decode the stored information into a Python object using a given schema. I’m thinking of things like JSON-schema and protobuf. If the schema is also stored in the HDF5 file, then the schema goes along with the data, making it much more likely the data can be interpreted correctly later.
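To sketch what "decoding against a stored schema" buys us, here is a toy validator covering just the required-string-properties subset of JSON-schema. This is purely illustrative: real code would use a library such as jsonschema rather than this hand-rolled `validate` function.

```python
import json

# Toy validator: checks top-level type, required keys, and property types.
# Only the tiny subset of JSON-schema used in this example is supported.
def validate(instance, schema):
    type_map = {"object": dict, "string": str}
    if not isinstance(instance, type_map[schema["type"]]):
        raise ValueError("instance has the wrong type")
    for key in schema.get("required", []):
        if key not in instance:
            raise ValueError(f"missing required property: {key}")
    for key, prop in schema.get("properties", {}).items():
        if key in instance and not isinstance(instance[key], type_map[prop["type"]]):
            raise ValueError(f"property {key} has the wrong type")

schema = json.loads("""{
    "type": "object",
    "properties": {"one": {"type": "string"}, "two": {"type": "string"}},
    "required": ["one", "two"]
}""")
validate({"one": "a", "two": "b"}, schema)  # passes silently
```

Because the schema itself is plain JSON, it can be stored in the file alongside the metadata and re-read by any consumer.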
I’ve done a little initial experimentation with this in #330, which shows how we can use JSON-schema, where the user is responsible for managing the schemas and so on. Here is a sketch of how it might work in the future if we put in some infrastructure for automatically decoding metadata into Python objects via a few pluggable hooks:
```python
import json

import msprime
import python_jsonschema_objects as pjs

# Your schema for custom metadata defined using json-schema
schema = """{
    "title": "Example Metadata",
    "type": "object",
    "properties": {
        "one": {"type": "string"},
        "two": {"type": "string"}
    },
    "required": ["one", "two"]
}"""
# Load this schema into an object builder, so that we can make metadata objects.
# Using https://github.com/cwacek/python-jsonschema-objects
builder = pjs.ObjectBuilder(json.loads(schema))
ns = builder.build_classes()
# Make a new metadata object, and assign it to a row in the nodes table.
metadata = ns.ExampleMetadata(one="node1", two="node2")
encoded = json.dumps(metadata.as_dict()).encode()
nodes.add_row(time=0.125, metadata=encoded)
# Load this table into a tree sequence, and store the schema so it can be
# retrieved later.
ts = msprime.load_tables(nodes=nodes, edges=edges, sequence_length=1)
ts.node_metadata_schema = schema
ts.dump("metadata-example.hdf5")
# Later, on another system, etc., we can load up the metadata and work with it.
ts = msprime.load("metadata-example.hdf5")
schema = json.loads(ts.node_metadata_schema)
builder = pjs.ObjectBuilder(schema)
ns = builder.build_classes()

def decoder(encoded):
    return ns.ExampleMetadata.from_json(encoded.decode())

# If a decoder function is set, accesses to ``node.metadata`` are intercepted
# and the decoder function is used to translate the raw bytes into a Python
# object.
ts.node_metadata_decoder = decoder
node = ts.node(0)
node.metadata.one  # == "node1"
node.metadata.two  # == "node2"
```
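Since the `node_metadata_schema` and `node_metadata_decoder` hooks above don't exist yet, the core encode/decode cycle they rely on can be shown with the standard library alone. The `raw_metadata` bytes and the `SimpleNamespace` result are stand-ins for a real table row and generated class:

```python
import json
from types import SimpleNamespace

# Metadata as it would sit in a table row: an opaque byte string.
raw_metadata = json.dumps({"one": "node1", "two": "node2"}).encode()

# A decoder hook turns the stored bytes back into a convenient Python object
# with attribute access, mimicking what the generated schema class would give.
def decoder(encoded: bytes) -> SimpleNamespace:
    return SimpleNamespace(**json.loads(encoded.decode()))

metadata = decoder(raw_metadata)
assert metadata.one == "node1"
assert metadata.two == "node2"
```

The point is that the decoder is a plain function of the raw bytes, so any schema-driven codec can be plugged in without changing the storage layer.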
This is all a bit vague, but I thought I'd open an issue to start discussion on the topic. This is definitely not something we should tackle before 0.5.0 is released; none of it is backwards incompatible with what we have now.
Issue Analytics
- Created 6 years ago
- Comments: 10 (10 by maintainers)
In my mind the consumers vastly outnumber the creators, and they will often be people with limited programming experience. Our job is to make their lives as easy as possible, and to make it easy for them to do the right thing.
This is definitely a non-goal. It’s just too complicated I think, and people can build what they need themselves on top of the raw bytes we’re providing.