Conventions for keeping track of creation- and modification-dates

An unfortunate aspect of the HDF5 format is that opening a file makes the operating system treat it as modified, even if nothing changed and even if it is opened in read-only mode. On Windows this results in a changed modification date, and on Linux and OSX there is no distinction between modification and creation dates to begin with.

It would be useful to be able to keep track of real changes to loom files. This could be used to trigger automatic updates in various workflows, for example. Different levels of granularity about what was changed would also be useful here. A shared convention would help too, so that modification-detecting scripts from different groups work with each other’s files without much trouble. If the loompy library itself kept track of and updated these modification-tracking attributes, things would be even easier, since people would not have to think about it.

To give a concrete example: to serve data from a loom file to a website, a loom-viewer server must extract the data from the loom file and convert it to JSON. This is a relatively slow process, and on top of that h5py does not like it when an HDF5 file is opened by multiple processes (even in read-only mode). So to mitigate this issue, whenever JSON data is generated it is also cached as a zipped static file. The next time someone requests that data, the static file is served instead of repeating the whole process.

The problems start when the data in the loom file is modified (for example, when a column attribute is added to the loom file). At that point, outdated JSON files have to be replaced. It is currently not possible to detect automatically when to do this; it has to be done manually by whoever is modifying the loom file.

One way around this would be to have a global attribute, or multiple attributes, that are used to keep track of real file changes. Being able to distinguish different kinds of modifications would be nice too. For the loom-viewer the following level of precision is enough:

  • file metadata
  • attributes (global, row and column)
  • the data matrix as a whole

But perhaps other people have a use for more fine-grained checks (detecting which rows were changed, for example).

I would like to hear the thoughts of others on this, and come up with a shared proposal for how to handle this.

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

slinnarsson commented, Jan 24, 2018

Also, I made a function timestamp() in loompy which generates the timestamp in the correct format. Best to use this if you plan on generating dates for comparison with last_modified().

slinnarsson commented, Jan 24, 2018

I implemented modification timestamps as follows:

In the loom file itself

The HDF5 attribute last_modified is set to an ISO8601 timestamp in the UTC timezone in the compact format (e.g. 20180124T100436.901000Z).
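As a rough illustration of that format, here is a standard-library sketch that produces and parses stamps shaped like the example above. The names STAMP_FORMAT, make_stamp and parse_stamp are hypothetical and are not part of loompy; in practice the loompy timestamp() function mentioned above is the one to use.

```python
from datetime import datetime, timezone

# Compact ISO 8601 layout matching the example stamp 20180124T100436.901000Z.
# This format string is an assumption derived from that example.
STAMP_FORMAT = "%Y%m%dT%H%M%S.%fZ"


def make_stamp(moment=None):
    """Format a timezone-aware datetime (default: now) as a compact UTC stamp."""
    if moment is None:
        moment = datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime(STAMP_FORMAT)


def parse_stamp(stamp):
    """Parse a compact stamp back into a timezone-aware UTC datetime."""
    return datetime.strptime(stamp, STAMP_FORMAT).replace(tzinfo=timezone.utc)
```

Because every field is fixed-width and ordered from most to least significant, two such stamps can also be compared directly as strings.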

The last_modified HDF5 attribute is set on:

/ (the root of the file)
/matrix
/layers/{name}
/row_edges
/row_edges/{name}
/col_edges
/col_edges/{name}
/row_attrs
/row_attrs/{name}
/col_attrs
/col_attrs/{name}

The modification timestamp at any level indicates the most recent modification time for any item below it in the HDF5 hierarchy.
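One way to picture this rule is a small model in which a plain dictionary stands in for the HDF5 attributes. This is only a sketch of the propagation idea, not how loompy implements it; the function names and the dictionary layout are hypothetical.

```python
import posixpath


def ancestors(path):
    """Yield the path itself, then each parent, ending at the root '/'."""
    path = posixpath.normpath(path)
    while True:
        yield path
        if path == "/":
            break
        path = posixpath.dirname(path)


def touch(stamps, path, stamp):
    """Record a modification at `path` and propagate it to every ancestor.

    `stamps` maps an HDF5-style path to its last_modified stamp (hypothetical
    stand-in for the real attributes).
    """
    for node in ancestors(path):
        # Keep the newest stamp; fixed-width stamps compare correctly as strings.
        if stamps.get(node, "") < stamp:
            stamps[node] = stamp
```

After touch(stamps, "/col_attrs/Age", stamp), the entries for "/col_attrs/Age", "/col_attrs" and "/" all carry the new stamp, while unrelated branches such as "/matrix" keep their older one.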

In loompy

ds.last_modified(): Modification timestamp for whole file. Will timestamp the file if it doesn’t have a timestamp already.

ds.layers.last_modified(): Timestamp for layers

ds.layers.last_modified(name): Timestamp for specific layer

ds.col_attrs.last_modified(): Timestamp for column attributes

ds.col_attrs.last_modified(name): Timestamp for specific column attribute

And so on for row attrs and graphs.
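For the caching use case from the original question, a whole-file check could look roughly like the sketch below. It assumes that last_modified() returns the compact stamp as a string, and that the cache stores the stamp it was built from; is_cache_stale and cached_stamp are hypothetical names, not loompy API.

```python
def is_cache_stale(ds, cached_stamp):
    """Return True when the file changed after the cache was written.

    `ds` is any object with a last_modified() method that returns a compact
    UTC stamp string (assumed here). `cached_stamp` is the stamp saved
    alongside the cache, or None if no cache exists yet.
    """
    if cached_stamp is None:
        return True
    # Fixed-width stamps sort chronologically when compared as strings.
    return ds.last_modified() > cached_stamp
```

The same comparison would apply at finer granularity by passing an object whose last_modified() reports a single layer or attribute group.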

Finally, you can get a changeset relative to a given timestamp, like so:

ds.get_changes_since(timestamp): returns a dictionary of layers, attributes and graphs that have been modified since the given timestamp. For example:

with loompy.connect("/Users/sten/build_20171205_bak/L5_All.loom") as ds:
    print(ds.get_changes_since("20180124T100436.901000Z"))

Returns

{'row_graphs': [], 'col_graphs': [], 'row_attrs': ['Accession', 'Gene', '_LogCV', '_LogMean', '_Selected', '_Total', '_Valid'], 'col_attrs': ['Age', 'Bucket', 'CellID', 'Class', 'ClassProbability_Astrocyte', 'ClassProbability_Astrocyte,Immune', 'ClassProbability_Astrocyte,Neurons', 'ClassProbability_Astrocyte,Oligos', 'ClassProbability_Astrocyte,Vascular', 'ClassProbability_Bergmann-glia', 'ClassProbability_Blood', 'ClassProbability_Blood,Vascular', 'ClassProbability_Enteric-glia', 'ClassProbability_Enteric-glia,Cycling', 'ClassProbability_Ependymal', 'ClassProbability_Ex-Neurons', 'ClassProbability_Ex-Vascular', 'ClassProbability_Immune', 'ClassProbability_Immune,Neurons', 'ClassProbability_Immune,Oligos', 'ClassProbability_Neurons', 'ClassProbability_Neurons,Cycling', 'ClassProbability_Neurons,Oligos', 'ClassProbability_Neurons,Satellite-glia', 'ClassProbability_Neurons,Vascular', 'ClassProbability_OEC', 'ClassProbability_Oligos', 'ClassProbability_Oligos,Cycling', 'ClassProbability_Oligos,Vascular', 'ClassProbability_Satellite-glia', 'ClassProbability_Satellite-glia,Cycling', 'ClassProbability_Satellite-glia,Schwann', 'ClassProbability_Schwann', 'ClassProbability_Ttr', 'ClassProbability_Vascular', 'ClusterName', 'Clusters', 'Comment', 'Description', 'Developmental_compartment', 'LeafOrder', 'Location_based_on', 'MitoRiboRatio', 'Neurotransmitter', 'OriginalClusters', 'Outliers', 'Probable_location', 'Region', 'SampleID', 'Sex', 'Subclass', 'TaxonomyRank1', 'TaxonomyRank2', 'TaxonomyRank3', 'TaxonomyRank4', 'TaxonomySymbol', 'Taxonomy_group', 'Tissue', '_NGenes', '_Total', '_Valid', '_X', '_Y'], 'layers': ['']}
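A dictionary shaped like this can be turned into a list of cache entries to regenerate. The sketch below is one possible mapping, assuming a hypothetical cache keyed by "kind/name" strings, and assuming that the empty layer name in the output refers to the main matrix; stale_cache_keys is not part of loompy.

```python
def stale_cache_keys(changes):
    """Map a get_changes_since()-style dictionary to cache keys to rebuild.

    The key layout ("col_attrs/Age", "matrix", ...) is a hypothetical
    convention for this example only.
    """
    keys = []
    for kind in ("row_attrs", "col_attrs", "row_graphs", "col_graphs", "layers"):
        for name in changes.get(kind, []):
            if kind == "layers" and name == "":
                # Assumption: the unnamed layer is the main data matrix.
                keys.append("matrix")
            else:
                keys.append(f"{kind}/{name}")
    return keys
```

A server could then delete or rebuild only the cached files for those keys instead of regenerating everything.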