Conventions for keeping track of creation- and modification-dates
An unfortunate aspect of the HDF5 format is that opening a file makes the operating system treat it as modified, even if nothing changed and even if it is opened in read-only mode. On Windows this results in a changed modification date, and on Linux and OSX there is no distinction between modification and creation dates to begin with.
It would be useful to be able to keep track of real changes to loom files. This could be used to trigger automatic updates in various workflows, for example. Different levels of granularity about what was changed would also be useful here. A shared convention would help as well, so that modification-detecting scripts from different groups would work with each other's files without much trouble. If the loompy library itself kept track of and updated these modification-tracking attributes, things would be even easier, since people would not have to think about it.
To give a concrete example: to serve data from a loom file to a website, a loom-viewer server must extract the data from the loom file and convert it to JSON. This is a relatively slow process, and on top of that h5py does not like it when an HDF5 file is opened by multiple processes (even in read-only mode). So to mitigate this issue, whenever JSON data is generated it is also cached as a zipped static file. The next time someone requests that data, the static file is served instead of repeating the whole process.
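The caching step described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the loom-viewer's actual code; the function name `load_or_build`, the `build` callback and the cache path are all hypothetical. Note that this sketch has no invalidation at all: once a cache file exists, it is served forever.

```python
import gzip
import json
from pathlib import Path
from typing import Any, Callable


def load_or_build(cache_file: Path, build: Callable[[], Any]) -> Any:
    """Return cached JSON data if present, otherwise build and cache it.

    `build` stands in for the slow loom-to-JSON extraction (hypothetical).
    """
    if cache_file.exists():
        # Cache hit: read the gzip-compressed JSON back in.
        with gzip.open(cache_file, "rt", encoding="utf-8") as fh:
            return json.load(fh)
    # Cache miss: run the slow extraction, then store the result.
    data = build()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(cache_file, "wt", encoding="utf-8") as fh:
        json.dump(data, fh)
    return data
```

The second call with the same `cache_file` never invokes `build`, which is exactly why a stale cache goes unnoticed when the underlying file changes.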
The problems start if the data in the loom file is modified (for example, when a column attribute is added to the loom file). At this point, existing JSON files that are outdated have to be replaced. It is currently not possible to detect when to do this automatically; it has to be done manually by whoever is modifying the loom file.
One way around this would be to have a global attribute, or multiple attributes, that are used to keep track of real file changes. Being able to distinguish different kinds of modifications would be nice too. For the loom-viewer the following level of precision is enough:
- file metadata
- attributes (global, row and column)
- the data matrix as a whole
But perhaps other people have a use for more fine-grained checks (detecting which rows were changed, for example).
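To make the proposal a little more concrete, here is one way a consumer such as a viewer could use per-category change markers at the three levels listed above. Everything here is an assumption for illustration: the category names, the idea of storing one comparable timestamp string per category, and the function name are not part of any existing convention.

```python
# Hypothetical category names matching the three levels listed above.
CATEGORIES = ("metadata", "attributes", "matrix")


def stale_categories(file_stamps: dict, cache_stamps: dict) -> list:
    """Return the categories whose cached copy is older than the file.

    Both arguments map a category name to a timestamp string that sorts
    chronologically (for example fixed-width ISO 8601 in UTC).
    """
    stale = []
    for category in CATEGORIES:
        current = file_stamps.get(category)
        cached = cache_stamps.get(category)
        # A missing stamp on either side is treated as stale, to be safe.
        if current is None or cached is None or current > cached:
            stale.append(category)
    return stale
```

With this shape, adding a column attribute would only invalidate the cached attribute data, while cached matrix tiles could be kept.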
I would like to hear the thoughts of others on this, and come up with a shared proposal for how to handle this.
Top GitHub Comments
Also, I made a function timestamp() in loompy which generates the timestamp in the correct format. Best to use this if you plan on generating dates for comparison with last_modified().
I implemented modification timestamps as follows:
In the loom file itself
The HDF5 attribute last_modified is set to an ISO8601 timestamp in the UTC timezone in the compact format (e.g. 20180124T100436.901000Z).
The last_modified HDF5 attribute is set on:
The modification timestamp at any level indicates the most recent modification time for any item below it in the HDF5 hierarchy.
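As a rough illustration of the two ideas in this section, the sketch below produces a timestamp with the same shape as the example above and propagates it up a hierarchy so that every parent carries the newest stamp of anything beneath it. It is not loompy's implementation: a plain dict stands in for HDF5 attributes, the format string is an assumption chosen to reproduce the example, and the paths used in the usage lines are only illustrative.

```python
from datetime import datetime, timezone

# Assumed format string; it reproduces the shape 20180124T100436.901000Z.
STAMP_FORMAT = "%Y%m%dT%H%M%S.%fZ"


def format_stamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as a compact UTC ISO 8601 string."""
    return moment.astimezone(timezone.utc).strftime(STAMP_FORMAT)


def mark_modified(stamps: dict, path: str, stamp: str) -> None:
    """Record `stamp` on `path` and on every ancestor up to the root '/'.

    `stamps` is a dict standing in for per-node last_modified attributes.
    """
    parts = [p for p in path.split("/") if p]
    for depth in range(len(parts), -1, -1):
        node = "/" + "/".join(parts[:depth])
        # Keep only the newest stamp; because the strings are fixed-width,
        # plain string comparison matches chronological order.
        if stamp > stamps.get(node, ""):
            stamps[node] = stamp


stamps = {}
mark_modified(stamps, "/layers/spliced", "20180124T100436.901000Z")
mark_modified(stamps, "/col_attrs/Age", "20180101T000000.000000Z")
```

After these two calls the root keeps the later of the two stamps, while `/col_attrs` keeps the earlier one, which is the "most recent modification of anything below" rule stated above.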
In loompy
- ds.last_modified(): Modification timestamp for whole file. Will timestamp the file if it doesn't have a timestamp already.
- ds.layers.last_modified(): Timestamp for layers
- ds.layers.last_modified(name): Timestamp for specific layer
- ds.col_attrs.last_modified(): Timestamp for column attributes
- ds.col_attrs.last_modified(name): Timestamp for specific column attribute
And so on for row attrs and graphs.
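Because each level summarizes everything below it, a consumer can check the coarse timestamp first and only drill down when something changed. The sketch below shows that pattern using the accessor names listed above. The `FakeDataset` and `FakeLayers` classes are stand-ins written for this example so it runs without a loom file; they are not loompy code, and the assumption that the returned timestamps are directly comparable strings is based on the compact format described earlier.

```python
class FakeLayers:
    """Stand-in for ds.layers (NOT loompy code); holds one stamp per layer."""

    def __init__(self, stamps):
        self._stamps = stamps

    def last_modified(self, name=None):
        # Without a name: newest stamp across all layers.
        if name is None:
            return max(self._stamps.values())
        return self._stamps[name]


class FakeDataset:
    """Stand-in for a loompy connection exposing only what the sketch needs."""

    def __init__(self, layer_stamps):
        self.layers = FakeLayers(layer_stamps)

    def last_modified(self):
        return self.layers.last_modified()


def changed_layers(ds, layer_names, cached_at):
    """Return the layers modified after `cached_at`, checking top-down."""
    # The file-level stamp covers everything below it, so an unchanged
    # file needs no further inspection.
    if ds.last_modified() <= cached_at:
        return []
    if ds.layers.last_modified() <= cached_at:
        return []
    return [n for n in layer_names if ds.layers.last_modified(n) > cached_at]


ds = FakeDataset({
    "spliced": "20180124T100436.901000Z",
    "unspliced": "20180101T000000.000000Z",
})
to_rebuild = changed_layers(ds, ["spliced", "unspliced"], "20180110T000000.000000Z")
```

Here only the "spliced" layer is newer than the cache time, so only its cached data would need regenerating.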
Finally, you can get a changeset relative to a given timestamp, like so:
ds.get_changes_since(timestamp): returns a dictionary of layers, attributes and graphs that have been modified since the given timestamp. For example:
Returns