Fix netcdf string compression
Just opening an issue to record my experiments for reference.
With the current way we save strings, zlib compression doesn't work. I was able to make it work by saving strings the old way, as a fixed-length character array:
import netCDF4 as netcdf
import numpy as np

# Encode the string as a fixed-length character array so zlib can compress it.
# The dtype must be wide enough for the whole string ('S4' would truncate it to 4 bytes).
str_out = netcdf.stringtochar(np.array(python_string, dtype='S%d' % len(python_string)))
dimension = len(str_out)
dimension_name = 'string' + str(dimension)
if dimension_name not in self._storage_dict[storage].dimensions:
    self._storage_dict[storage].createDimension(dimension_name, dimension)
nc_variable = self._storage_dict[storage].createVariable(name, 'S1', (dimension_name,), zlib=True)
nc_variable[:] = str_out
With this code, the resulting nc file that contained the serialization of an alchemically modified Src got reduced from 52M to 32K!!
I didn't test the restoring, so we may have to tweak it a little (a sketch of what the restore path could look like is below). Also, with this implementation it may be impossible to update strings, since I'm not sure you can change the dimension of a variable after its creation. The point is that maybe netcdf simply doesn't support compression of the new variable-length str type variables.
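For reference, here is a minimal, untested sketch of the restore path, assuming the layout written by the snippet above (the function name read_string and its arguments are placeholders); chartostring is the netCDF4 inverse of stringtochar:

import netCDF4 as netcdf

def read_string(nc_dataset, name):
    # Read the 'S1' character array back and join it into a single Python string.
    char_array = nc_dataset.variables[name][:]
    return str(netcdf.chartostring(char_array))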
Worst-case scenario, we could permanently save the expensive bits (thermodynamic states, topology, …) as compressed character arrays, and save as vlen strings the things that need to be updated (i.e., MCMCMove statistics).
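A rough sketch of that mixed strategy might look like the following (untested; the file and variable names are placeholders, and the vlen str variable is left uncompressed since that's exactly what zlib can't handle):

import netCDF4 as netcdf
import numpy as np

ds = netcdf.Dataset('storage.nc', 'w')
ds.createDimension('scalar', 1)

# Write-once payload (e.g. a serialized thermodynamic state): fixed char array + zlib.
payload = 'long serialized thermodynamic state ...'
dim_name = 'string%d' % len(payload)
ds.createDimension(dim_name, len(payload))
fixed = ds.createVariable('state', 'S1', (dim_name,), zlib=True)
fixed[:] = netcdf.stringtochar(np.array(payload, dtype='S%d' % len(payload)))

# Mutable field (e.g. MCMCMove statistics): vlen str, uncompressed but freely updatable.
mutable = ds.createVariable('move_statistics', str, ('scalar',))
mutable[0] = '{"n_accepted": 0}'
mutable[0] = '{"n_accepted": 10, "n_proposed": 20}'  # length can change on update

ds.close()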
Top GitHub Comments
After adding both the OpenMMTools and the YANK fixes, here are some numbers for an explicit solvent setup of FXR with anisotropic dispersion correction, after 0 iterations (so this is the initial file size):
Speed of serialization step:
- Pre-comp: 195 s
- Post-comp: 49 s
- Factor: ~4x
I think this makes strong progress towards fixing this. The file sizes are still a bit large, but if that's acceptable, we can probably close this issue.
My thoughts exactly.