Error saving results in run_regression
Hi,
Thanks very much for developing this nice tool.
I am following the tutorial to estimate cell type signatures from my own scRNA-seq dataset, but I am running into an error when calling the run_regression
function. Everything looks fine: the epoch vs. ELBO loss plots are generated, as are the UMI count plots. However, I get this error when the results are being saved:
### Saving results ###
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
~/.conda/envs/cellpymc/lib/python3.7/site-packages/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
187 try:
--> 188 return func(elem, key, val, *args, **kwargs)
189 except Exception as e:
~/.conda/envs/cellpymc/lib/python3.7/site-packages/anndata/_io/h5ad.py in write_dataframe(f, key, df, dataset_kwargs)
257 group.attrs["encoding-version"] = EncodingVersions.dataframe.value
--> 258 group.attrs["column-order"] = list(df.columns)
259
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
~/.conda/envs/cellpymc/lib/python3.7/site-packages/h5py/_hl/attrs.py in __setitem__(self, name, value)
102 """
--> 103 self.create(name, data=value)
104
~/.conda/envs/cellpymc/lib/python3.7/site-packages/h5py/_hl/attrs.py in create(self, name, data, shape, dtype)
196 try:
--> 197 attr = h5a.create(self._id, self._e(tempname), htype, space)
198 except:
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5a.pyx in h5py.h5a.create()
RuntimeError: Unable to create attribute (object header message is too large)
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-601-cf8487f6980c> in <module>
30 export_args={'path': results_folder + 'regression_model/', # where to save results
31 'save_model': True, #save pytorch model?
---> 32 'run_name_suffix': ''})
33
34 reg_mod = r['mod']
~/.conda/envs/cellpymc/lib/python3.7/site-packages/cell2location/run_regression.py in run_regression(sc_data, model_name, verbose, return_all, train_args, model_kwargs, posterior_args, export_args)
325
326 # save anndata with exported posterior
--> 327 sc_data.write(filename=path + 'sc.h5ad', compression='gzip')
328
329 # save model object and related annotations
~/.conda/envs/cellpymc/lib/python3.7/site-packages/anndata/_core/anndata.py in write_h5ad(self, filename, compression, compression_opts, force_dense, as_dense)
1848 compression_opts=compression_opts,
1849 force_dense=force_dense,
-> 1850 as_dense=as_dense,
1851 )
1852
~/.conda/envs/cellpymc/lib/python3.7/site-packages/anndata/_io/h5ad.py in write_h5ad(filepath, adata, force_dense, as_dense, dataset_kwargs, **kwargs)
115 )
116 else:
--> 117 write_attribute(f, "raw", adata.raw, dataset_kwargs=dataset_kwargs)
118 write_attribute(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
119 write_attribute(f, "var", adata.var, dataset_kwargs=dataset_kwargs)
~/.conda/envs/cellpymc/lib/python3.7/functools.py in wrapper(*args, **kw)
838 '1 positional argument')
839
--> 840 return dispatch(args[0].__class__)(*args, **kw)
841
842 funcname = getattr(func, '__name__', 'singledispatch function')
~/.conda/envs/cellpymc/lib/python3.7/site-packages/anndata/_io/h5ad.py in write_attribute_h5ad(f, key, value, *args, **kwargs)
137 if key in f:
138 del f[key]
--> 139 _write_method(type(value))(f, key, value, *args, **kwargs)
140
141
~/.conda/envs/cellpymc/lib/python3.7/site-packages/anndata/_io/h5ad.py in write_raw(f, key, value, dataset_kwargs)
146 group.attrs["shape"] = value.shape
147 write_attribute(f, "raw/X", value.X, dataset_kwargs=dataset_kwargs)
--> 148 write_attribute(f, "raw/var", value.var, dataset_kwargs=dataset_kwargs)
149 write_attribute(f, "raw/varm", value.varm, dataset_kwargs=dataset_kwargs)
150
~/.conda/envs/cellpymc/lib/python3.7/functools.py in wrapper(*args, **kw)
838 '1 positional argument')
839
--> 840 return dispatch(args[0].__class__)(*args, **kw)
841
842 funcname = getattr(func, '__name__', 'singledispatch function')
~/.conda/envs/cellpymc/lib/python3.7/site-packages/anndata/_io/h5ad.py in write_attribute_h5ad(f, key, value, *args, **kwargs)
137 if key in f:
138 del f[key]
--> 139 _write_method(type(value))(f, key, value, *args, **kwargs)
140
141
~/.conda/envs/cellpymc/lib/python3.7/site-packages/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
193 f"Above error raised while writing key {key!r} of {type(elem)}"
194 f" from {parent}."
--> 195 ) from e
196
197 return func_wrapper
RuntimeError: Unable to create attribute (object header message is too large)
Above error raised while writing key 'raw/var' of <class 'h5py._hl.files.File'> from /.
So far, I have found something like this:
HDF5 has a 64 KB header limit for all column metadata (names, types, etc.). Above roughly 2,000 columns you run out of space to store the metadata. This is a fundamental limitation of pytables; I don't think they will work around it on their side any time soon. You will either have to split the table up or choose another storage format.
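For context, this limit can be reproduced directly with h5py (a minimal sketch; the file name and column-name pattern are illustrative):

```python
import h5py
import numpy as np

# HDF5 keeps attributes in the object header, which is capped at ~64 KB.
# anndata writes a DataFrame's column names as a single "column-order"
# attribute, so thousands of columns overflow the header.
with h5py.File("demo.h5", "w") as f:
    names = np.array([f"column_{i:06d}".encode() for i in range(10_000)])
    f.attrs["column-order"] = names
    # -> RuntimeError: Unable to create attribute (object header message is too large)
```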
Do you have any idea why the object header could become so long?
Best regards and thank you very much, Alberto.
Top GitHub Comments
Hi
Not a c2l dev but had similar issues. The way it's designed, each cell type in the reference gets multiple accompanying columns, so the number of columns can grow very rapidly with the number of cell types. I don't think the designers of h5ad anticipated that 😃
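You can check how wide the tables have grown before writing (a quick sketch; assumes your AnnData object is called sc_data and has .raw set, as in the traceback above):

```python
# Count the columns whose names must all fit into the single HDF5
# "column-order" attribute, and estimate the header space they need.
print("var columns:    ", sc_data.var.shape[1])
print("raw.var columns:", sc_data.raw.var.shape[1])
print(sum(len(c) for c in sc_data.raw.var.columns), "bytes of raw.var column names")
```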
I also found the h5ad export slow and sometimes prone to crashing for large files, so I ended up using
joblib.dump
(a variant of pickle.dump with the added benefit of compression, which helps a lot with file size). I can share the code, but in run_c2l.py, wherever you see a write call you can use joblib.dump for the export (see the sketch below).
Similarly, some of the .csv files it exports can be useful but also slow things down a lot for large spatial data / many cell types (and they're not compressed), so if you're running large numbers of cell types, it might be worth removing that part of the code and regenerating those files as needed.
For what it’s worth, I’ve found it very handy to run a small 5-iteration model all the way through for new data first, before doing the full model, so it doesn’t train the whole thing and then quit before saving
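A sketch of such a smoke test, assuming train_args accepts 'n_iter' as in the cell2location tutorials (the import path mirrors the traceback; argument names and paths here are illustrative):

```python
from cell2location.run_regression import run_regression  # module path as in the traceback

# Tiny end-to-end run: a handful of iterations with the same export settings,
# so any save-time failure (like the HDF5 one above) surfaces in minutes.
r = run_regression(
    sc_data,
    train_args={'n_iter': 5},  # assumed key, per the tutorials
    export_args={'path': results_folder + 'regression_model_test/',
                 'save_model': False,
                 'run_name_suffix': ''},
)
```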
Oh, just to add: if you go with joblib exporting, you'll also have to change the importing steps from reading the h5ad to joblib.load.
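A minimal sketch of that swap (assuming sc_data is the AnnData object and path is the export folder, as in the traceback above):

```python
import joblib

# Export: replaces sc_data.write(filename=path + 'sc.h5ad', compression='gzip').
# A compressed pickle involves no per-column HDF5 attributes, so the header
# limit never comes into play.
joblib.dump(sc_data, path + 'sc.joblib', compress=3)

# Import: replaces anndata.read_h5ad(path + 'sc.h5ad').
sc_data = joblib.load(path + 'sc.joblib')
```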
Added informative errors in https://github.com/BayraktarLab/cell2location/commit/d4c4e7f0407af6c9c19a64334e094dafcfe3114e and https://github.com/BayraktarLab/cell2location/commit/8eec6da0f7c6c4ef2025c8c0682af6df90e45312