Support for nullable bool, int in dataframes
What needs to happen
Support for nullable dtypes during IO. Allow for writing pandas string, integer, and boolean arrays (which can have null values) by saving a “null” mask along with them.
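A minimal sketch of the "values plus null mask" idea, using only pandas and NumPy. `split_masked` is a hypothetical helper for illustration, not an anndata function, and this does not decide how the two arrays would be laid out on disk. It only shows that a nullable array can be reduced to two plain NumPy arrays (which HDF5 can store) and rebuilt from them.

```python
import numpy as np
import pandas as pd


def split_masked(arr, numpy_dtype, fill_value):
    """Split a pandas nullable array into plain (values, mask) NumPy arrays.

    Hypothetical helper: `fill_value` is an arbitrary placeholder for null slots.
    """
    mask = np.asarray(arr.isna(), dtype=bool)
    # Null slots get the placeholder; the mask records that they are null.
    values = arr.to_numpy(dtype=numpy_dtype, na_value=fill_value)
    return values, mask


bools = pd.array([True, None, False], dtype="boolean")
ints = pd.array([1, None, 3], dtype="Int64")

bool_values, bool_mask = split_masked(bools, bool, False)
int_values, int_mask = split_masked(ints, "int64", 0)

# Rebuild the nullable arrays from the two plain arrays, as a reader would.
bools_back = pd.arrays.BooleanArray(bool_values, bool_mask)
ints_back = pd.arrays.IntegerArray(int_values, int_mask)
```

Both `bool_values` and `int_values` have native NumPy dtypes, so neither hits the object-dtype path that h5py rejects.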
Example
import anndata as ad, pandas as pd, numpy as np
a = ad.AnnData(np.ones((3, 3)))
# Works fine
a.obs["np_bool"] = np.zeros(3, dtype=bool)
a.write("tmp.h5ad")
# Errors at write
a.obs["pd_bool"] = a.obs["np_bool"].astype(pd.BooleanDtype())
a.write("tmp.h5ad")
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
Above error raised while writing key 'pd_bool' of <class 'h5py._hl.group.Group'> from /.
Above error raised while writing key 'obs' of <class 'h5py._hl.files.File'> from /.
Full traceback
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
208 try:
--> 209 return func(elem, key, val, *args, **kwargs)
210 except Exception as e:
~/github/anndata/anndata/_io/h5ad.py in write_series(group, key, series, dataset_kwargs)
290 else:
--> 291 group[key] = series.values
292
/usr/local/lib/python3.8/site-packages/h5py/_hl/group.py in __setitem__(self, name, obj)
410 else:
--> 411 ds = self.create_dataset(None, data=obj)
412 h5o.link(ds.id, self.id, name, lcpl=lcpl)
/usr/local/lib/python3.8/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
147
--> 148 dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
149 dset = dataset.Dataset(dsid)
/usr/local/lib/python3.8/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, allow_unknown_filter)
88 dtype = numpy.dtype(dtype)
---> 89 tid = h5t.py_create(dtype, logical=1)
90
h5py/h5t.pyx in h5py.h5t.py_create()
h5py/h5t.pyx in h5py.h5t.py_create()
h5py/h5t.pyx in h5py.h5t.py_create()
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
208 try:
--> 209 return func(elem, key, val, *args, **kwargs)
210 except Exception as e:
~/github/anndata/anndata/_io/h5ad.py in write_dataframe(f, key, df, dataset_kwargs)
264 for col_name, (_, series) in zip(col_names, df.items()):
--> 265 write_series(group, col_name, series, dataset_kwargs=dataset_kwargs)
266
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
211 parent = _get_parent(elem)
--> 212 raise type(e)(
213 f"{e}\n\n"
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
Above error raised while writing key 'pd_bool' of <class 'h5py._hl.group.Group'> from /.
The above exception was the direct cause of the following exception:
TypeError Traceback (most recent call last)
<ipython-input-13-32812d0f937a> in <module>
1 a.obs["pd_bool"] = a.obs["np_bool"].astype(pd.BooleanDtype())
----> 2 a.write("tmp.h5ad")
~/github/anndata/anndata/_core/anndata.py in write_h5ad(self, filename, compression, compression_opts, force_dense, as_dense)
1877 filename = self.filename
1878
-> 1879 _write_h5ad(
1880 Path(filename),
1881 self,
~/github/anndata/anndata/_io/h5ad.py in write_h5ad(filepath, adata, force_dense, as_dense, dataset_kwargs, **kwargs)
109 else:
110 write_attribute(f, "raw", adata.raw, dataset_kwargs=dataset_kwargs)
--> 111 write_attribute(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
112 write_attribute(f, "var", adata.var, dataset_kwargs=dataset_kwargs)
113 write_attribute(f, "obsm", adata.obsm, dataset_kwargs=dataset_kwargs)
/usr/local/Cellar/python@3.8/3.8.6_2/Frameworks/Python.framework/Versions/3.8/lib/python3.8/functools.py in wrapper(*args, **kw)
873 '1 positional argument')
874
--> 875 return dispatch(args[0].__class__)(*args, **kw)
876
877 funcname = getattr(func, '__name__', 'singledispatch function')
~/github/anndata/anndata/_io/h5ad.py in write_attribute_h5ad(f, key, value, *args, **kwargs)
130 if key in f:
131 del f[key]
--> 132 _write_method(type(value))(f, key, value, *args, **kwargs)
133
134
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
210 except Exception as e:
211 parent = _get_parent(elem)
--> 212 raise type(e)(
213 f"{e}\n\n"
214 f"Above error raised while writing key {key!r} of {type(elem)}"
TypeError: Object dtype dtype('O') has no native HDF5 equivalent
Above error raised while writing key 'pd_bool' of <class 'h5py._hl.group.Group'> from /.
Above error raised while writing key 'obs' of <class 'h5py._hl.files.File'> from /.
I have a report from the wild where writing worked in this situation, but reading the file (by cellxgene) failed.
Issue Analytics
- State:
- Created 3 years ago
- Comments:10 (5 by maintainers)
I generally convert all problematic variables to strings
obs['x'].astype(str)
I am wondering what the progress on this issue is. It is very annoying when analysis results don't get saved after several hours of work on HPC because a new column popped up with an unsaveable object type (in a script that worked just fine the other day, so there seemed to be no need to test for saveability). So I would really appreciate it if this is addressed.
Maybe you can do a temporary workaround that converts such objects to strings with a warning?
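A rough sketch of what such a workaround could look like on the user side, assuming the problematic columns are the pandas nullable boolean and integer dtypes discussed in this issue. `stringify_nullable_columns` and `NULLABLE_DTYPES` are hypothetical names, not part of anndata, and other unsaveable column types would need their own handling.

```python
import warnings

import pandas as pd

# Assumption: only the nullable bool/int dtypes from this issue are targeted.
NULLABLE_DTYPES = (
    pd.BooleanDtype,
    pd.Int8Dtype, pd.Int16Dtype, pd.Int32Dtype, pd.Int64Dtype,
    pd.UInt8Dtype, pd.UInt16Dtype, pd.UInt32Dtype, pd.UInt64Dtype,
)


def stringify_nullable_columns(df):
    """Return a copy of df with nullable bool/int columns cast to str.

    Emits one warning per converted column; the input frame is not modified.
    """
    out = df.copy()
    for name in out.columns:
        dtype = out[name].dtype
        if isinstance(dtype, NULLABLE_DTYPES):
            warnings.warn(
                f"Column {name!r} has dtype {dtype}; converting to str before writing.",
                UserWarning,
            )
            out[name] = out[name].astype(str)
    return out


# Hypothetical usage before saving:
#   adata.obs = stringify_nullable_columns(adata.obs)
#   adata.write("tmp.h5ad")
```

This trades type information for saveability: the column comes back as strings, and how missing values are represented after `astype(str)` should be checked on your pandas version before relying on it.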