Cellxgene schema 3.0.0 breaks dataloader
Hi there, I’m trying to use the data loader to access cellxgene collections (following the tutorial). I’ve run this before without problems, but now it throws a new KeyError. I think this is because cellxgene changed their ethnicity metadata column to self_reported_ethnicity.
To Reproduce
import anndata
import os
import sfaira
cache_path = os.path.join(".", "data")
dsg = sfaira.data.dataloaders.databases.DatasetSuperGroupDatabases(data_path=cache_path, cache_metadata=True)
Traceback:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Input In [5], in <cell line: 6>()
3 import sfaira
5 cache_path = os.path.join(".", "data2")
----> 6 dsg = sfaira.data.dataloaders.databases.DatasetSuperGroupDatabases(data_path=cache_path, cache_metadata=True)
File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/super_group.py:17, in DatasetSuperGroupDatabases.__init__(self, data_path, meta_path, cache_path, cache_metadata)
9 def __init__(
10 self,
11 data_path: Union[str, None] = None,
(...)
14 cache_metadata: bool = False,
15 ):
16 dataset_super_groups = [
---> 17 DatasetSuperGroupCellxgene(
18 data_path=data_path,
19 meta_path=meta_path,
20 cache_path=cache_path,
21 cache_metadata=cache_metadata,
22 ),
23 ]
24 super().__init__(dataset_groups=dataset_super_groups)
File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_group.py:83, in DatasetSuperGroupCellxgene.__init__(self, data_path, meta_path, cache_path, cache_metadata, verbose)
81 print("WARNING: Zero cellxgene collections retrieved.")
82 # Note that the collection itself is not passed to DatasetGroupCellxgene but only the ID string.
---> 83 dataset_groups = [
84 DatasetGroupCellxgene(
85 collection_id=x["id"],
86 data_path=data_path,
87 meta_path=meta_path,
88 cache_path=cache_path,
89 cache_metadata=cache_metadata,
90 verbose=verbose,
91 )
92 for x in collections
93 ]
94 super().__init__(dataset_groups=dataset_groups)
File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_group.py:84, in <listcomp>(.0)
81 print("WARNING: Zero cellxgene collections retrieved.")
82 # Note that the collection itself is not passed to DatasetGroupCellxgene but only the ID string.
83 dataset_groups = [
---> 84 DatasetGroupCellxgene(
85 collection_id=x["id"],
86 data_path=data_path,
87 meta_path=meta_path,
88 cache_path=cache_path,
89 cache_metadata=cache_metadata,
90 verbose=verbose,
91 )
92 for x in collections
93 ]
94 super().__init__(dataset_groups=dataset_groups)
File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_group.py:33, in DatasetGroupCellxgene.__init__(self, collection_id, data_path, meta_path, cache_path, cache_metadata, verbose)
31 loader_pydoc_path_sfaira = "sfaira.data.dataloaders.databases.cellxgene.cellxgene_loader"
32 load_func = pydoc.locate(loader_pydoc_path_sfaira + ".load")
---> 33 datasets = [
34 Dataset(
35 collection_id=collection_id,
36 data_path=data_path,
37 meta_path=meta_path,
38 cache_path=cache_path,
39 load_func=load_func,
40 sample_fn=x,
41 sample_fns=dataset_ids,
42 cache_metadata=cache_metadata,
43 verbose=verbose,
44 )
45 for x in dataset_ids
46 ]
47 keys = [x.id for x in datasets]
48 super().__init__(datasets=dict(zip(keys, datasets)), collection_id=collection_id)
File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_group.py:34, in <listcomp>(.0)
31 loader_pydoc_path_sfaira = "sfaira.data.dataloaders.databases.cellxgene.cellxgene_loader"
32 load_func = pydoc.locate(loader_pydoc_path_sfaira + ".load")
33 datasets = [
---> 34 Dataset(
35 collection_id=collection_id,
36 data_path=data_path,
37 meta_path=meta_path,
38 cache_path=cache_path,
39 load_func=load_func,
40 sample_fn=x,
41 sample_fns=dataset_ids,
42 cache_metadata=cache_metadata,
43 verbose=verbose,
44 )
45 for x in dataset_ids
46 ]
47 keys = [x.id for x in datasets]
48 super().__init__(datasets=dict(zip(keys, datasets)), collection_id=collection_id)
File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_loader.py:107, in Dataset.__init__(self, collection_id, data_path, meta_path, cache_path, load_func, dict_load_func_annotation, yaml_path, sample_fn, sample_fns, additional_annotation_key, cache_metadata, verbose, **kwargs)
105 reordered_keys = ["organism"] + [x for x in self._adata_ids_cellxgene.dataset_keys if x != "organism"]
106 for k in reordered_keys:
--> 107 val = self._collection_dataset[getattr(self._adata_ids_cellxgene, k)]
108 # Unique label if list is length 1:
109 # Otherwise do not set property and resort to cell-wise labels.
110 v_clean = clean_cellxgene_meta_uns(k=k, val=val, adata_ids=self._adata_ids_cellxgene)
KeyError: 'ethnicity'
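The failing line 107 in the traceback amounts to indexing a mapping with a key that schema 3.0.0 metadata no longer carries. A minimal sketch of that failure mode, assuming the collection metadata behaves like a plain dict; `dataset_meta_v3` and `lookup` are hypothetical stand-ins for illustration, not sfaira code:

```python
# Hypothetical stand-in for one dataset entry as served under schema 3.0.0:
# the old "ethnicity" key is gone, "self_reported_ethnicity" takes its place.
dataset_meta_v3 = {
    "organism": [{"label": "Homo sapiens"}],
    "self_reported_ethnicity": [{"label": "unknown"}],
}


def lookup(meta, key):
    """Index the metadata directly, as the failing line does."""
    return meta[key]


# Asking for the schema 2.x key name raises KeyError on schema 3.x metadata.
try:
    lookup(dataset_meta_v3, "ethnicity")
    raised = False
    missing_key = None
except KeyError as err:
    raised = True
    missing_key = err.args[0]

print(raised, missing_key)
```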
System:
- sfaira version: v0.3.12
- OS: Ubuntu 20.04.1 LTS
- Python 3.10.6
- Virtual environment: Conda
Issue Analytics
- State:
- Created a year ago
- Comments:5 (4 by maintainers)
Top GitHub Comments
Update: I expect to merge the fix into dev this week.
The fix is now merged into dev and seems to be working. We will need to be careful when using it with existing schema version 2 datasets, but it should work well with data downloaded entirely under version 3. Let me know if any more issues come up, especially in continued work with version 2. I will hold off on releasing until it is clear that this is stable across all applications.
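One way to tolerate both schema version 2 and version 3 metadata is to try the new column name first and fall back to the old one. The sketch below only illustrates that idea and is not the fix merged into dev; `ETHNICITY_KEYS` and `get_first_present` are hypothetical names, and the two key strings are taken from the error message and the issue description above.

```python
from typing import Any, Mapping, Sequence

# Assumed key names: "self_reported_ethnicity" (schema 3.x, from the issue text)
# and "ethnicity" (schema 2.x, from the KeyError in the traceback).
ETHNICITY_KEYS = ("self_reported_ethnicity", "ethnicity")


def get_first_present(
    meta: Mapping[str, Any],
    candidates: Sequence[str],
    default: Any = None,
) -> Any:
    """Return the value of the first candidate key found in meta, else default."""
    for key in candidates:
        if key in meta:
            return meta[key]
    return default


# Hypothetical metadata entries under each schema version.
meta_v2 = {"ethnicity": [{"label": "unknown"}]}
meta_v3 = {"self_reported_ethnicity": [{"label": "unknown"}]}

value_v2 = get_first_present(meta_v2, ETHNICITY_KEYS)
value_v3 = get_first_present(meta_v3, ETHNICITY_KEYS)
value_missing = get_first_present({}, ETHNICITY_KEYS, default=[])
```

Listing the version 3 name first means that metadata carrying both keys resolves to the newer column.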