Cellxgene schema 3.0.0 breaks dataloader

Hi there, I’m trying to use the data loader to access cellxgene collections (following the tutorial). I’ve run this before without problems, but now it throws a new KeyError. I think this is because cellxgene renamed their ethnicity metadata column to self_reported_ethnicity.

To Reproduce

import anndata
import os
import sfaira

cache_path = os.path.join(".", "data")
dsg = sfaira.data.dataloaders.databases.DatasetSuperGroupDatabases(data_path=cache_path, cache_metadata=True)

Traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [5], in <cell line: 6>()
      3 import sfaira
      5 cache_path = os.path.join(".", "data2")
----> 6 dsg = sfaira.data.dataloaders.databases.DatasetSuperGroupDatabases(data_path=cache_path, cache_metadata=True)

File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/super_group.py:17, in DatasetSuperGroupDatabases.__init__(self, data_path, meta_path, cache_path, cache_metadata)
      9 def __init__(
     10         self,
     11         data_path: Union[str, None] = None,
   (...)
     14         cache_metadata: bool = False,
     15 ):
     16     dataset_super_groups = [
---> 17         DatasetSuperGroupCellxgene(
     18             data_path=data_path,
     19             meta_path=meta_path,
     20             cache_path=cache_path,
     21             cache_metadata=cache_metadata,
     22         ),
     23     ]
     24     super().__init__(dataset_groups=dataset_super_groups)

File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_group.py:83, in DatasetSuperGroupCellxgene.__init__(self, data_path, meta_path, cache_path, cache_metadata, verbose)
     81     print("WARNING: Zero cellxgene collections retrieved.")
     82 # Note that the collection itself is not passed to DatasetGroupCellxgene but only the ID string.
---> 83 dataset_groups = [
     84     DatasetGroupCellxgene(
     85         collection_id=x["id"],
     86         data_path=data_path,
     87         meta_path=meta_path,
     88         cache_path=cache_path,
     89         cache_metadata=cache_metadata,
     90         verbose=verbose,
     91     )
     92     for x in collections
     93 ]
     94 super().__init__(dataset_groups=dataset_groups)

File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_group.py:84, in <listcomp>(.0)
     81     print("WARNING: Zero cellxgene collections retrieved.")
     82 # Note that the collection itself is not passed to DatasetGroupCellxgene but only the ID string.
     83 dataset_groups = [
---> 84     DatasetGroupCellxgene(
     85         collection_id=x["id"],
     86         data_path=data_path,
     87         meta_path=meta_path,
     88         cache_path=cache_path,
     89         cache_metadata=cache_metadata,
     90         verbose=verbose,
     91     )
     92     for x in collections
     93 ]
     94 super().__init__(dataset_groups=dataset_groups)

File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_group.py:33, in DatasetGroupCellxgene.__init__(self, collection_id, data_path, meta_path, cache_path, cache_metadata, verbose)
     31 loader_pydoc_path_sfaira = "sfaira.data.dataloaders.databases.cellxgene.cellxgene_loader"
     32 load_func = pydoc.locate(loader_pydoc_path_sfaira + ".load")
---> 33 datasets = [
     34     Dataset(
     35         collection_id=collection_id,
     36         data_path=data_path,
     37         meta_path=meta_path,
     38         cache_path=cache_path,
     39         load_func=load_func,
     40         sample_fn=x,
     41         sample_fns=dataset_ids,
     42         cache_metadata=cache_metadata,
     43         verbose=verbose,
     44     )
     45     for x in dataset_ids
     46 ]
     47 keys = [x.id for x in datasets]
     48 super().__init__(datasets=dict(zip(keys, datasets)), collection_id=collection_id)

File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_group.py:34, in <listcomp>(.0)
     31 loader_pydoc_path_sfaira = "sfaira.data.dataloaders.databases.cellxgene.cellxgene_loader"
     32 load_func = pydoc.locate(loader_pydoc_path_sfaira + ".load")
     33 datasets = [
---> 34     Dataset(
     35         collection_id=collection_id,
     36         data_path=data_path,
     37         meta_path=meta_path,
     38         cache_path=cache_path,
     39         load_func=load_func,
     40         sample_fn=x,
     41         sample_fns=dataset_ids,
     42         cache_metadata=cache_metadata,
     43         verbose=verbose,
     44     )
     45     for x in dataset_ids
     46 ]
     47 keys = [x.id for x in datasets]
     48 super().__init__(datasets=dict(zip(keys, datasets)), collection_id=collection_id)

File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_loader.py:107, in Dataset.__init__(self, collection_id, data_path, meta_path, cache_path, load_func, dict_load_func_annotation, yaml_path, sample_fn, sample_fns, additional_annotation_key, cache_metadata, verbose, **kwargs)
    105 reordered_keys = ["organism"] + [x for x in self._adata_ids_cellxgene.dataset_keys if x != "organism"]
    106 for k in reordered_keys:
--> 107     val = self._collection_dataset[getattr(self._adata_ids_cellxgene, k)]
    108     # Unique label if list is length 1:
    109     # Otherwise do not set property and resort to cell-wise labels.
    110     v_clean = clean_cellxgene_meta_uns(k=k, val=val, adata_ids=self._adata_ids_cellxgene)

KeyError: 'ethnicity'
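
The failing line looks up each expected metadata key in the collection's dataset record, so a renamed field surfaces as a KeyError. The following is a minimal sketch of how one might check which expected keys are absent. It uses plain dictionaries with made-up contents; `expected_keys`, `dataset_record`, and `missing_keys` are hypothetical names, not sfaira objects or the real API payload.

```python
# Hypothetical stand-ins: these are not sfaira's real data structures.
expected_keys = ["organism", "ethnicity", "tissue"]

# Illustrative record as it might look under schema 3.0.0, where the field
# is named "self_reported_ethnicity" instead of "ethnicity".
dataset_record = {
    "organism": [{"label": "Homo sapiens"}],
    "self_reported_ethnicity": [{"label": "unknown"}],
    "tissue": [{"label": "lung"}],
}


def missing_keys(expected, record):
    """Return the expected keys that are absent from the record, in order."""
    return [k for k in expected if k not in record]


print(missing_keys(expected_keys, dataset_record))
```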

System:

sfaira version: v0.3.12
OS: Ubuntu 20.04.1 LTS
Python 3.10.6
Virtual environment: Conda

Issue Analytics

Created a year ago
Comments: 5 (4 by maintainers)

Top GitHub Comments

davidsebfischer commented, Nov 21, 2022

Update: I expect to merge the fix into dev this week.

davidsebfischer commented, Nov 22, 2022

The fix is now merged into dev and seems to be working. We will need to be careful when using it with existing schema version 2 data sets, but it should work well with data downloaded entirely under version 3. Let me know if any more issues come up, especially in continued work with version 2. I will hold off on releasing until it is clear that this is stable across all applications.
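
To illustrate the version 2 versus version 3 concern, here is a rough sketch of a lookup that accepts either field name. It is not the fix that was merged into sfaira. `RENAMED_FIELDS` and `get_field` are hypothetical, and only the ethnicity rename is taken from this issue.

```python
# Hypothetical mapping from a schema 2 field name to its schema 3 name.
# Only this one rename is known from the issue; others are not assumed.
RENAMED_FIELDS = {"ethnicity": "self_reported_ethnicity"}


def get_field(record, key, renamed=RENAMED_FIELDS):
    """Look up `key`, falling back to its renamed counterpart if present."""
    if key in record:
        return record[key]
    alt = renamed.get(key)
    if alt is not None and alt in record:
        return record[alt]
    # Neither name is present: keep the original KeyError behaviour.
    raise KeyError(key)


# Illustrative records, one per schema version.
v2_record = {"ethnicity": "unknown"}
v3_record = {"self_reported_ethnicity": "unknown"}

print(get_field(v2_record, "ethnicity"))
print(get_field(v3_record, "ethnicity"))
```

Checking the old name first keeps records cached under version 2 working unchanged, while version 3 records resolve through the mapping.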
