Cellxgene schema 3.0.0 breaks dataloader

Hi there, I’m trying to use the data loader to access cellxgene collections (following the tutorial). I’ve run this before without problems, but now it throws a new KeyError. I think this is because cellxgene renamed its ethnicity metadata column to self_reported_ethnicity in schema 3.0.0.

To Reproduce

import anndata
import os
import sfaira

cache_path = os.path.join(".", "data")
dsg = sfaira.data.dataloaders.databases.DatasetSuperGroupDatabases(data_path=cache_path, cache_metadata=True)

Traceback:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [5], in <cell line: 6>()
      3 import sfaira
      5 cache_path = os.path.join(".", "data2")
----> 6 dsg = sfaira.data.dataloaders.databases.DatasetSuperGroupDatabases(data_path=cache_path, cache_metadata=True)

File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/super_group.py:17, in DatasetSuperGroupDatabases.__init__(self, data_path, meta_path, cache_path, cache_metadata)
      9 def __init__(
     10         self,
     11         data_path: Union[str, None] = None,
   (...)
     14         cache_metadata: bool = False,
     15 ):
     16     dataset_super_groups = [
---> 17         DatasetSuperGroupCellxgene(
     18             data_path=data_path,
     19             meta_path=meta_path,
     20             cache_path=cache_path,
     21             cache_metadata=cache_metadata,
     22         ),
     23     ]
     24     super().__init__(dataset_groups=dataset_super_groups)

File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_group.py:83, in DatasetSuperGroupCellxgene.__init__(self, data_path, meta_path, cache_path, cache_metadata, verbose)
     81     print("WARNING: Zero cellxgene collections retrieved.")
     82 # Note that the collection itself is not passed to DatasetGroupCellxgene but only the ID string.
---> 83 dataset_groups = [
     84     DatasetGroupCellxgene(
     85         collection_id=x["id"],
     86         data_path=data_path,
     87         meta_path=meta_path,
     88         cache_path=cache_path,
     89         cache_metadata=cache_metadata,
     90         verbose=verbose,
     91     )
     92     for x in collections
     93 ]
     94 super().__init__(dataset_groups=dataset_groups)

File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_group.py:84, in <listcomp>(.0)
     81     print("WARNING: Zero cellxgene collections retrieved.")
     82 # Note that the collection itself is not passed to DatasetGroupCellxgene but only the ID string.
     83 dataset_groups = [
---> 84     DatasetGroupCellxgene(
     85         collection_id=x["id"],
     86         data_path=data_path,
     87         meta_path=meta_path,
     88         cache_path=cache_path,
     89         cache_metadata=cache_metadata,
     90         verbose=verbose,
     91     )
     92     for x in collections
     93 ]
     94 super().__init__(dataset_groups=dataset_groups)

File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_group.py:33, in DatasetGroupCellxgene.__init__(self, collection_id, data_path, meta_path, cache_path, cache_metadata, verbose)
     31 loader_pydoc_path_sfaira = "sfaira.data.dataloaders.databases.cellxgene.cellxgene_loader"
     32 load_func = pydoc.locate(loader_pydoc_path_sfaira + ".load")
---> 33 datasets = [
     34     Dataset(
     35         collection_id=collection_id,
     36         data_path=data_path,
     37         meta_path=meta_path,
     38         cache_path=cache_path,
     39         load_func=load_func,
     40         sample_fn=x,
     41         sample_fns=dataset_ids,
     42         cache_metadata=cache_metadata,
     43         verbose=verbose,
     44     )
     45     for x in dataset_ids
     46 ]
     47 keys = [x.id for x in datasets]
     48 super().__init__(datasets=dict(zip(keys, datasets)), collection_id=collection_id)

File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_group.py:34, in <listcomp>(.0)
     31 loader_pydoc_path_sfaira = "sfaira.data.dataloaders.databases.cellxgene.cellxgene_loader"
     32 load_func = pydoc.locate(loader_pydoc_path_sfaira + ".load")
     33 datasets = [
---> 34     Dataset(
     35         collection_id=collection_id,
     36         data_path=data_path,
     37         meta_path=meta_path,
     38         cache_path=cache_path,
     39         load_func=load_func,
     40         sample_fn=x,
     41         sample_fns=dataset_ids,
     42         cache_metadata=cache_metadata,
     43         verbose=verbose,
     44     )
     45     for x in dataset_ids
     46 ]
     47 keys = [x.id for x in datasets]
     48 super().__init__(datasets=dict(zip(keys, datasets)), collection_id=collection_id)

File ~/my-conda-envs/oor-benchmark/lib/python3.10/site-packages/sfaira/data/dataloaders/databases/cellxgene/cellxgene_loader.py:107, in Dataset.__init__(self, collection_id, data_path, meta_path, cache_path, load_func, dict_load_func_annotation, yaml_path, sample_fn, sample_fns, additional_annotation_key, cache_metadata, verbose, **kwargs)
    105 reordered_keys = ["organism"] + [x for x in self._adata_ids_cellxgene.dataset_keys if x != "organism"]
    106 for k in reordered_keys:
--> 107     val = self._collection_dataset[getattr(self._adata_ids_cellxgene, k)]
    108     # Unique label if list is length 1:
    109     # Otherwise do not set property and resort to cell-wise labels.
    110     v_clean = clean_cellxgene_meta_uns(k=k, val=val, adata_ids=self._adata_ids_cellxgene)

KeyError: 'ethnicity'
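
The last frame is a plain dictionary lookup: the loader looks up each expected metadata key in the collection's dataset record, and the record has no `ethnicity` entry. Below is a minimal sketch of that failure mode, independent of sfaira. The names `dataset_record`, `expected_keys`, and `missing_keys` are hypothetical and chosen for illustration. They are not taken from the sfaira code base, and the record contents are made up.

```python
# Hypothetical stand-in for one dataset record served under schema 3.0.0.
dataset_record = {
    "organism": [{"label": "Homo sapiens"}],
    "self_reported_ethnicity": [{"label": "unknown"}],
}

# Hypothetical list of keys a loader written against schema 2 would expect.
expected_keys = ["organism", "ethnicity"]


def missing_keys(record, keys):
    """Return the expected keys that are absent from the record."""
    return [k for k in keys if k not in record]


# Checking up front names every missing key at once.
missing = missing_keys(dataset_record, expected_keys)
print(missing)

# A direct lookup of the old name fails the same way as in the traceback.
try:
    dataset_record["ethnicity"]
except KeyError as err:
    error_message = str(err)
    print("KeyError:", error_message)
```

A check like this before the lookup loop would report all renamed or removed fields together, rather than stopping at the first one.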

System:

  • sfaira version: v0.3.12
  • OS: Ubuntu 20.04.1 LTS
  • Python 3.10.6
  • Virtual environment: Conda

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
davidsebfischer commented, Nov 21, 2022

Update: I expect to merge the fix into dev this week.

0 reactions
davidsebfischer commented, Nov 22, 2022

The fix is now merged into dev and seems to be working. We will need to be careful when using it with existing schema version 2 datasets, but it should work well with data downloaded entirely under version 3. Let me know if any more issues come up, especially in continued work with version 2. I will hold off on releasing until it is clear that this is stable across all applications.
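
For readers working with both schema version 2 and version 3 data, one possible way to tolerate either is to look up the version 3 key first and fall back to the older name. This is only a sketch under assumptions: `KEY_ALIASES` and `lookup_with_aliases` are hypothetical names, the records are made-up examples, and this is not necessarily how the fix merged into dev works.

```python
# Hypothetical alias table: schema 3 field name -> older names for the same field.
KEY_ALIASES = {
    "self_reported_ethnicity": ["ethnicity"],
}


def lookup_with_aliases(record, key, aliases=KEY_ALIASES):
    """Return record[key], falling back to older names for the same field."""
    for candidate in [key] + aliases.get(key, []):
        if candidate in record:
            return record[candidate]
    # Neither the current name nor any alias is present.
    raise KeyError(key)


# Made-up records: one written under schema 2, one under schema 3.
v2_record = {"ethnicity": "unknown"}
v3_record = {"self_reported_ethnicity": "unknown"}

print(lookup_with_aliases(v2_record, "self_reported_ethnicity"))
print(lookup_with_aliases(v3_record, "self_reported_ethnicity"))
```

Keeping the aliases in one table means a later rename only needs a new entry rather than changes at every lookup site.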
