Performance pd.HDFStore().keys() slow

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

path = 'test.h5'
# 3000 DataFrames of shape (500, 100)
dataframes = [pd.DataFrame(np.random.rand(500, 100)) for i in range(3000)]
with pd.HDFStore(path) as store:
    for i, df in enumerate(dataframes):
        store.put('test' + str(i), df)
# IPython magic: times opening the store and listing its keys
%timeit store = pd.HDFStore(path).keys()
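
The `%timeit` line only runs inside IPython or Jupyter. A minimal sketch of the same measurement for a plain Python script, using the standard-library `timeit` module, might look like the following. The helper name `best_time` and the `list_keys` function in the usage comment are hypothetical, not part of pandas or of the original report.

```python
import timeit


def best_time(fn, repeat=3, number=1):
    """Return the best per-call wall-clock time in seconds for fn()."""
    # timeit.repeat accepts a zero-argument callable as the statement.
    timings = timeit.repeat(fn, repeat=repeat, number=number)
    return min(timings) / number


# Hypothetical usage with the store created in the sample above:
#
#   import pandas as pd
#
#   def list_keys():
#       with pd.HDFStore('test.h5', mode='r') as store:
#           return store.keys()
#
#   print(f"{best_time(list_keys):.2f} s")
```

Taking the minimum over a few repeats reduces noise from disk caching and other processes, which matters when the call takes seconds.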

Problem description

pd.HDFStore().keys() is very slow for a large store containing many DataFrames: the code above takes 10.6 s just to get a list of the keys in the store.

It appears the issue is related to the path_walk in tables requiring every single node be loaded to check whether it is a group.

/tables/file.py

def iter_nodes(self, where, classname=None):
    """Iterate over children nodes hanging from where."""

    group = self.get_node(where)  # Does the parent exist? (the hot line)
    self._check_group(group)  # Is it a group?

    return group._f_iter_nodes(classname)

%lprun -f store._handle.iter_nodes store.keys()
Timer unit: 2.56e-07 s
Total time: 0.0424965 s
File: D:\Anaconda3\lib\site-packages\tables\file.py
Function: iter_nodes at line 1998
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1998                                               def iter_nodes(self, where, classname=None):
  1999                                                   """Iterate over children nodes hanging from where.
  2000                                           
  2001                                                   Parameters
  2002                                                   ----------
  2003                                                   where
  2004                                                       This argument works as in :meth:`File.get_node`, referencing the
  2005                                                       node to be acted upon.
  2006                                                   classname
  2007                                                       If the name of a class derived from
  2008                                                       Node (see :ref:`NodeClassDescr`) is supplied, only instances of
  2009                                                       that class (or subclasses of it) will be returned.
  2010                                           
  2011                                                   Notes
  2012                                                   -----
  2013                                                   The returned nodes are alphanumerically sorted by their name.
  2014                                                   This is an iterator version of :meth:`File.list_nodes`.
  2015                                           
  2016                                                   """
  2017                                           
  2018      6001       125237     20.9     75.4          group = self.get_node(where)  # Does the parent exist?
  2019      6001        26549      4.4     16.0          self._check_group(group)  # Is it a group?
  2020                                           
  2021      6001        14216      2.4      8.6          return group._f_iter_nodes(classname)

Therefore, if the DataFrames are large and there are many of them in one store, listing the keys can take a very long time (my real-life code takes 1 min to do this). My version of pandas is older, but I don’t think this has been fixed in later versions.

Also not sure whether to raise this in pandas or tables.
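
One possible workaround, sketched under assumptions: if every object sits directly under the root group (as in the sample above), the child names can be read from the PyTables `Group._v_children` mapping, which, as far as I understand, holds the names without loading each node. The helper name `top_level_keys` is hypothetical; it does not descend into nested groups, and it would also list any non-pandas nodes under the root.

```python
def top_level_keys(store):
    """List '/name' keys for the direct children of the store's root group.

    Assumption: `store` is an open pandas.HDFStore, so `store.root` is a
    PyTables Group and `_v_children` maps child names to nodes.
    Only the names are read here; the nodes themselves are not accessed.
    """
    return sorted('/' + name for name in store.root._v_children)
```

For the store above, `top_level_keys(store)` would be called on an open `pd.HDFStore(path, mode='r')`. Timing it against `store.keys()` on your own file is the only way to confirm that it actually helps with your pandas and PyTables versions.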

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None

Issue Analytics

Created 6 years ago
Reactions: 2
Comments: 5 (3 by maintainers)

Top GitHub Comments

2 reactions

ben-daghir commented, Jul 17, 2018

Found a temporary workaround (at least on Python 3.6.6): downgrade to pandas 0.20.3 (I’ve had better overall performance with this version anyway), then use the store's root attribute and the built-in dir() function:

store = pandas.HDFStore(file)
keys = dir(store.root)

Voilà! Good luck.

0 reactions

TomAugspurger commented, Jul 19, 2018

https://github.com/pandas-dev/pandas/pull/21543 improved the performance of .groups. Is .keys still slow?

“I’ve had better overall performance with this version anyway”

Can you see if we have open issues for those slowdowns?
