Performance pd.HDFStore().keys() slow
Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd

path = 'test.h5'
dataframes = [pd.DataFrame(np.random.rand(500, 100)) for i in range(3000)]

# Write 3000 small frames, each under its own key
with pd.HDFStore(path) as store:
    for i, df in enumerate(dataframes):
        store.put('test' + str(i), df)

# IPython magic: time how long it takes to list the keys
%timeit store = pd.HDFStore(path).keys()
Problem description
pd.HDFStore().keys() is very slow for a large store containing many dataframes: the code above takes 10.6 s just to get the list of keys in the store.
It appears the issue is related to the path_walk in tables requiring every single node be loaded to check whether it is a group.
/tables/file.py
def iter_nodes(self, where, classname=None):
    """Iterate over children nodes hanging from where."""
    group = self.get_node(where)  # Does the parent exist?
    self._check_group(group)  # Is it a group?
    return group._f_iter_nodes(classname)
%lprun -f store._handle.iter_nodes store.keys()
Timer unit: 2.56e-07 s
Total time: 0.0424965 s
File: D:\Anaconda3\lib\site-packages\tables\file.py
Function: iter_nodes at line 1998
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1998                                           def iter_nodes(self, where, classname=None):
  1999                                               """Iterate over children nodes hanging from where.
  2000
  2001                                               Parameters
  2002                                               ----------
  2003                                               where
  2004                                                   This argument works as in :meth:`File.get_node`, referencing the
  2005                                                   node to be acted upon.
  2006                                               classname
  2007                                                   If the name of a class derived from
  2008                                                   Node (see :ref:`NodeClassDescr`) is supplied, only instances of
  2009                                                   that class (or subclasses of it) will be returned.
  2010
  2011                                               Notes
  2012                                               -----
  2013                                               The returned nodes are alphanumerically sorted by their name.
  2014                                               This is an iterator version of :meth:`File.list_nodes`.
  2015
  2016                                               """
  2017
  2018      6001       125237     20.9     75.4      group = self.get_node(where)  # Does the parent exist?
  2019      6001        26549      4.4     16.0      self._check_group(group)  # Is it a group?
  2020
  2021      6001        14216      2.4      8.6      return group._f_iter_nodes(classname)
Therefore, if the dataframes are large and there are many of them in one store, listing the keys can take a very long time (my real-life code takes 1 min to do this). My version of pandas is older, but I don't think this has been fixed in subsequent versions.
I'm also not sure whether to raise this in pandas or in tables.
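The cost pattern described above (one node load per visited group) can be illustrated with a toy model. This is a pure-Python sketch, not PyTables or pandas code: `ToyFile`, `get_node`, `keys_by_walking`, and `keys_from_index` are hypothetical names, and the store is assumed to be flat (every key is a direct child of the root). It only counts how many "loads" each strategy performs.

```python
class ToyFile:
    """Toy stand-in for an HDF5 file holding a flat set of groups (not PyTables)."""

    def __init__(self, n_children):
        # Child-name index kept by each parent: names only, no node objects.
        self.child_names = {"/": ["test%d" % i for i in range(n_children)]}
        for name in self.child_names["/"]:
            self.child_names["/" + name] = []
        self.loads = 0  # number of nodes that were "loaded"

    def get_node(self, path):
        """Pretend to load a node; in a real file this is the costly step."""
        self.loads += 1
        return path

    def keys_by_walking(self):
        """List keys by loading every group, as a full tree walk does."""
        keys = []
        stack = ["/"]
        while stack:
            path = self.get_node(stack.pop())
            if path != "/":
                keys.append(path)
            for name in self.child_names[path]:
                stack.append(path.rstrip("/") + "/" + name)
        return sorted(keys)

    def keys_from_index(self):
        """List keys from the root's child-name index, loading only the root."""
        self.get_node("/")
        return sorted("/" + name for name in self.child_names["/"])


walked = ToyFile(3000)
walk_keys = walked.keys_by_walking()

indexed = ToyFile(3000)
index_keys = indexed.keys_from_index()

# Same keys either way, but the walk loads one node per group.
print(walked.loads, indexed.loads)  # prints: 3001 1
```

The shortcut only works because the toy store is flat; a store with nested keys would still need some form of walk to find the deeper groups.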
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.11.3
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: 2.45.0
pandas_datareader: None
Issue Analytics
- Created 6 years ago
- Reactions: 2
- Comments: 5 (3 by maintainers)
Top GitHub Comments
Found a temporary solution (at least for Python 3.6.6): downgrade to pandas 0.20.3 (I've had better overall performance with this version anyway), then use the root attribute and the built-in dir() function:
Voilà! Good luck.
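The snippet that accompanied this comment was not preserved in this copy. The following is a minimal sketch of the same idea, not the commenter's original code. It swaps dir() for the PyTables `_v_children` mapping on the root group, which, as far as I understand, exposes child names without loading the child nodes. The helper names `top_level_keys` and `list_store_keys` are hypothetical. The sketch assumes a flat store in which every key is a direct child of the root, and it does not filter out groups that are not pandas objects.

```python
def top_level_keys(root_group):
    """Return '/name' for every direct child of a PyTables group.

    Assumes `root_group._v_children` maps child names to nodes, as PyTables
    groups do. Only the names are iterated, so no child is accessed by value.
    """
    return sorted("/" + name for name in root_group._v_children)


def list_store_keys(path):
    """Hypothetical helper: open an HDFStore read-only and list top-level keys."""
    import pandas as pd  # imported lazily; requires pandas and PyTables

    with pd.HDFStore(path, mode="r") as store:
        # store.root is the root group of the underlying PyTables file
        return top_level_keys(store.root)
```

For the store created in the code sample, `list_store_keys('test.h5')` should return names of the form '/test0', '/test1', and so on. Nested keys such as 'a/b' would not be found this way.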
https://github.com/pandas-dev/pandas/pull/21543 improved the performance of .groups. Is .keys still slow?
Can you see if we have open issues for those slowdowns?