BUG: MemoryError on reading big HDF5 files
Code Sample, a copy-pastable example if possible
import pandas as pd

store = pd.get_store('big1.h5')
i = 0
for df in store.select('/MeasurementSamples', chunksize=100):
    i += 1
print(i)
store.close()
Result:
Traceback (most recent call last):
File "memerror.py", line 6, in <module>
for df in store.select(experiment, chunksize=100):
File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 721, in select
return it.get_result()
File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 1409, in get_result
self.coordinates = self.s.read_coordinates(where=self.where)
File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 3652, in read_coordinates
coords = self.selection.select_coords()
File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 4718, in select_coords
return np.arange(start, stop)
MemoryError
Closing remaining open files:big1.h5...done
Problem description
I’m not able to iterate over the chunks of the file when the index array is too big to fit into memory. I can also mention that I’m able to view the data with ViTables (which uses PyTables internally to load the data).
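A possible workaround sketch (not verified against this pandas build; it assumes HDFStore.select accepts start/stop row bounds and that the table storer exposes nrows, and it reuses the file and node names from the example above) is to select fixed row ranges manually instead of passing chunksize:

import pandas as pd

chunk = 100000
with pd.HDFStore('big1.h5', mode='r') as store:
    # total number of rows in the table node
    nrows = store.get_storer('/MeasurementSamples').nrows
    for start in range(0, nrows, chunk):
        # reading an explicit row range should avoid building the full
        # np.arange(start, stop) coordinate array that chunksize triggers
        df = store.select('/MeasurementSamples', start=start, stop=start + chunk)
        # ... process df here ...

The trade-off is that the caller has to manage the row offsets itself.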
I’m using more or less the following code to create the file (writing to it long enough to accumulate 20 GB of data).
import tables as tb

class FreqSample(tb.IsDescription):
    tsf = tb.Int64Col(dflt=-1)   # [us] TSF value ticks in microseconds
    timestamp = tb.Int64Col()    # [ns] Epoch time
    frequency = tb.Float64Col()
    power = tb.Float64Col()

h5filters = tb.Filters(complib='blosc', complevel=5)
h5file = tb.open_file(fname, mode="a",
                      title=title,
                      filters=h5filters)
tab = h5file.create_table('/Measurement', 'a', FreqSample)

try:
    while True:
        row = tab.row
        row['tsf'] = 1
        row['timestamp'] = 2
        row['frequency'] = 3
        row['power'] = 4
        row.append()
except:
    pass

tab.autoindex = True
tab.flush()
h5file.close()
Expected Output
I would expect the above code to print the number of chunks.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.10-040910-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0+739.g7b82e8b
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.4
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: 1.1.8
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: None
pandas_gbq: None
pandas_datareader: None
Top GitHub Comments
Are there any solutions to this?
In this example, yes. I expect it will be dependent on the amount of RAM. In this case it will fail if the number of rows * 8 bytes per row (np.arange) is greater than the system's RAM.
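To put rough numbers on that, using the 20 GB figure and the 32-byte row layout (four 8-byte columns) from the description above (illustrative only; blosc compression means the on-disk size understates the real row count):

# back-of-the-envelope estimate, assuming ~32 bytes per row uncompressed
nrows = 20 * 1024**3 // 32   # a 20 GB table -> roughly 670 million rows
index_bytes = nrows * 8      # np.arange(0, nrows) of int64 -> roughly 5 GB
print(nrows, index_bytes)

So the coordinate array alone needs several gigabytes before a single chunk is read.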