BUG: MemoryError on reading big HDF5 files
Code Sample, a copy-pastable example if possible
import pandas as pd

store = pd.get_store('big1.h5')
i = 0
for df in store.select('/MeasurementSamples', chunksize=100):
    i += 1
print(i)
store.close()
Result:
Traceback (most recent call last):
File "memerror.py", line 6, in <module>
for df in store.select(experiment, chunksize=100):
File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 721, in select
return it.get_result()
File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 1409, in get_result
self.coordinates = self.s.read_coordinates(where=self.where)
File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 3652, in read_coordinates
coords = self.selection.select_coords()
File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 4718, in select_coords
return np.arange(start, stop)
MemoryError
Closing remaining open files:big1.h5...done
Problem description
I’m not able to iterate over the chunks of the file when the index array is too big to fit into memory. I can also mention that I’m able to view the data with ViTables (which uses PyTables internally to load the data).
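A possible workaround sketch (not verified against this pandas build; it assumes HDFStore.select accepts start/stop row bounds and that the table storer exposes nrows, and it reuses the file and node names from the example above) is to select fixed row ranges manually instead of passing chunksize:

import pandas as pd

chunk = 100000
with pd.HDFStore('big1.h5', mode='r') as store:
    # total number of rows in the table node
    nrows = store.get_storer('/MeasurementSamples').nrows
    for start in range(0, nrows, chunk):
        # reading an explicit row range should avoid building the full
        # np.arange(start, stop) coordinate array that chunksize triggers
        df = store.select('/MeasurementSamples', start=start, stop=start + chunk)
        # ... process df here ...

The trade-off is that the caller has to manage the row offsets itself.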
I’m using more or less the following code to create the file (writing to it long enough to accumulate 20 GB of data).
import tables as tb

class FreqSample(tb.IsDescription):
    tsf = tb.Int64Col(dflt=-1)   # [us] TSF value ticks in microseconds
    timestamp = tb.Int64Col()    # [ns] Epoch time
    frequency = tb.Float64Col()
    power = tb.Float64Col()

h5filters = tb.Filters(complib='blosc', complevel=5)
h5file = tb.open_file(fname, mode="a",
                      title=title,
                      filters=h5filters)
tab = h5file.create_table('/Measurement', 'a', FreqSample)

try:
    while True:
        row = tab.row
        row['tsf'] = 1
        row['timestamp'] = 2
        row['frequency'] = 3
        row['power'] = 4
        row.append()
except:
    pass

tab.autoindex = True
tab.flush()
h5file.close()
Expected Output
I would expect the above code to print the number of chunks.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.10-040910-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0+739.g7b82e8b
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.4
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: 1.1.8
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: None
pandas_gbq: None
pandas_datareader: None
Top GitHub Comments
Are there any solutions to this?
In this example, yes. I expect it will be dependent on the amount of RAM. In this case it will fail if the number of rows * 8 bytes per row (np.arange) is greater than the system's RAM.
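To put rough numbers on that, using the 20 GB figure and the 32-byte row layout (four 8-byte columns) from the description above (illustrative only; blosc compression means the on-disk size understates the real row count):

# back-of-the-envelope estimate, assuming ~32 bytes per row uncompressed
nrows = 20 * 1024**3 // 32   # a 20 GB table -> roughly 670 million rows
index_bytes = nrows * 8      # np.arange(0, nrows) of int64 -> roughly 5 GB
print(nrows, index_bytes)

So the coordinate array alone needs several gigabytes before a single chunk is read.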