
BUG: MemoryError on reading big HDF5 files


Code Sample, a copy-pastable example if possible

import pandas as pd
store = pd.get_store('big1.h5')
i = 0
# iterate over the table in 100-row chunks and count them
for df in store.select('/MeasurementSamples', chunksize=100):
    i += 1
print(i)
store.close()

Result:

Traceback (most recent call last):
  File "memerror.py", line 6, in <module>
    for df in store.select(experiment, chunksize=100):
  File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 721, in select
    return it.get_result()
  File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 1409, in get_result
    self.coordinates = self.s.read_coordinates(where=self.where)
  File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 3652, in read_coordinates
    coords = self.selection.select_coords()
  File "/home/chwalisz/Code/ext_tools/pandas/pandas/io/pytables.py", line 4718, in select_coords
    return np.arange(start, stop)
MemoryError
Closing remaining open files:big1.h5...done

Problem description

I’m not able to iterate over the chunks of the file when the index array is too big to fit into memory. I can also mention that I’m able to view the data with ViTables (which uses PyTables internally to load the data).

I’m using more or less the following code to create the file (writing to it long enough to end up with 20 GB of data).

import tables as tb

fname = 'big1.h5'              # output file (assumed; matches the read example above)
title = 'Measurement samples'  # placeholder file title

class FreqSample(tb.IsDescription):
    tsf = tb.Int64Col(dflt=-1)  # [us] TSF value ticks in microseconds
    timestamp = tb.Int64Col()   # [ns] Epoch time
    frequency = tb.Float64Col()
    power = tb.Float64Col()

h5filters = tb.Filters(complib='blosc', complevel=5)
h5file = tb.open_file(fname, mode="a",
                      title=title,
                      filters=h5filters)
tab = h5file.create_table('/Measurement', 'a', FreqSample,
                          createparents=True)

try:
    while True:  # keep appending rows until interrupted (e.g. Ctrl+C)
        row = tab.row
        row['tsf'] = 1
        row['timestamp'] = 2
        row['frequency'] = 3
        row['power'] = 4
        row.append()
except KeyboardInterrupt:
    pass
tab.autoindex = True
tab.flush()
h5file.close()
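
To get a feel for the numbers involved, the row count of the resulting table can be checked directly with PyTables; a minimal sketch (not part of the original report), assuming the data ends up under the same /MeasurementSamples node used in the read example:

import tables as tb

# Illustrative check: the row count is what drives the size of the coordinate
# array pandas allocates in read_coordinates (one int64 per row).
with tb.open_file('big1.h5', mode='r') as h5file:
    nrows = h5file.get_node('/MeasurementSamples').nrows
    print(nrows, 'rows ->', nrows * 8 / 1e9, 'GB for an int64 coordinate array')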

Expected Output

I would expect the above code to print the number of chunks.
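
For reference, one possible way to sidestep the failing allocation is to select explicit start/stop windows instead of passing chunksize, so pandas never has to materialize the full coordinate range; a minimal sketch (not from the original report), assuming the same store and key as above:

import pandas as pd

chunk_rows = 100
with pd.HDFStore('big1.h5', mode='r') as store:
    # total number of rows, taken from the table's storer metadata
    nrows = store.get_storer('/MeasurementSamples').nrows
    n_chunks = 0
    for start in range(0, nrows, chunk_rows):
        # read only the [start, start + chunk_rows) window of rows
        df = store.select('/MeasurementSamples',
                          start=start, stop=start + chunk_rows)
        n_chunks += 1
    print(n_chunks)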

Output of pd.show_versions()

>>> pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.10-040910-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0+739.g7b82e8b
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.4
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: 1.1.8
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: open
  • Created 6 years ago
  • Comments: 17 (14 by maintainers)

Top GitHub Comments

3 reactions
sajaddarabi commented, Jan 1, 2019

Are there any solutions to this?

0 reactions
mchwalisz commented, May 8, 2017

In this example, yes. I expect it will depend on the amount of RAM.

In this case it will fail if the number of rows * 8 bytes per row (np.arange) is greater than the system's RAM.
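
As a rough illustration of that threshold (the row count below is assumed for the example, not taken from the issue):

# Back-of-the-envelope: np.arange(start, stop) allocates one int64 per row.
n_rows = 2_500_000_000           # assumed row count, for illustration only
index_bytes = n_rows * 8         # 8 bytes per int64 coordinate
print(index_bytes / 1e9, 'GB')   # ~20 GB just for the coordinate array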
