HDFStore appending for mixed datatypes, including NumPy arrays

I have a pandas DataFrame that contains image data recorded from a camera during a behavioral experiment. A simplified version looks like this:

import numpy as np
from pandas import DataFrame, HDFStore

num_frames = 100
# One dict per camera frame: scalar metadata plus per-frame arrays.
mouse = [{"velocity": np.random.random((1,))[0],
          "image": np.random.random((80, 80)).astype('float32'),
          "spine": np.r_[0:80].astype('float32'),
          #"time": millisec(i*33),
          "mouse_id": "mouse1",
          "special": i} for i in range(num_frames)]
df = DataFrame(mouse)

I understand I can’t query over the image or spine entries. Of course, I can easily query for low velocity frames, like this:

low_velocity = df[df['velocity'] < 0.5]

However, there is a lot of this data (several hundred gigabytes), so I’d like to keep it in an HDF5 file, and pull up frames only as needed from disk.

As of v0.10, I understand that “mixed-type” frames can be appended to an HDFStore. However, I get an error when I try to append this DataFrame:

store = HDFStore("mouse.h5", "w")
store.append("mouse", df)

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-30-8f0da271e75f> in <module>()
      1 store = HDFStore("mouse.h5", "w")
----> 2 store.append("mouse", df)

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in append(self, key, value, columns, **kwargs)
    543             raise Exception("columns is not a supported keyword in append, try data_columns")
    544 
--> 545         self._write_to_group(key, value, table=True, append=True, **kwargs)
    546 
    547     def append_to_multiple(self, d, value, selector, data_columns=None, axes=None, **kwargs):

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in _write_to_group(self, key, value, index, table, append, complib, **kwargs)
    799             raise ValueError('Compression not supported on non-table')
    800 
--> 801         s.write(obj = value, append=append, complib=complib, **kwargs)
    802         if s.is_table and index:
    803             s.create_index(columns = index)

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, **kwargs)
   2537         # create the axes
   2538         self.create_axes(axes=axes, obj=obj, validate=append,
-> 2539                          min_itemsize=min_itemsize, **kwargs)
   2540 
   2541         if not self.is_exists:

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2279                 raise
   2280             except (Exception), detail:
-> 2281                 raise Exception("cannot find the correct atom type -> [dtype->%s,items->%s] %s" % (b.dtype.name, b.items, str(detail)))
   2282             j += 1
   2283 

Exception: cannot find the correct atom type -> [dtype->object,items->Index([image, mouse_id, spine], dtype=object)] cannot set an array element with a sequence
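
The error message names image, mouse_id and spine together as one object-dtype block. As a minimal sketch (using a smaller stand-in frame built the same way as the one above, and run against a recent pandas rather than the 0.11 development build in the traceback), the following shows one way to see which columns are held as generic Python objects and which of those actually contain NumPy arrays. The names object_cols and array_cols are illustrative only.

```python
import numpy as np
import pandas as pd

# Small stand-in for the frame in the question (5 frames instead of 100).
rows = [{"velocity": np.random.random(),
         "image": np.random.random((80, 80)).astype("float32"),
         "spine": np.arange(80, dtype="float32"),
         "mouse_id": "mouse1",
         "special": i} for i in range(5)]
df = pd.DataFrame(rows)

# Columns whose cells are stored as arbitrary Python objects.
object_cols = [c for c in df.columns if df[c].dtype == object]

# Of those, the columns whose cells are NumPy arrays rather than strings.
array_cols = [c for c in object_cols
              if isinstance(df[c].iloc[0], np.ndarray)]

print("object columns:", object_cols)
print("array columns:", array_cols)
```

The array-valued columns are the ones that cannot be mapped to a single table column type, which is consistent with the "cannot set an array element with a sequence" detail in the exception.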

I’m working with a relatively new release of pandas:

pandas.__version__
'0.11.0.dev-95a5326'

import tables
tables.__version__
'2.4.0+1.dev'

It would be immensely convenient to have a single repository for all of this data, instead of splitting the queryable parts off into separate nodes. Is this currently possible with some workaround (maybe with record arrays), and will it be officially supported in the future?
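
To make the record-array idea concrete, here is a rough sketch of how the frames could be laid out in memory as a NumPy structured array with fixed-shape sub-array fields. It only covers the in-memory layout and filtering; writing it to HDF5 is not shown, and it assumes every image and spine has the same shape, so it does not handle truly ragged data. The field widths (such as "S16" for mouse_id) and the name frame_dtype are assumptions made for illustration.

```python
import numpy as np

num_frames = 100

# One record per frame; "image" and "spine" are fixed-shape sub-array fields.
frame_dtype = np.dtype([
    ("velocity", "f8"),
    ("image", "f4", (80, 80)),
    ("spine", "f4", (80,)),
    ("mouse_id", "S16"),   # assumed maximum length for the ID string
    ("special", "i8"),
])

frames = np.zeros(num_frames, dtype=frame_dtype)
frames["velocity"] = np.random.random(num_frames)
frames["image"] = np.random.random((num_frames, 80, 80))
frames["spine"] = np.arange(80)        # broadcast to every record
frames["mouse_id"] = b"mouse1"
frames["special"] = np.arange(num_frames)

# Boolean mask on a scalar field selects whole records, images included.
low_velocity = frames[frames["velocity"] < 0.5]
print(low_velocity["image"].shape)
```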

As a side note, this kind of heterogeneous data (“ragged” arrays) is incredibly widespread in neurobiology and the biological sciences in general. Any extra support along these lines would be very well received.

Issue Analytics

  • State: closed
  • Created 11 years ago
  • Comments: 20 (10 by maintainers)

Top GitHub Comments

jreback commented, Mar 12, 2013

Just wrap it in a DataFrame (it's a 2-D ndarray): store.put('df/image', DataFrame(image))

For 1-D use a Series, for 3-D use a Panel, for 4-D use a Panel4D. For more than 4 dimensions, call me in the morning!

In [42]: df.iloc[0]
Out[42]: 
image       [[0.60904, 0.0175226, 0.36146, 0.947978, 0.327...
mouse_id                                               mouse1
special                                                     0
spine       [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, ...
velocity                                            0.9005659
Name: 0, dtype: object

In [43]: df.iloc[0]['image'].shape
Out[43]: (80, 80)
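
Putting the suggestion above together with the original goal, one possible arrangement is to keep the scalar columns in a single queryable table and store each array under its own node. The sketch below assumes a recent pandas with PyTables installed; Panel and Panel4D have since been removed from pandas, so it only uses Series and DataFrame. The function names and the key layout ("mouse/meta", "mouse/<column>/f<index>") are hypothetical choices, not anything prescribed by pandas.

```python
import numpy as np
import pandas as pd


def write_frames(path, df, array_cols=("image", "spine"),
                 scalar_key="mouse/meta"):
    """Store scalar columns as one queryable table, each array as its own node."""
    scalars = df.drop(columns=list(array_cols))
    with pd.HDFStore(path, mode="w") as store:
        # Table format, with velocity as a data column so it can be queried on disk.
        store.append(scalar_key, scalars, data_columns=["velocity"])
        for col in array_cols:
            for idx, arr in df[col].items():
                # 1-D arrays go in a Series, 2-D arrays in a DataFrame.
                wrapped = pd.Series(arr) if arr.ndim == 1 else pd.DataFrame(arr)
                store.put(f"mouse/{col}/f{idx}", wrapped)


def read_low_velocity_images(path, threshold=0.5, scalar_key="mouse/meta"):
    """Query the scalar table on disk, then load only the matching images."""
    with pd.HDFStore(path, mode="r") as store:
        hits = store.select(scalar_key, where=f"velocity < {threshold}")
        images = {idx: store.get(f"mouse/image/f{idx}").to_numpy()
                  for idx in hits.index}
    return hits, images
```

A call such as write_frames("mouse.h5", df) followed by read_low_velocity_images("mouse.h5") would then read back only the images for low-velocity frames. One node per frame is simple but may scale poorly to very large numbers of frames, so treat this as a starting point rather than a recommended layout.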

jreback commented, Mar 20, 2013

@alexbw close this?
