HDFStore appending for mixed datatypes, including NumPy arrays

I have a pandas DataFrame that contains image data recorded from a camera during a behavioral experiment. A simplified version looks like this:

import numpy as np
from pandas import DataFrame

num_frames = 100
# One dict per frame: scalar metadata plus two array-valued entries.
mouse = [{"velocity": np.random.random((1,))[0],
          "image": np.random.random((80, 80)).astype('float32'),
          "spine": np.r_[0:80].astype('float32'),
          # "time": millisec(i*33),
          "mouse_id": "mouse1",
          "special": i} for i in range(num_frames)]
df = DataFrame(mouse)

I understand I can’t query over the image or spine entries. Of course, I can easily query for low velocity frames, like this:

low_velocity = df[df['velocity'] < 0.5]

However, there is a lot of this data (several hundred gigabytes), so I’d like to keep it in an HDF5 file, and pull up frames only as needed from disk.

As I understand it, “mixed-type” frames can be appended to an HDFStore as of v0.10. However, I get an error when I try to append this DataFrame:

store = HDFStore("mouse.h5", "w")
store.append("mouse", df)

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-30-8f0da271e75f> in <module>()
      1 store = HDFStore("mouse.h5", "w")
----> 2 store.append("mouse", df)

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in append(self, key, value, columns, **kwargs)
    543             raise Exception("columns is not a supported keyword in append, try data_columns")
    544 
--> 545         self._write_to_group(key, value, table=True, append=True, **kwargs)
    546 
    547     def append_to_multiple(self, d, value, selector, data_columns=None, axes=None, **kwargs):

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in _write_to_group(self, key, value, index, table, append, complib, **kwargs)
    799             raise ValueError('Compression not supported on non-table')
    800 
--> 801         s.write(obj = value, append=append, complib=complib, **kwargs)
    802         if s.is_table and index:
    803             s.create_index(columns = index)

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, **kwargs)
   2537         # create the axes
   2538         self.create_axes(axes=axes, obj=obj, validate=append,
-> 2539                          min_itemsize=min_itemsize, **kwargs)
   2540 
   2541         if not self.is_exists:

/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas-0.11.0.dev_95a5326-py2.7-macosx-10.5-x86_64.egg/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2279                 raise
   2280             except (Exception), detail:
-> 2281                 raise Exception("cannot find the correct atom type -> [dtype->%s,items->%s] %s" % (b.dtype.name, b.items, str(detail)))
   2282             j += 1
   2283 

Exception: cannot find the correct atom type -> [dtype->object,items->Index([image, mouse_id, spine], dtype=object)] cannot set an array element with a sequence
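
One plausible reading of this exception is that the object-dtype block mixes plain strings (mouse_id) with cells that each hold a whole NumPy array (image, spine), and the array-valued cells cannot be mapped to a single table column type. The sketch below is a hypothetical helper, not part of pandas, that separates the array-valued columns from the scalar ones so they can be handled differently. The name `split_array_columns` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd


def split_array_columns(df):
    """Return (scalar_cols, array_cols) for a DataFrame.

    A column counts as array-valued if any of its cells is a NumPy array.
    Hypothetical helper for illustration; not a pandas API.
    """
    array_cols = [
        col for col in df.columns
        if df[col].map(lambda value: isinstance(value, np.ndarray)).any()
    ]
    scalar_cols = [col for col in df.columns if col not in array_cols]
    return scalar_cols, array_cols
```

Applied to the frame in the question, this would be expected to put image and spine in the array-valued group and leave velocity, mouse_id, and special as scalar columns.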

I’m working with a relatively new release of pandas:

pandas.__version__
'0.11.0.dev-95a5326'

import tables
tables.__version__
'2.4.0+1.dev'

It would be immensely convenient to have a single repository for all of this data, instead of fragmenting just the queryable parts off to separate nodes. Is this possible currently with some work-around (maybe with record arrays), and will this be supported officially in the future?

As a side note, this kind of heterogeneous data (“ragged” arrays) is extremely widespread in neurobiology and the biological sciences in general. Any extra support along these lines would be very well received.
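
The record-array idea raised in the question can be sketched in plain NumPy, assuming every frame has the same image and spine shape (so the data is not truly ragged). The function name `frames_to_records`, the field widths, and the 16-byte string field are assumptions chosen for illustration. The sketch only builds and filters the array in memory and does not write it to HDF5.

```python
import numpy as np


def frames_to_records(frames, image_shape=(80, 80), spine_len=80):
    """Pack a list of per-frame dicts into one NumPy structured array.

    Assumes fixed image and spine shapes; the field layout is hypothetical.
    """
    dtype = np.dtype([
        ("velocity", "f8"),
        ("image", "f4", image_shape),   # sub-array field: one image per record
        ("spine", "f4", (spine_len,)),  # sub-array field: one spine per record
        ("mouse_id", "S16"),            # fixed-width byte string
        ("special", "i8"),
    ])
    records = np.zeros(len(frames), dtype=dtype)
    for i, frame in enumerate(frames):
        for name in dtype.names:
            records[name][i] = frame[name]
    return records


def low_velocity(records, threshold=0.5):
    """Select records by a scalar field; array fields come along unchanged."""
    return records[records["velocity"] < threshold]
```

With this layout the scalar fields stay filterable while each record still carries its image and spine, which is the "single repository" shape the question asks for, at least in memory.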

Issue Analytics

Created 11 years ago
Comments: 20 (10 by maintainers)

Top GitHub Comments

jreback commented, Mar 12, 2013

Just wrap it in a DataFrame (it’s a 2-D ndarray): store.put('df/image', DataFrame(image))

For 1-D use a Series, for 3-D use a Panel, and for 4-D use a Panel4D. More than 4 dims? Call me in the morning!

In [42]: df.iloc[0]
Out[42]: 
image       [[0.60904, 0.0175226, 0.36146, 0.947978, 0.327...
mouse_id                                               mouse1
special                                                     0
spine       [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, ...
velocity                                            0.9005659
Name: 0, dtype: object

In [43]: df.iloc[0]['image'].shape
Out[43]: (80, 80)
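
The suggestion in this comment can be sketched end to end as follows. This is a minimal sketch and not the maintainers' implementation: the node names (`mouse/meta`, `mouse/image/frame_N`, `mouse/spine/frame_N`) and both function names are hypothetical, and it assumes PyTables is installed. It sticks to Series and DataFrame, since Panel and Panel4D have been removed from later pandas releases.

```python
import numpy as np
import pandas as pd


def save_mouse_frames(path, df, array_cols=("image", "spine")):
    """Write scalar columns to a queryable table and each array to its own node."""
    scalar_df = df.drop(columns=list(array_cols))
    with pd.HDFStore(path, mode="w") as store:
        # Table format with data_columns=True makes the scalar columns
        # queryable on disk.
        store.append("mouse/meta", scalar_df, data_columns=True)
        for idx, row in df.iterrows():
            # 2-D array -> DataFrame, 1-D array -> Series, as suggested above.
            store.put(f"mouse/image/frame_{idx}", pd.DataFrame(row["image"]))
            store.put(f"mouse/spine/frame_{idx}", pd.Series(row["spine"]))


def load_low_velocity_images(path, threshold=0.5):
    """Query the metadata table, then read only the matching images."""
    with pd.HDFStore(path, mode="r") as store:
        meta = store.select("mouse/meta", where=f"velocity < {threshold}")
        return {
            idx: store[f"mouse/image/frame_{idx}"].to_numpy()
            for idx in meta.index
        }
```

One node per frame keeps the example close to the comment, but with several hundred gigabytes of frames the number of nodes would grow very large, so this layout is probably best treated as a starting point.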

jreback commented, Mar 20, 2013

@alexbw close this?