Efficient conversion from dataframes to numpy arrays retaining column dtypes and names
Problem description
I was looking into how to convert a dataframe to a numpy array so that both column dtypes and names are retained, preferably in an efficient way that does not duplicate memory. Ideally, I would like a view on the internal data the dataframe already stores as numpy arrays. I am fine with whatever datatypes and names the dataframe already uses.
The issue is that both `as_matrix` and `values` upcast all values to a single common dtype, and `to_records` does not create a plain numpy array.
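For illustration, a small example of both behaviors (the exact dtypes shown assume a 64-bit platform):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5], "c": ["x", "y"]})

# values upcasts the mixed columns to a single common dtype (object here)
print(df.values.dtype)  # object

# to_records keeps per-column dtypes and names, but the result is a
# numpy.recarray with a structured dtype, not a plain homogeneous ndarray
rec = df.to_records(index=False)
print(type(rec))        # <class 'numpy.recarray'>
print(rec.dtype)        # fields roughly: ('a', '<i8'), ('b', '<f8'), ('c', 'O')
```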
I have found two potentially relevant Stack Overflow answers:
- https://stackoverflow.com/questions/40554179/how-to-keep-column-names-when-converting-from-pandas-to-numpy
- https://stackoverflow.com/questions/13187778/convert-pandas-dataframe-to-numpy-array-preserving-index
But it seems to me that all of those solutions copy the data through intermediate data structures and only then store it in a new numpy array.
So I am asking for a way to get the data as it is, without any dtype conversions, as a numpy array.
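As a hypothetical check (not part of the original report), `np.shares_memory` can show whether a conversion copied the data; for a single-dtype frame, `values` may already be a view, while mixed dtypes force a copy:

```python
import numpy as np
import pandas as pd

# Homogeneous frame: all columns can share one float64 block internally
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
arr = df.values

# True if no copy was made; the result depends on the pandas version
# and the internal block layout, so treat this as a check, not a guarantee
print(np.shares_memory(arr, df["a"].values))
```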
Output of `pd.show_versions()`

```
pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 20.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None
```
Top GitHub Comments
From what I recall, `recarray` is a very thin subclass of `ndarray`, so something like the following probably works if you have a strict ndarray requirement downstream.
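The code from that comment is not preserved in this archive; a minimal sketch of the idea (my reconstruction, not the original snippet) is to view the recarray as a plain `ndarray`, which drops the subclass without copying the buffer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# to_records keeps per-column dtypes; viewing the recarray as np.ndarray
# strips the subclass while sharing the same underlying memory
rec = df.to_records(index=False)
arr = rec.view(np.ndarray)

print(type(arr))   # <class 'numpy.ndarray'>
print(arr["a"])    # columns stay addressable by name via the structured dtype
```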
@mitar using multi-dtype ndarrays is only supported via rec-arrays (as @chris-b1 shows how to convert). You can certainly select out columns or do a `.values` conversion, but the target function then potentially needs to deal with an `object`-dtype array, which is not efficient at all. You need to segregate dtypes; that is simply a lot of work to do with numpy arrays, while pandas does it with ease. So you can certainly use some of the solutions pointed to above, but I suspect you have other issues if the conversion to an ndarray is your bottleneck.
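To make the "segregate dtypes" point concrete, a hypothetical sketch (not from the thread) that splits a frame into one homogeneous array per dtype, so no `.values` call has to upcast:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5], "c": ["x", "y"]})

# One plain ndarray per dtype: every sub-frame is homogeneous, so each
# .values conversion keeps its native dtype instead of falling back to object
by_dtype = {
    str(dtype): df.select_dtypes(include=[str(dtype)]).values
    for dtype in df.dtypes.unique()
}

for name, arr in by_dtype.items():
    print(name, arr.dtype, arr.shape)
```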