Efficient conversion from dataframes to numpy arrays retaining column dtypes and names
Problem description
I was looking into how to convert a dataframe to a numpy array so that both column dtypes and names are retained, preferably in an efficient way that does not duplicate memory. Ideally, I would like a view on the internal data the dataframe already stores as numpy arrays. I am fine with whatever datatypes and names the dataframe already uses.
The issue is that both `as_matrix` and `values` upcast all values to a single common dtype, and `to_records` does not create a plain numpy array.
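For illustration, a small example of both behaviors (the exact dtypes shown assume a 64-bit platform):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5], "c": ["x", "y"]})

# values upcasts the mixed columns to a single common dtype (object here)
print(df.values.dtype)  # object

# to_records keeps per-column dtypes and names, but the result is a
# numpy.recarray with a structured dtype, not a plain homogeneous ndarray
rec = df.to_records(index=False)
print(type(rec))        # <class 'numpy.recarray'>
print(rec.dtype)        # fields roughly: ('a', '<i8'), ('b', '<f8'), ('c', 'O')
```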
I have found two potentially relevant Stack Overflow answers:
- https://stackoverflow.com/questions/40554179/how-to-keep-column-names-when-converting-from-pandas-to-numpy
- https://stackoverflow.com/questions/13187778/convert-pandas-dataframe-to-numpy-array-preserving-index
But it seems to me that all of those solutions copy the data through intermediate data structures and only then store it in a new numpy array.
So I am asking for a way to get the data as it is, without any dtype conversions, as a numpy array.
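As a hypothetical check (not part of the original report), `np.shares_memory` can show whether a conversion copied the data; for a single-dtype frame, `values` may already be a view, while mixed dtypes force a copy:

```python
import numpy as np
import pandas as pd

# Homogeneous frame: all columns can share one float64 block internally
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
arr = df.values

# True if no copy was made; the result depends on the pandas version
# and the internal block layout, so treat this as a check, not a guarantee
print(np.shares_memory(arr, df["a"].values))
```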
Output of `pd.show_versions()`

```
pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 20.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None
```
Top GitHub Comments
From what I recall, `recarray` is a very thin subclass of `ndarray`, so something like the following probably works if you have a strict ndarray requirement downstream.
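The code from that comment is not preserved in this archive; a minimal sketch of the idea (my reconstruction, not the original snippet) is to view the recarray as a plain `ndarray`, which drops the subclass without copying the buffer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})

# to_records keeps per-column dtypes; viewing the recarray as np.ndarray
# strips the subclass while sharing the same underlying memory
rec = df.to_records(index=False)
arr = rec.view(np.ndarray)

print(type(arr))   # <class 'numpy.ndarray'>
print(arr["a"])    # columns stay addressable by name via the structured dtype
```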
@mitar using multi-dtype ndarrays is only supported via rec-arrays (as @chris-b1 shows how to convert). You can certainly select out columns or do a `.values` conversion, but the target function then potentially needs to deal with an `object`-dtype array, which is not efficient at all. You need to segregate dtypes; that is simply a lot of work to do with numpy arrays, while pandas does it with ease. So you can certainly use some of the solutions pointed to above, but I suspect you have other issues if the conversion to an ndarray is your bottleneck.
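To make the "segregate dtypes" point concrete, a hypothetical sketch (not from the thread) that splits a frame into one homogeneous array per dtype, so no `.values` call has to upcast:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5], "c": ["x", "y"]})

# One plain ndarray per dtype: every sub-frame is homogeneous, so each
# .values conversion keeps its native dtype instead of falling back to object
by_dtype = {
    str(dtype): df.select_dtypes(include=[str(dtype)]).values
    for dtype in df.dtypes.unique()
}

for name, arr in by_dtype.items():
    print(name, arr.dtype, arr.shape)
```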