fetch_openml: Add an option which returns a DataFrame

fetch_openml currently rejects STRING-valued attributes and ordinal-encodes all NOMINAL attributes, in order to return an array or sparse matrix of floats by default.

We should have a parameter that instead returns a DataFrame of features as the ‘data’ entry in the returned Bunch. This would (by default) keep nominals as pd.Categorical and strings as objects. Columns would have names determined from the ARFF attribute names / OpenML metadata. Perhaps we would also set the DataFrame’s index corresponding to the is_row_identifier attribute in OpenML.
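As a rough illustration of the proposal above, here is a minimal sketch in plain pandas. It is not fetch_openml's implementation: the `rows_to_frame` helper, the `attributes` metadata layout, and the sample rows are all hypothetical, made up only to show nominals kept as pd.Categorical, strings kept as objects, column names taken from attribute names, and the index set from a row-identifier column.

```python
import pandas as pd

# Hypothetical ARFF-style attribute metadata: (name, type) pairs.
# A list of allowed values marks a NOMINAL attribute.
attributes = [
    ("row_id", "INTEGER"),
    ("age", "NUMERIC"),
    ("job", ["admin", "technician", "retired"]),
    ("comment", "STRING"),
]
# Made-up sample rows, one value per attribute.
rows = [
    [0, 34.0, "admin", "called twice"],
    [1, 58.0, "retired", "no answer"],
    [2, 41.0, "admin", "asked for callback"],
]


def rows_to_frame(attributes, rows, row_id=None):
    """Build a DataFrame that keeps nominal and string columns un-encoded."""
    names = [name for name, _ in attributes]
    frame = pd.DataFrame(rows, columns=names)
    for name, dtype in attributes:
        if isinstance(dtype, list):
            # NOMINAL: keep the declared categories, even ones unseen in the rows.
            frame[name] = pd.Categorical(frame[name], categories=dtype)
        elif dtype == "STRING":
            # STRING: store as plain Python objects instead of rejecting the column.
            frame[name] = frame[name].astype(object)
    if row_id is not None:
        # Use the row-identifier attribute as the index.
        frame = frame.set_index(row_id)
    return frame


frame = rows_to_frame(attributes, rows, row_id="row_id")
print(frame.dtypes)
```

Passing the declared category list to pd.Categorical keeps values such as "technician" in the dtype even when no row uses them, which an encoding inferred from the data alone would drop.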

See #10733 for the general issue of an API for returning DataFrames in sklearn.datasets.

Issue Analytics

State:
Created 5 years ago
Comments: 17 (11 by maintainers)

Top GitHub Comments

amueller commented, Nov 27, 2018

Btw, #12502 is somewhat related.

rth commented, Feb 4, 2019

Just to give some feedback on this as a user: I tried to load https://www.openml.org/d/1461, which is a heterogeneous dataset, with fetch_openml. It can be represented nicely as a DataFrame and can be read from the original CSV with one line of pd.read_csv.

When using the fetch_openml function, categorical features are encoded as ordinals and then cast to float (since we want an array). From my perspective, on this dataset this prevents one from doing anything useful with it. The python-openml package also doesn’t support loading data as a DataFrame until https://github.com/openml/openml-python/pull/548 is merged.

In terms of the usability of OpenML datasets, returning DataFrames would be really nice.
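To make the complaint above concrete, here is a small sketch of what ordinal-encoding followed by a float cast does to a categorical column. It imitates the effect with plain pandas and NumPy and is not the code fetch_openml runs; the `jobs` column is made-up sample data.

```python
import numpy as np
import pandas as pd

# Made-up categorical column standing in for one feature of a heterogeneous dataset.
jobs = pd.Series(["admin", "retired", "technician", "admin"], name="job")

# Ordinal-encode the labels, then cast to float so the column fits an all-float array.
categorical = jobs.astype("category")
as_float = categorical.cat.codes.to_numpy(dtype=np.float64)
print(as_float)

# The float array alone no longer says which number meant which label.
# The mapping survives only if the categorical column is kept alongside it.
print(categorical.cat.categories.tolist())
```

Once only `as_float` is returned, the numbers look like an ordered numeric feature, and the original labels cannot be recovered without separately stored metadata.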
