fetch_openml: Add an option that returns a DataFrame

fetch_openml currently rejects STRING-valued attributes and ordinal-encodes all NOMINAL attributes, in order to return an array or sparse matrix of floats by default.

We should have a parameter that instead returns a DataFrame of features as the ‘data’ entry in the returned Bunch. This would (by default) keep nominals as pd.Categorical and strings as objects. Columns would have names determined from the ARFF attribute names / OpenML metadata. Perhaps we would also set the DataFrame’s index from the attribute marked as is_row_identifier in OpenML.
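
A minimal sketch of what the proposed return value could look like, built from hand-written toy data rather than a real OpenML download. The attribute names, the `to_frame` helper and its `row_id` argument are hypothetical and are not part of scikit-learn; the sketch only shows nominals becoming `pd.Categorical`, strings staying as objects, and a row-identifier column becoming the index.

```python
import pandas as pd

# Toy stand-in for a parsed ARFF file; names and values are made up.
# A list of allowed values marks a NOMINAL attribute.
attributes = [
    ("row_id", "NUMERIC"),
    ("job", ["admin", "technician", "services"]),
    ("comment", "STRING"),
    ("age", "NUMERIC"),
]
rows = [
    [10, "admin", "called twice", 41.0],
    [11, "services", "no answer", 33.0],
    [12, "admin", "follow up", 58.0],
]


def to_frame(attributes, rows, row_id=None):
    """Hypothetical helper: build a DataFrame that keeps ARFF types."""
    names = [name for name, _ in attributes]
    frame = pd.DataFrame(rows, columns=names)
    for name, kind in attributes:
        if isinstance(kind, list):
            # NOMINAL: keep the declared categories, including unused ones.
            frame[name] = pd.Categorical(frame[name], categories=kind)
        elif kind == "STRING":
            # STRING: keep as plain Python objects.
            frame[name] = frame[name].astype(object)
    if row_id is not None:
        # Use the row-identifier attribute as the index.
        frame = frame.set_index(row_id)
    return frame


frame = to_frame(attributes, rows, row_id="row_id")
print(frame.dtypes)
```

Note that the unused category `technician` is still listed in `frame["job"].cat.categories`, which is information a float array cannot carry.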

See #10733 for the general issue of an API for returning DataFrames in sklearn.datasets.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 17 (11 by maintainers)

Top GitHub Comments

1 reaction
amueller commented, Nov 27, 2018

Btw, #12502 is somewhat related.

0 reactions
rth commented, Feb 4, 2019

Just to give some feedback on this as a user: I tried to load https://www.openml.org/d/1461, which is a heterogeneous dataset, with fetch_openml. It can be represented nicely with a DataFrame and can be read from the original csv with one line of pd.read_csv.

When using the fetch_openml function, categorical features are encoded as ordinals and then cast to float (since we want an array). From my perspective on this dataset, this prevents one from doing anything useful with it. The python-openml package also doesn’t support loading data as a DataFrame until https://github.com/openml/openml-python/pull/548 is merged.

In terms of the usability of OpenML datasets, returning DataFrames would be really nice.
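
To make the complaint in the comment above concrete, here is a small sketch with made-up data (not the dataset linked above). It mimics the "ordinal-encode, then cast to float" step with plain pandas and NumPy; it is not the actual fetch_openml code path.

```python
import numpy as np
import pandas as pd

# Made-up sample with one nominal and one numeric column.
frame = pd.DataFrame({
    "job": pd.Categorical(
        ["admin", "services", "admin"],
        categories=["admin", "technician", "services"],
    ),
    "age": [41.0, 33.0, 58.0],
})

# Ordinal-encode the nominal column and cast everything to float,
# so that the result fits into a single homogeneous array.
encoded = np.column_stack([
    frame["job"].cat.codes.to_numpy(dtype=float),
    frame["age"].to_numpy(dtype=float),
])

print(encoded)               # category labels are gone, only codes remain
print(frame["job"].tolist()) # the DataFrame still carries the labels
```

In the float array the job column is indistinguishable from a numeric feature, and mapping the codes back to labels needs the category list kept somewhere else.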

Top Results From Across the Web

sklearn.datasets.fetch_openml
sklearn.datasets.fetch_openml(name: Optional[str] = None, *, version: Union[str, ... The API is experimental (particularly the return value structure), ...

Transform sklearn dataframe into Pandas ... - Stack Overflow
There is no option to not encode categorical features. ... as pd import numpy as np def main(): dataset = datasets.fetch_openml('credit-g', ...

Python sklearn.datasets.fetch_openml() Examples
This page shows Python examples of sklearn.datasets.fetch_openml. ... 1) avg_ppmvs = np.asarray(ppmv_sums) / counts return months, avg_ppmvs. Example #5 ...

Scikit-learn and data frames — sklearndf documentation
return data frames as results of transformations, preserving feature names as the column index. add additional estimator properties to enable tracing a feature ......

Datasets — scikit-lego latest documentation
Datasets · return_X_y – If True, returns (data, target) instead of a dict object. · as_frame – give the pandas dataframe...
