fetch_openml: Add an option which returns a DataFrame
fetch_openml currently rejects STRING-valued attributes and ordinal-encodes all NOMINAL attributes, in order to return an array or sparse matrix of floats by default.
We should have a parameter that instead returns a DataFrame of features as the ‘data’ entry in the returned Bunch. This would (by default) keep nominals as pd.Categorical and strings as objects. Columns would have names determined from the ARFF attribute names / OpenML metadata. Perhaps we would also set the DataFrame’s index corresponding to the is_row_identifier attribute in OpenML.
See #10733 for the general issue of an API for returning DataFrames in sklearn.datasets.
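To make the proposal concrete, here is a minimal sketch of the kind of conversion described above. It is not scikit-learn code: the helper name arff_to_frame, the (name, type) attribute layout, and the sample rows are all assumptions made up for illustration. A list of values stands for a NOMINAL attribute, "STRING" for a string attribute, and anything else is treated as numeric.

```python
import pandas as pd


def arff_to_frame(attributes, rows, row_id=None):
    """Hypothetical helper: build a DataFrame from ARFF-style metadata.

    attributes: list of (name, type) pairs, where type is "NUMERIC",
        "STRING", or a list of allowed values for a NOMINAL attribute.
    rows: list of row value lists, in attribute order.
    row_id: optional name of the column to use as the index
        (playing the role of OpenML's is_row_identifier).
    """
    names = [name for name, _ in attributes]
    frame = pd.DataFrame(rows, columns=names)
    for name, kind in attributes:
        if isinstance(kind, list):
            # NOMINAL: keep the declared categories instead of ordinal codes.
            frame[name] = pd.Categorical(frame[name], categories=kind)
        elif kind == "STRING":
            # STRING: keep the raw values as Python objects.
            frame[name] = frame[name].astype(object)
        else:
            frame[name] = pd.to_numeric(frame[name])
    if row_id is not None:
        frame = frame.set_index(row_id)
    return frame


# Made-up metadata and rows, only to exercise the helper.
attributes = [
    ("id", "NUMERIC"),
    ("age", "NUMERIC"),
    ("job", ["admin", "technician", "services"]),
    ("comment", "STRING"),
]
rows = [
    [1, 58.0, "admin", "called twice"],
    [2, 44.0, "technician", "no answer"],
    [3, 33.0, "admin", "call back"],
]
frame = arff_to_frame(attributes, rows, row_id="id")
print(frame.dtypes)
```

In this sketch the column names come straight from the attribute names, the nominal column keeps every declared category even when a value such as "services" never occurs in the rows, and the row identifier becomes the index rather than a feature.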
Issue Analytics
- State:
- Created 5 years ago
- Comments:17 (11 by maintainers)
Top GitHub Comments
Btw, #12502 is somewhat related.
Just to give some feedback on this as a user: I tried to load https://www.openml.org/d/1461, which is a heterogeneous dataset, with fetch_openml. It can be represented nicely with a DataFrame and can be read from the original CSV with one line of pd.read_csv.
When using the fetch_openml function, categorical features are encoded as ordinals and then cast to float (since we want an array). From my perspective on this dataset, this prevents one from doing anything useful with it. The python-openml package also doesn’t support loading data as a DataFrame until https://github.com/openml/openml-python/pull/548 is merged.
In terms of usability of OpenML datasets, returning DataFrames would be really nice.
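The loss of information described in this comment can be illustrated with pandas alone. The snippet below is only a sketch on made-up values, not the actual fetch_openml code path: it ordinal-encodes a categorical column, casts the codes to float, and shows that the labels can only be recovered if the category list is kept alongside the array.

```python
import pandas as pd

# Made-up categorical feature, standing in for a NOMINAL attribute.
jobs = pd.Series(["admin", "technician", "admin", "services"], dtype="category")

# Ordinal-encode and cast to float, as happens when an array of floats is wanted.
encoded = jobs.cat.codes.astype("float64")
print(encoded.tolist())

# The float column alone no longer says which number means which job.
# Decoding requires the category list, which a plain float array does not carry.
categories = list(jobs.cat.categories)
decoded = [categories[int(code)] for code in encoded]
print(decoded)
```

A DataFrame with a categorical column keeps both pieces together, which is why it suits heterogeneous datasets better than a single float array.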