`unique()` has wrong return type
cc https://github.com/databricks/koalas/issues/233
Not sure why https://github.com/databricks/koalas/pull/249 decided to return a Series instead of a numpy array?
>>> import findspark
>>> findspark.init()
>>> import databricks.koalas as ks
>>> import pandas as pd
>>> findspark.__version__
'1.3.0'
>>> pd.__version__
'0.24.2'
>>> ks.__version__
'0.12.0'
>>> pdf = pd.DataFrame({'a': [1, 2]})
>>> kdf = ks.DataFrame(pdf)
... (spark output removed) ...
>>> kdf['a'].unique()
0    1
1    2
Name: a, dtype: int64
>>> pdf['a'].unique()
array([1, 2])
>>> list(pdf['a'].unique())
[1, 2]
>>> list(kdf['a'].unique())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/databricks/koalas/series.py", line 2407, in __getitem__
    return Series(self._scol.__getitem__(key), anchor=self._kdf, index=self._index_map)
  File "/usr/local/lib/python3.6/dist-packages/databricks/koalas/series.py", line 273, in __init__
    data=data, index=index, dtype=dtype, name=name, copy=copy, fastpath=fastpath)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/series.py", line 198, in __init__
    elif isinstance(data, (ABCSeries, ABCSparseSeries)):
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/dtypes/generic.py", line 9, in _check
    return getattr(inst, attr, '_typ') in comp
  File "/opt/spark/python/pyspark/sql/column.py", line 682, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>> a = []
>>> a.extend(kdf['a'].unique())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/databricks/koalas/series.py", line 2407, in __getitem__
    return Series(self._scol.__getitem__(key), anchor=self._kdf, index=self._index_map)
  File "/usr/local/lib/python3.6/dist-packages/databricks/koalas/series.py", line 273, in __init__
    data=data, index=index, dtype=dtype, name=name, copy=copy, fastpath=fastpath)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/series.py", line 198, in __init__
    elif isinstance(data, (ABCSeries, ABCSparseSeries)):
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/dtypes/generic.py", line 9, in _check
    return getattr(inst, attr, '_typ') in comp
  File "/opt/spark/python/pyspark/sql/column.py", line 682, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>> import numpy as np
>>> np.__version__
'1.15.2'
>>> np.array(pdf['a'].unique())
array([1, 2])
>>> np.array(kdf['a'].unique())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/databricks/koalas/series.py", line 2407, in __getitem__
    return Series(self._scol.__getitem__(key), anchor=self._kdf, index=self._index_map)
  File "/usr/local/lib/python3.6/dist-packages/databricks/koalas/series.py", line 273, in __init__
    data=data, index=index, dtype=dtype, name=name, copy=copy, fastpath=fastpath)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/series.py", line 198, in __init__
    elif isinstance(data, (ABCSeries, ABCSparseSeries)):
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/dtypes/generic.py", line 9, in _check
    return getattr(inst, attr, '_typ') in comp
  File "/opt/spark/python/pyspark/sql/column.py", line 682, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
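For what it's worth, all three tracebacks above go straight from `<stdin>` into `Series.__getitem__`. That is what you would expect if the object has no `__iter__`: Python then falls back to the old sequence protocol and calls `obj[0]`, `obj[1]`, ... until it sees an `IndexError`. A minimal stdlib-only sketch of that fallback follows. `LazySeries` is a hypothetical stand-in, not the Koalas class, and whether Koalas fails for exactly this reason is an assumption based on the traceback.

```python
class LazySeries:
    """Hypothetical stand-in for a lazily evaluated series (not the Koalas class)."""

    def __init__(self):
        self.requested_keys = []

    def __getitem__(self, key):
        # Record which key Python asked for, then fail the way a lazy
        # column might when asked to produce a concrete element.
        self.requested_keys.append(key)
        raise ValueError("cannot materialize element %r lazily" % (key,))


series = LazySeries()
try:
    # No __iter__ is defined, so list() iterates via series[0], series[1], ...
    list(series)
except ValueError as exc:
    error_message = str(exc)

print(series.requested_keys)
print(error_message)
```

The same fallback is used by `a.extend(...)` and `np.array(...)`, which would explain why all three calls end up in the same place.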
Issue Analytics
- State:
- Created 4 years ago
- Comments:11 (7 by maintainers)
One addition: we should at the very least improve the error message, so that people find out about `to_numpy` without having to read all the docs.
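A rough sketch of what a clearer message could look like, assuming the fix is to define `__iter__` explicitly so that iteration fails early with a pointer to `to_numpy`. `LazySeries`, its constructor, and the message wording are all hypothetical and not taken from the Koalas code base; `to_numpy` here returns a plain list only to keep the example dependency-free.

```python
class LazySeries:
    """Hypothetical lazily evaluated series (not the actual Koalas implementation)."""

    def __init__(self, values):
        self._values = list(values)

    def to_numpy(self):
        # Stand-in for collecting the data locally; a real implementation
        # would return a numpy array rather than a list.
        return list(self._values)

    def __iter__(self):
        # Defining __iter__ prevents the silent fallback to __getitem__(0), ...
        # and lets us point the user at the method they probably want.
        raise TypeError(
            "LazySeries is not iterable; call .to_numpy() to collect the values first"
        )


series = LazySeries([1, 2])
try:
    list(series)
except TypeError as exc:
    message = str(exc)

values = series.to_numpy()
print(message)
print(values)
```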
Thanks for the comment @smalory.
@ueshin can we fix the error message?