Unpredictable TypeError when computing on dataframe with datetime column
See original GitHub issue.
This is a very strange error for which I cannot provide much information. I have tried various conversions of the time column (e.g. to int64), but the result does not change.
What happened: A TypeError is thrown when applying sort_values to a very simple dataframe of datetime objects. Very minor changes in the data make the error disappear or reappear (changing a single digit, removing any row from the dataframe, or changing the number of partitions).
Minimal Complete Verifiable Example:
import pandas as pd
import dask.dataframe as dd
# Completes without exception
dts = pd.to_datetime(
['2022-04-23 00:00:00.618000+00:00',
'2022-04-23 00:00:03.199000+00:00',
'2022-04-23 00:00:03.463000+00:00',
'2022-04-23 00:00:02.396000+00:00',
'2022-04-23 00:00:02.623000+00:01',] # <-- Only difference from the next block is the last digit here
)
df = pd.DataFrame(dict(ts=dts))
ddf = dd.from_pandas(df, npartitions=2)
ddf.sort_values(by='ts').compute()
# Throws TypeError
dts = pd.to_datetime(
['2022-04-23 00:00:00.618000+00:00',
'2022-04-23 00:00:03.199000+00:00',
'2022-04-23 00:00:03.463000+00:00',
'2022-04-23 00:00:02.396000+00:00',
'2022-04-23 00:00:02.623000+00:00',]
)
df = pd.DataFrame(dict(ts=dts))
ddf = dd.from_pandas(df, npartitions=2)
ddf.sort_values(by='ts').compute() # TypeError: value should be a 'Timestamp', 'NaT', or array of those. Got 'StringArray' instead.
Anything else we need to know?: The error originally occurred with a large dataset and many partitions, so it is not related to the small data size in the example.
Environment:
- Dask version: 2022.04.1
- Python version: 3.9.1
- Operating System: Debian GNU/Linux 11 (bullseye)
- Install method (conda, pip, source): pip
Issue Analytics
- State:
- Created a year ago
- Reactions:1
- Comments:8 (3 by maintainers)
@park I have found a workaround for the issue, which involves converting the datetime column to integer and doing the sort (or other operations) on the converted column:
MCVE
Hope it is applicable to your use case.
@JnsLns Thanks for reporting this!
Looks like the first DataFrame has object dtype, while the second has datetime64[ns, UTC]. I think this error is related to compatibility with custom pandas dtypes (there are a few open issues around this). I’ll keep looking into it. 😃
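The dtype difference can be verified with pandas alone; a quick sketch (note that the object-dtype fallback for mixed offsets is pandas-version dependent):

```python
import pandas as pd

# Uniform +00:00 offsets: pandas returns the tz-aware extension dtype,
# which is the case that triggers the TypeError above.
uniform = pd.to_datetime(
    ['2022-04-23 00:00:00.618000+00:00',
     '2022-04-23 00:00:02.623000+00:00'])
print(uniform.dtype)  # datetime64[ns, UTC]

# The first (passing) example mixes +00:00 and +00:01 offsets; in the pandas
# versions current at the time of this issue, pd.to_datetime silently fell
# back to an object-dtype index of Timestamps for such input (newer pandas
# versions instead raise unless utc=True is passed).
```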