Unpredictable TypeError when computing on dataframe with datetime column
See original GitHub issue.
This is a very strange error for which I cannot provide much information. I have tried various conversions of the time column (e.g. to int64), but the result does not change.
What happened: A TypeError is thrown when applying sort_values to a very simple dataframe of datetime objects. Very minor changes in the data make the error disappear or reappear (changing a single digit, removing any row from the dataframe, or changing the number of partitions).
Minimal Complete Verifiable Example:
import pandas as pd
import dask.dataframe as dd
# Completes without exception
dts = pd.to_datetime(
['2022-04-23 00:00:00.618000+00:00',
'2022-04-23 00:00:03.199000+00:00',
'2022-04-23 00:00:03.463000+00:00',
'2022-04-23 00:00:02.396000+00:00',
'2022-04-23 00:00:02.623000+00:01',] # <-- Only difference from the next block is the last digit here
)
df = pd.DataFrame(dict(ts=dts))
ddf = dd.from_pandas(df, npartitions=2)
ddf.sort_values(by='ts').compute()
# Throws TypeError
dts = pd.to_datetime(
['2022-04-23 00:00:00.618000+00:00',
'2022-04-23 00:00:03.199000+00:00',
'2022-04-23 00:00:03.463000+00:00',
'2022-04-23 00:00:02.396000+00:00',
'2022-04-23 00:00:02.623000+00:00',]
)
df = pd.DataFrame(dict(ts=dts))
ddf = dd.from_pandas(df, npartitions=2)
ddf.sort_values(by='ts').compute() # TypeError: value should be a 'Timestamp', 'NaT', or array of those. Got 'StringArray' instead.
Anything else we need to know?: The error originally occurred with a large dataset and many partitions, so it is not related to the small data size in the example.
Environment:
- Dask version: 2022.04.1
- Python version: 3.9.1
- Operating System: Debian GNU/Linux 11 (bullseye)
- Install method (conda, pip, source): pip
Issue Analytics
- State:
- Created a year ago
- Reactions:1
- Comments:8 (3 by maintainers)
@park I have found a workaround for the issue, which involves converting the datetime column to integer and doing the sort (or other operations) on the converted column:
MCVE
Hope it is applicable to your use case.
@JnsLns Thanks for reporting this!
Looks like the first DataFrame has object dtype, while the second has datetime64[ns, UTC]. I think this error is related to compatibility with custom pandas dtypes (there are a few open issues around this). I’ll keep looking into it. 😃
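The dtype difference can be verified with pandas alone; a quick sketch (note that the object-dtype fallback for mixed offsets is pandas-version dependent):

```python
import pandas as pd

# Uniform +00:00 offsets: pandas returns the tz-aware extension dtype,
# which is the case that triggers the TypeError above.
uniform = pd.to_datetime(
    ['2022-04-23 00:00:00.618000+00:00',
     '2022-04-23 00:00:02.623000+00:00'])
print(uniform.dtype)  # datetime64[ns, UTC]

# The first (passing) example mixes +00:00 and +00:01 offsets; in the pandas
# versions current at the time of this issue, pd.to_datetime silently fell
# back to an object-dtype index of Timestamps for such input (newer pandas
# versions instead raise unless utc=True is passed).
```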