Unpredictable TypeError when computing on dataframe with datetime column


This is a very strange error that I cannot provide much information about. I have tried various conversions of the time columns (e.g. to int64), but the result does not change.

What happened: A TypeError is thrown when applying sort_values to a very simple dataframe of datetime objects. Very minor changes to the data make the error disappear or reappear (changing a single digit, removing any row of the dataframe, or changing the number of partitions).

Minimal Complete Verifiable Example:

import pandas as pd
import dask.dataframe as dd

# Completes without exception
dts = pd.to_datetime([
    '2022-04-23 00:00:00.618000+00:00',
    '2022-04-23 00:00:03.199000+00:00',
    '2022-04-23 00:00:03.463000+00:00',
    '2022-04-23 00:00:02.396000+00:00',
    '2022-04-23 00:00:02.623000+00:01',  # <-- only difference from the next block: the last digit of the UTC offset
])
df = pd.DataFrame(dict(ts=dts))
ddf = dd.from_pandas(df, npartitions=2)
ddf.sort_values(by='ts').compute()


# Throws TypeError
dts = pd.to_datetime([
    '2022-04-23 00:00:00.618000+00:00',
    '2022-04-23 00:00:03.199000+00:00',
    '2022-04-23 00:00:03.463000+00:00',
    '2022-04-23 00:00:02.396000+00:00',
    '2022-04-23 00:00:02.623000+00:00',
])
df = pd.DataFrame(dict(ts=dts))
ddf = dd.from_pandas(df, npartitions=2)
ddf.sort_values(by='ts').compute()  # TypeError: value should be a 'Timestamp', 'NaT', or array of those. Got 'StringArray' instead.

Anything else we need to know?: The error originally occurred with a large dataset and many partitions, so it is not related to the small data size in the example.

Environment:

  • Dask version: 2022.04.1
  • Python version: 3.9.1
  • Operating System: Debian GNU/Linux 11 (bullseye)
  • Install method (conda, pip, source): pip

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

2 reactions
JnsLns commented, May 6, 2022

@park I have found a workaround for the issue: convert the datetime column to integers and run the sort (or other operations) on the converted column:

MCVE

import dask.dataframe
import pandas as pd

dts = pd.to_datetime([
    '2022-04-23 00:00:00.618000+00:00',
    '2022-04-23 00:00:03.199000+00:00',
    '2022-04-23 00:00:03.463000+00:00',
    '2022-04-23 00:00:02.396000+00:00',
    '2022-04-23 00:00:02.623100+00:00',
])
df = pd.DataFrame(dict(ts=dts))

df['backup'] = df['ts']                      # keep a copy of the original datetime column
df = dask.dataframe.from_pandas(df, npartitions=2)
df['ts'] = df['ts'].values.astype('int64')   # without this -> TypeError
df = df.sort_values(by='ts').compute()       # sort on the integer column
df['ts'] = df['backup']                      # restore the datetime values
df = df.drop(columns='backup')

Hope it is applicable to your use case.
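
For reference, the same idea can be written with a temporary key column instead of overwriting and restoring ts. This is only a sketch: the function name sort_by_integer_key and the column name _sort_key are made up for illustration, and it is shown on a plain pandas DataFrame. Whether it also sidesteps the TypeError on a Dask DataFrame has not been checked here.

```python
import pandas as pd


def sort_by_integer_key(frame, column, key="_sort_key"):
    """Sort `frame` by an integer view of a datetime column.

    `key` is a hypothetical temporary column name; it is dropped
    again before the result is returned.
    """
    # Integer epoch values order the same way as the timestamps themselves.
    keyed = frame.assign(**{key: frame[column].astype("int64")})
    return keyed.sort_values(by=key).drop(columns=key)


dts = pd.to_datetime([
    "2022-04-23 00:00:00.618000+00:00",
    "2022-04-23 00:00:03.199000+00:00",
    "2022-04-23 00:00:03.463000+00:00",
    "2022-04-23 00:00:02.396000+00:00",
    "2022-04-23 00:00:02.623000+00:00",
])
df = pd.DataFrame({"ts": dts})

result = sort_by_integer_key(df, "ts")
print(result)
```

The original ts column is never modified, so no backup column is needed.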

2 reactions
pavithraes commented, May 2, 2022

@JnsLns Thanks for reporting this!

Looks like the first DataFrame has object types, while the second has datetime64[ns, UTC]. I think this error is related to compatibility with custom pandas dtypes (there are a few open issues around this). I’ll keep looking into it. 😃
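
The dtype difference can be inspected in plain pandas. The sketch below is an illustration, not a fix for the Dask error: it parses the uniform-offset strings as in the issue and, as an assumption about what a user might want, parses the mixed-offset strings with utc=True so that both results are timezone-aware. Parsing mixed offsets without utc=True is left out on purpose, because its behavior depends on the pandas version (object dtype, a warning, or an error).

```python
import pandas as pd

# Shortened versions of the two inputs from the issue.
mixed = [
    "2022-04-23 00:00:02.396000+00:00",
    "2022-04-23 00:00:02.623000+00:01",  # different UTC offset
]
uniform = [
    "2022-04-23 00:00:02.396000+00:00",
    "2022-04-23 00:00:02.623000+00:00",
]

# All offsets equal: pandas returns a timezone-aware DatetimeIndex.
uniform_idx = pd.to_datetime(uniform)
print(uniform_idx.dtype)

# utc=True converts every value to UTC, so mixed offsets also
# end up in a single timezone-aware dtype.
normalized_idx = pd.to_datetime(mixed, utc=True)
print(normalized_idx.dtype)
print(normalized_idx[1])
```

Note that with utc=True the +00:01 timestamp is shifted by one minute to its UTC equivalent, so the stored values differ from the literal strings.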

