set_index on datetime column with microsecond precision removes one row from dataframe

What happened:

I created a pandas dataframe with shape (4, 2). One of the columns is a datetime with microsecond precision.

If I create a dask dataframe from this pandas dataframe and call set_index on the datetime column, the resulting dataframe has only 3 rows: [image]

What you expected to happen:

If I comment out the line where we convert the index column to a datetime, then I get a dataframe with 4 rows: [image]

I expected the original code snippet to produce the same output (i.e. 4 row dataframe), but with the index as a datetime instead of a number of microseconds since 1970-01-01.

As a workaround, if I convert the column to datetime after setting the index, then I get what I really want: [image]

Minimal Complete Verifiable Example:

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame(
    [
        [1567703791155681, 1],
        [1567703792155681, 2],
        [1567703790155681, 0],
        [1567703793155681, 3],
    ],
    columns=["ts", "rank"]
)
df.ts = pd.to_datetime(df.ts, unit='us')  # comment out this line to get a dataframe with 4 rows

ddf = dd.from_pandas(df, npartitions=2)

ddf = ddf.set_index("ts")

ddf.compute()

Anything else we need to know?:

I suspect the problem comes from the fact that the divisions of the dask dataframe after set_index are as follows: [image]

Notice that the divisions are precise to the nanosecond and that the first division is bigger than the timestamp of the row we've lost by 24 nanoseconds. This looks like numerical imprecision in the microseconds-to-datetime conversion, maybe?
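
A 24 ns offset is what a round trip through a 64-bit float would produce for this timestamp. The sketch below is only an illustration under that assumption; the issue does not confirm where the imprecision enters. It converts the lost row's timestamp to nanoseconds with exact integer arithmetic, then passes it through Python's `float`, which near 1.5e18 can only represent multiples of 256.

```python
# Timestamp of the lost row, in microseconds since 1970-01-01 (from the example above).
ts_us = 1567703790155681

# Exact value in nanoseconds, using integer arithmetic only.
ts_ns = ts_us * 1000

# Assumption (not confirmed by the issue): the value passes through a float64 somewhere.
# Between 2**60 and 2**61, adjacent float64 values are 256 apart, so the value is rounded.
roundtrip_ns = int(float(ts_ns))

drift_ns = roundtrip_ns - ts_ns
print(drift_ns)  # 24
```

If a division were computed this way, it would sit 24 ns above the smallest timestamp, so that row would fall just below the first division.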

Environment:

  • Dask version: 2.30.0
  • Python version: 3.8.6
  • Operating System: Linux (FROM python:3.8.6-buster in Dockerfile)
  • Install method (conda, pip, source): pip

n.b. I ran the above snippet in a jupyter notebook.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

rjzamora commented, Nov 20, 2020 (1 reaction)

Quoting dtourillon: "However, I am pretty sure (but not 100% certain) that I have also lost data (on my real 399 million rows dataframe) based on numerical imprecision of the last division… same problem with last and first divisions?"

Yeah - I wouldn’t be surprised if there were knock-on effects that I am not thinking of. I don’t feel confident that correcting the first division will cover all cases, but my intuition tells me that error in the last division isn’t likely to cause a problem. Note that any element falling beyond the threshold of the last division will be reassigned to the last partition anyway.
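
The asymmetry described here can be pictured with a toy model. This is not Dask's actual code: `assign_partition` is a hypothetical helper, and the rule for values below the first division is an assumption made to match the row loss reported in the issue. Values at or beyond the last division are clipped into the last partition, while values below the first division get no partition at all.

```python
from bisect import bisect_right
from typing import Optional, Sequence


def assign_partition(value: int, divisions: Sequence[int]) -> Optional[int]:
    """Hypothetical helper: map a value to a partition index given sorted divisions."""
    npartitions = len(divisions) - 1
    # Index of the last division that is <= value; -1 if value is below all of them.
    idx = bisect_right(divisions, value) - 1
    if idx < 0:
        # Assumed behaviour: a value below the first division has no home.
        return None
    # Values at or beyond the last division are clipped into the last partition.
    return min(idx, npartitions - 1)


divisions = [10, 20, 30]  # two partitions: [10, 20) and [20, 30]
for value in (5, 10, 25, 30, 35):
    print(value, assign_partition(value, divisions))
```

In this model a last division that is slightly too small is harmless, because the clipping step catches the overflow, whereas a first division that is slightly too large leaves the smallest value unassigned.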

dtourillon commented, Nov 20, 2020 (0 reactions)

Alright, hopefully fixing the first division fixes the fact that the overall index is not sorted in the end… You certainly know better than I do! 😉

However, I am pretty sure (but not 100% certain) that I have also lost data (on my real 399 million rows dataframe) based on numerical imprecision of the last division… same problem with last and first divisions?

Oh, and thanks for such a fast analysis and answer to my issue… greatly appreciated!
