IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer on non integer column.
I have a Dask bag with 59 partitions and a chunksize of 100 000 (so roughly 6 million records). I want to convert the Dask bag to a Dask DataFrame and then to a pandas DataFrame. This is my snippet:
%%time
bag = dask_mongo.read_mongo(
    database="XXXXX",
    collection="XXXX",
    connection_kwargs={"XXXXXXXX"},
    chunksize=100000,
)
df = bag.to_dataframe()
df = df.astype('object')
df2 = df.compute()
I tried multiple things: casting every column to dtype object, removing NA values with dropna(). Every time I get this exception:
IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
I understand that it needs to get rid of the NaN values, but I tried multiple approaches with no success.
Computing to a list:
list = bag.compute()
works just fine. But I really want it in a DataFrame so I can do analysis on the data later on.
I even tried to take one column and cast it to float64 (as I read, pandas does not support NaN in integer dtypes, so I tried float64 and the object dtype):
Dask Series Structure:
npartitions=59
float64
...
...
...
...
Name: xxx, dtype: float64
Dask Name: astype, 236 tasks
but even there I get IntCastingNaNError …
After that, I tried to cast all columns to dtype object:
df = df.astype('object')
with no luck either. Same exception. Basically every column raises this exception… Thanks…
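A likely reason `astype('object')` after `to_dataframe()` doesn't help: the cast to the inferred integer dtype happens inside the task graph, before the later `astype` runs, so the error fires first. At the pandas level (pandas ≥ 1.5, where this exception class exists), the behaviour looks like this sketch:

```python
import pandas as pd

s = pd.Series([1, 2, None])   # None becomes NaN, so the dtype is float64

# A plain int64 cast cannot represent the NaN:
try:
    s.astype("int64")
except pd.errors.IntCastingNaNError as e:
    print(type(e).__name__)   # IntCastingNaNError

# float64 and the nullable Int64 extension dtype both hold missing values
print(s.astype("float64").dtype)
print(s.astype("Int64").dtype)
```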
Issue Analytics
- Created 2 years ago
- Comments: 10 (6 by maintainers)
@Christiankoo not sure if this helps now, but another option to get the dictionary for the meta from the DataFrame is, following the example above:
Then you can replace only what you need on that dictionary.
Glad it worked! Yeah that’s a great question. I don’t think Dask bag currently has any functionality to do this. One workaround might be to get the column names from your Dask DataFrame first to dynamically populate the meta argument, something like:
But maybe the others on this issue have a more elegant solution 😃. It might be nice to add a feature to Dask bag that’s analogous to returning columns from a Dask DataFrame, or even just being able to see a preview of the first partition.