IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer on non integer column.
I have a Dask bag with 59 partitions and a chunksize of 100 000 (so roughly 6 million records). I want to convert the Dask bag to a Dask DataFrame and then to a pandas DataFrame. This is my snippet:
%%time
bag = dask_mongo.read_mongo(
    database="XXXXX",
    collection="XXXX",
    connection_kwargs={"XXXXXXXX"},
    chunksize=100000,
)
df = bag.to_dataframe()
df = df.astype('object')
df2 = df.compute()
I tried multiple things: casting every column to dtype object, removing NA values with dropna(). Every time I get this exception:
IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
I understand that it needs to get rid of the NaN values, but I tried multiple approaches with no success.
Computing to a list:
list = bag.compute()
works just fine. But I really want it in a DataFrame so I can do analysis on the data later on.
I even tried to take one column and cast it to float64 (as I read, pandas does not support NaN in integer dtypes, so I tried float64 and the object dtype):
Dask Series Structure:
npartitions=59
float64
...
...
...
...
Name: xxx, dtype: float64
Dask Name: astype, 236 tasks
but even there I get IntCastingNaNError …
After that, I tried to cast all columns to dtype object:
df = df.astype('object')
with no luck either. Same exception. Basically every column raises this exception… Thanks…
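A likely reason `astype('object')` after `to_dataframe()` doesn't help: the cast to the inferred integer dtype happens inside the task graph, before the later `astype` runs, so the error fires first. At the pandas level (pandas ≥ 1.5, where this exception class exists), the behaviour looks like this sketch:

```python
import pandas as pd

s = pd.Series([1, 2, None])   # None becomes NaN, so the dtype is float64

# A plain int64 cast cannot represent the NaN:
try:
    s.astype("int64")
except pd.errors.IntCastingNaNError as e:
    print(type(e).__name__)   # IntCastingNaNError

# float64 and the nullable Int64 extension dtype both hold missing values
print(s.astype("float64").dtype)
print(s.astype("Int64").dtype)
```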
Issue Analytics
- Created 2 years ago
- Comments: 10 (6 by maintainers)
@Christiankoo not sure if this helps now, but another option to get the dictionary for the meta from the DataFrame is, following the example above:
Then you can replace only what you need on that dictionary.
Glad it worked! Yeah that’s a great question. I don’t think Dask bag currently has any functionality to do this. One workaround might be to get the column names from your Dask DataFrame first to dynamically populate the meta argument, something like:
But maybe the others on this issue have a more elegant solution 😃. It might be nice to add a feature to Dask bag that’s analogous to returning columns from a Dask DataFrame, or even just being able to see a preview of the first partition.