Dtypes of the fake partition used for inference in map_partitions are not exactly correct

Consider the following code:

import dask.dataframe
import pandas

def qq(partition_df):
    print(partition_df)
    print(partition_df['X'].dtype)
    return partition_df

df = dask.dataframe.from_pandas(pandas.DataFrame({'X': range(100)}, dtype='int32'), npartitions=1)
print(df['X'].dtype)
df.map_partitions(qq).compute()

The output is:

float32
float64
float32

while I expect to see:

float32
float32
float32

Can the types simply be copied from the original dataframe? How can I distinguish fake partitions from true ones? I have a heavy computation that I would rather not run on fake partitions. In Dask 0.10 the fake partition was empty, but Dask 0.11 passes a partition with two records in it.

Top GitHub Comments

jcrist commented, Aug 29, 2016

The bug causing numeric dtype size inference issues is fixed in #1513.

In [1]: %load temp.py

In [2]: # %load temp.py
import dask.dataframe
import pandas

def qq(partition_df):
    print(partition_df['X'].dtype)
    return partition_df

df = dask.dataframe.from_pandas(pandas.DataFrame({'X': range(100)}, dtype='int32'), npartitions=3)
res = df.map_partitions(qq)
   ...:
int32

In [3]: res.dtypes
Out[3]:
X    int32
dtype: object

In [4]: _ = res.compute()
int32
int32
int32

Note, however, that dtype inference in pandas is tricky because of all the implicit casting that pandas does internally. For most simple user functions we should be able to guess the dtype correctly, but there will be cases where we fail, and in those cases you should provide the meta keyword (see docstring here: http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_partitions). I’ll write up a section for the docs in a separate PR.

frol commented, Sep 3, 2016

~~I have found that I can distinguish fake partitions by looking at the partition_df.is_copy value, which is None for fake partitions and a weakref for true partitions!~~ [For some reason, this is not always true.]
