Types of a fake partition in map_partition inferring are not exactly correct


Consider the following code:

import dask.dataframe
import pandas

def qq(partition_df):
    print(partition_df)
    print(partition_df['X'].dtype)
    return partition_df

df = dask.dataframe.from_pandas(pandas.DataFrame({'X': range(100)}, dtype='int32'), npartitions=1)
print(df['X'].dtype)
df.map_partitions(qq).compute()

The output is:

float32
float64
float32

while I expect to see:

float32
float32
float32

Can the types simply be copied from the original dataframe? How can I distinguish fake partitions from real ones? I have a heavy computation that I would rather not run on fake partitions. In Dask 0.10 the fake partition was empty, but Dask 0.11 passes a partition with two records in it.

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

jcrist commented, Aug 29, 2016

The bug causing numeric dtype size inference issues is fixed in #1513.

In [1]: %load test
test.h5  test.py

In [1]: %load temp.py

In [2]: # %load temp.py
import dask.dataframe
import pandas

def qq(partition_df):
    print(partition_df['X'].dtype)
    return partition_df

df = dask.dataframe.from_pandas(pandas.DataFrame({'X': range(100)}, dtype='int32'), npartitions=3)
res = df.map_partitions(qq)
   ...:
int32

In [3]: res.dtypes
Out[3]:
X    int32
dtype: object

In [4]: _ = res.compute()
int32
int32
int32

Note however that dtype inference in pandas is tricky due to all the implicit casting that pandas will do internally. For most simple user functions we should be able to guess dtype correctly, but there will be cases where we fail, and you should provide the meta keyword (see docstring here: http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_partitions). I’ll write up a section for the docs in a separate PR.

frol commented, Sep 3, 2016

I have found that I can distinguish fake partitions by looking at the partition_df.is_copy value, which is None for fake partitions and a weakref for true partitions! [For some reason this does not always hold.]
