Dtypes of the fake partition used for inference in map_partitions are not exactly correct
Consider the following code:
import dask.dataframe
import pandas

def qq(partition_df):
    print(partition_df)
    print(partition_df['X'].dtype)
    return partition_df

df = dask.dataframe.from_pandas(pandas.DataFrame({'X': range(100)}, dtype='int32'), npartitions=1)
print(df['X'].dtype)
df.map_partitions(qq).compute()
The output is:
float32
float64
float32
while I expect to see:
float32
float32
float32
Can the types simply be copied from the original dataframe? And how can I distinguish fake partitions from true ones? I have a heavy computation that I would rather not run on fake partitions. In Dask 0.10 the fake partition was empty, but Dask 0.11 passes a partition with two records in it.
Issue Analytics
- State:
- Created 7 years ago
- Comments:6 (6 by maintainers)
Top GitHub Comments
The bug causing numeric dtype size inference issues is fixed in #1513.
Note however that dtype inference in pandas is tricky due to all the implicit casting that pandas will do internally. For most simple user functions we should be able to guess dtype correctly, but there will be cases where we fail, and you should provide the meta keyword (see docstring here: http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_partitions). I’ll write up a section for the docs in a separate PR.

I have found that I can distinguish fake partitions by looking at the partition_df.is_copy value, which is None for fake partitions and weakref for true partitions! [For some reason it is not always true.]