
map_partitions creates a `foo` column on a phantom partition


It appears that map_partitions first operates on a phantom partition, in which every string column is replaced with the constant value foo:

In [1]: import dask.dataframe as dd; import pandas as pd

In [2]: base_data = pd.DataFrame({'id': ['a', 'b', 'c'], 'val': [0, 1, 2]})

In [3]: test = dd.from_pandas(base_data, npartitions=1)

In [4]: def look_inside(df):
   ...:     from IPython import embed; embed()
   ...:     return df
   ...: 

In [5]: test.map_partitions(look_inside).compute()

In the first IPython session that is triggered, we find:

In [1]: df
Out[1]: 
    id  val
0  foo    1
1  foo    1

and in the second:

In [1]: df
Out[1]: 
  id  val
0  a    0
1  b    1
2  c    2

and the final result is correct.

In general this would be harmless as long as the final result is correct, but user code can raise an error whenever the constant foo column is present:

In [6]: def throws_error(df):
   ...:     out = df.groupby('id').count()
   ...:     if out.max()[0] > 1:
   ...:         raise ValueError('Too many ids!')
   ...:     return out

In [7]: test.map_partitions(throws_error).compute() # <-- raises ValueError

when in fact we know the real data is immune to this error:

In [8]: throws_error(base_data)
Out[8]: 
    val
id     
a     1
b     1
c     1

I suspect the _meta_nonempty function is somehow responsible, via this _simple_fake_mapping, but I'm not familiar enough with the code path to understand why.

FWIW, I also get this error whenever I exit the above IPython session:

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "//anaconda/lib/python3.5/site-packages/IPython/core/history.py", line 785, in writeout_cache
    self._writeout_input_cache(conn)
  File "//anaconda/lib/python3.5/site-packages/IPython/core/history.py", line 769, in _writeout_input_cache
    (self.session_number,)+line)
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 123145322360832 and this is thread id 140735106551808

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

jcrist commented, Apr 27, 2017 (1 reaction)

> Knowing this, I suppose I can write an error check to catch the error I expect to see on the fake block, and just have it pass there.

It'd be much better to pass in the metadata you expect map_partitions to return here.

mrocklin commented, Apr 27, 2017 (1 reaction)

The cleanest thing to do is to pass in meta= explicitly, if you're able to provide the column names and dtypes.
