
map_partitions creates a `foo` column on a phantom partition


It appears that map_partitions first operates on a phantom partition, in which every string column is replaced with the constant value foo:

In [1]: import dask.dataframe as dd; import pandas as pd

In [2]: base_data = pd.DataFrame({'id': ['a', 'b', 'c'], 'val': [0, 1, 2]})

In [3]: test = dd.from_pandas(base_data, npartitions=1)

In [4]: def look_inside(df):
   ...:     from IPython import embed; embed()
   ...:     return df
   ...: 

In [5]: test.map_partitions(look_inside).compute()

In the first IPython session that is triggered, we find:

In [1]: df
Out[1]: 
    id  val
0  foo    1
1  foo    1

and in the second:

In [1]: df
Out[1]: 
  id  val
0  a    0
1  b    1
2  c    2

and the final result is correct.

In general this would be harmless as long as the final result is correct, but user code can raise an error whenever the constant foo column is present:

In [6]: def throws_error(df):
   ...:     out = df.groupby('id').count()
   ...:     if out.max()[0] > 1:
   ...:         raise ValueError('Too many ids!')
   ...:     return out

In [7]: test.map_partitions(throws_error).compute() # <-- raises ValueError

when in fact we know the real data is immune to this error:

In [8]: throws_error(base_data)
Out[8]: 
    val
id     
a     1
b     1
c     1

I suspect the _meta_nonempty function is somehow responsible, via this _simple_fake_mapping, but I'm not familiar enough with the code path to understand why.

FWIW, I also get this error whenever I exit the above IPython session:

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "//anaconda/lib/python3.5/site-packages/IPython/core/history.py", line 785, in writeout_cache
    self._writeout_input_cache(conn)
  File "//anaconda/lib/python3.5/site-packages/IPython/core/history.py", line 769, in _writeout_input_cache
    (self.session_number,)+line)
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 123145322360832 and this is thread id 140735106551808

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

jcrist commented, Apr 27, 2017 (1 reaction)

> Knowing this, I suppose I can write an error check to catch the error I expect to see on the fake block, and just have it pass there.

It'd be much better to pass in the metadata you expect map_partitions to return here.

mrocklin commented, Apr 27, 2017 (1 reaction)

The cleanest thing to do is to pass in meta= explicitly, if you're able to provide the column names and dtypes.
