map_partitions creates a `foo` column on a phantom partition
It appears that `map_partitions` operates on a phantom partition, which converts all string columns to a constant value of `foo`:
```python
In [1]: import dask.dataframe as dd; import pandas as pd

In [2]: base_data = pd.DataFrame({'id': ['a', 'b', 'c'], 'val': [0, 1, 2]})

In [3]: test = dd.from_pandas(base_data, npartitions=1)

In [4]: def look_inside(df):
   ...:     from IPython import embed; embed()
   ...:     return df
   ...:

In [5]: test.map_partitions(look_inside).compute()
```
In the first IPython session that is triggered, we find:
```python
In [1]: df
Out[1]:
    id  val
0  foo    1
1  foo    1
```
and in the second:
```python
In [1]: df
Out[1]:
  id  val
0  a    0
1  b    1
2  c    2
```
and the final result is correct.
In general, if the final result is correct this would be harmless, but sometimes intermediate processing triggers an error whenever the constant `foo` column is present:
```python
In [6]: def throws_error(df):
   ...:     out = df.groupby('id').count()
   ...:     if out.max()[0] > 1:
   ...:         raise ValueError('Too many ids!')
   ...:     return out

In [7]: test.map_partitions(throws_error).compute()  # <-- raises ValueError
```
This raises, when in fact we know the data should be immune to this error:
```python
In [8]: throws_error(base_data)
Out[8]:
    val
id
a     1
b     1
c     1
```
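A sketch of what presumably happens on the phantom partition (the two-row `foo`/`1` frame shown above): both rows share the same dummy id, so the per-group count exceeds 1 and the check fires.
```python
import pandas as pd

# Hypothetical stand-in for the synthetic partition dask appears to inject.
fake = pd.DataFrame({'id': ['foo', 'foo'], 'val': [1, 1]})

out = fake.groupby('id').count()
print(out)
#      val
# id
# foo    2
#
# out.max()[0] is 2, so throws_error raises 'Too many ids!' on this frame.
```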
I suspect the `_meta_nonempty` function is somehow responsible via this `_simple_fake_mapping`, but I am not familiar enough with the code path to understand why.
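For illustration, here is a minimal sketch of how that dummy frame could be produced, assuming it comes from `meta_nonempty` in `dask.dataframe.utils` (which appears to fill object columns with the placeholder string `'foo'` and numeric columns with `1`):
```python
import pandas as pd
from dask.dataframe.utils import meta_nonempty  # import location may vary by dask version

# Empty frame with the same dtypes as base_data (dask's `_meta`).
meta = pd.DataFrame({'id': pd.Series([], dtype=object),
                     'val': pd.Series([], dtype='int64')})

# meta_nonempty builds a small fake frame from the dtypes alone:
# object columns become 'foo', integer columns become 1.
print(meta_nonempty(meta))
#     id  val
# 0  foo    1
# 1  foo    1
```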
FWIW, I also get this error whenever I exit the above IPython session:
```
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "//anaconda/lib/python3.5/site-packages/IPython/core/history.py", line 785, in writeout_cache
    self._writeout_input_cache(conn)
  File "//anaconda/lib/python3.5/site-packages/IPython/core/history.py", line 769, in _writeout_input_cache
    (self.session_number,)+line)
sqlite3.ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 123145322360832 and this is thread id 140735106551808
```
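That traceback is presumably a side effect of the default threaded scheduler: `look_inside` (and the embedded IPython session) runs in a worker thread, while IPython's history database was created on the main thread. A sketch of a workaround under that assumption, using the single-threaded scheduler (keyword available in recent dask versions):
```python
# Run everything in the main thread so the embedded IPython session
# does not touch the history database from a worker thread.
test.map_partitions(look_inside).compute(scheduler='synchronous')
```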
Top GitHub Comments
It'd be much better to pass in the metadata you expect `map_partitions` to return here. The cleanest thing to do is to pass in `meta=` explicitly if you're able to provide the column names and dtypes.
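A sketch of what that could look like for the `throws_error` example above (the exact `meta` below is an assumption; an empty frame with the expected index name, column names, and dtypes is one way to spell it):
```python
import pandas as pd

# Empty frame describing the expected output of throws_error:
# an 'id'-indexed frame with a single int64 'val' column.
meta = pd.DataFrame({'val': pd.Series([], dtype='int64')},
                    index=pd.Index([], dtype=object, name='id'))

# With meta= supplied, dask no longer needs to run the function on a
# synthetic partition to infer the output schema, so only the real
# partitions ever reach throws_error.
test.map_partitions(throws_error, meta=meta).compute()
```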