BUG: pd.DataFrame.from_records raises KeyError: 0 when multiprocessing a DataFrame over multiple cores
See original GitHub issue

- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
Code Sample, a copy-pastable example
```python
import multiprocessing
from functools import partial
import time

import pandas as pd
import numpy as np
from tqdm.auto import tqdm

tqdm.pandas(desc="Progress bar")


def some_heavy_row_function(row, mult):
    a = row.sum()
    b = a * mult
    time.sleep(1)
    return (a, b)


def fake_func_1(df, **kwargs):
    # Apply some heavy row-wise function
    x_cols = ['A', 'B']
    x = df.progress_apply(some_heavy_row_function, mult=kwargs['mult'], axis=1)
    df[x_cols] = pd.DataFrame(data=x.tolist(), columns=x_cols, index=df.index)  # v1
    # Add some columns
    for c, v in kwargs.items():
        df[c] = v
    return df


def fake_func_2(df, **kwargs):
    # Apply some heavy row-wise function
    x_cols = ['A', 'B']
    x = df.progress_apply(some_heavy_row_function, mult=kwargs['mult'], axis=1)
    df[x_cols] = pd.DataFrame.from_records(data=x, columns=x_cols, index=df.index)  # v2
    # Add some columns
    for c, v in kwargs.items():
        df[c] = v
    return df


def create_fake_df(n_rows):
    return pd.DataFrame(np.random.rand(n_rows, 3), columns=['foo', 'bar', 'baz'])


def parallelize_dataframe(n_cores, df, func, **kwargs):
    with multiprocessing.Pool(processes=n_cores) as pool:
        df_splited = np.array_split(df, n_cores)
        df_processed = pool.map(partial(func, **kwargs), df_splited)
        df = pd.concat(df_processed)
    return df


if __name__ == '__main__':
    # Create an example data frame
    df = create_fake_df(n_rows=20)
    print(df.head())

    # "Fixed" function with three cores
    n_cores = 3
    df_1 = parallelize_dataframe(n_cores, df, fake_func_1, mult=10, x=1)
    print(df_1.head())

    # Bug function with one core works
    n_cores = 1
    df_2 = parallelize_dataframe(n_cores, df, fake_func_2, mult=10, x=1)
    print(df_2.head())

    # Bug function with >1 core crashes
    n_cores = 3
    df_3 = parallelize_dataframe(n_cores, df, fake_func_2, mult=10, x=1)
    print(df_3.head())
```
Problem description
I don't know why it works when I do the small workaround in function 1 (which should be roughly what `from_records` is doing internally), or why `from_records` works with 1 core but not with multiple cores. Maybe this is obvious to some; to me it looks like a bug, which is why I am reporting it. It may also help others trying similar things. This error occurs under Python 3.6 and 3.9; I tested pandas versions 1.0.2 and 1.2.4 respectively.
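For anyone hitting the same error, the workaround in function 1 can be isolated: materializing the `Series` returned by `apply` as a plain list of tuples before building the frame avoids the label lookup entirely. A minimal sketch (the values and the non-zero-based index are invented to mimic the chunk a worker receives, since `np.array_split` preserves the original row labels):

```python
import pandas as pd

# Stand-in for the apply() result inside a worker: a Series of tuples
# whose index does NOT start at 0, as happens for every chunk after the
# first one handed out by np.array_split.
x = pd.Series([(0.5, 5.0), (0.7, 7.0)], index=[7, 8])

# Workaround: hand from_records a plain list of tuples instead of the
# Series, so no label-based data[0] probe can fail.
df_ab = pd.DataFrame.from_records(data=x.tolist(), columns=['A', 'B'], index=x.index)
print(df_ab)
```

This is what the `pd.DataFrame(data=x.tolist(), ...)` line in `fake_func_1` effectively does as well.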
Output
|   | foo | bar | baz |
|---|---|---|---|
| 0 | 0.730996 | 0.206051 | 0.038810 |
| 1 | 0.369668 | 0.024069 | 0.383633 |
| 2 | 0.769936 | 0.758607 | 0.568493 |
| 3 | 0.304604 | 0.213722 | 0.612195 |
| 4 | 0.674473 | 0.646581 | 0.322751 |
Progress bar: 100%|████████████████████████████████| 6/6 [00:06<00:00, 1.00s/it]
Progress bar: 100%|████████████████████████████████| 7/7 [00:07<00:00, 1.00s/it]
Progress bar: 100%|████████████████████████████████| 7/7 [00:07<00:00, 1.00s/it]
|   | foo | bar | baz | A | B | mult | x |
|---|---|---|---|---|---|---|---|
| 0 | 0.730996 | 0.206051 | 0.038810 | 0.975858 | 9.758577 | 10 | 1 |
| 1 | 0.369668 | 0.024069 | 0.383633 | 0.777371 | 7.773708 | 10 | 1 |
| 2 | 0.769936 | 0.758607 | 0.568493 | 2.097036 | 20.970361 | 10 | 1 |
| 3 | 0.304604 | 0.213722 | 0.612195 | 1.130520 | 11.305202 | 10 | 1 |
| 4 | 0.674473 | 0.646581 | 0.322751 | 1.643804 | 16.438041 | 10 | 1 |
Progress bar: 100%|████████████████████████████████| 20/20 [00:20<00:00, 1.00s/it]
|   | foo | bar | baz | A | B | mult | x |
|---|---|---|---|---|---|---|---|
| 0 | 0.730996 | 0.206051 | 0.038810 | 0.975858 | 9.758577 | 10 | 1 |
| 1 | 0.369668 | 0.024069 | 0.383633 | 0.777371 | 7.773708 | 10 | 1 |
| 2 | 0.769936 | 0.758607 | 0.568493 | 2.097036 | 20.970361 | 10 | 1 |
| 3 | 0.304604 | 0.213722 | 0.612195 | 1.130520 | 11.305202 | 10 | 1 |
| 4 | 0.674473 | 0.646581 | 0.322751 | 1.643804 | 16.438041 | 10 | 1 |
Progress bar: 100%|████████████████████████████████| 6/6 [00:06<00:00, 1.00s/it]
Progress bar: 100%|████████████████████████████████| 7/7 [00:07<00:00, 1.00s/it]
Progress bar: 100%|████████████████████████████████| 7/7 [00:07<00:00, 1.00s/it]
```
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\…\conda\conda\envs\py3_9\lib\site-packages\pandas\core\indexes\range.py", line 351, in get_loc
    return self._range.index(new_key)
ValueError: 0 is not in range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\…\conda\conda\envs\py3_9\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\…\conda\conda\envs\py3_9\lib\multiprocessing\pool.py", line 48, in mapstar
    return list(map(*args))
  File "paralell2.py", line 34, in fake_func_2
    df[x_cols] = pd.DataFrame.from_records(data=x, columns=x_cols, index=df.index)
  File "C:\…\conda\conda\envs\py3_9\lib\site-packages\pandas\core\frame.py", line 1855, in from_records
    arrays, arr_columns = to_arrays(data, columns, coerce_float=coerce_float)
  File "C:\…\conda\conda\envs\py3_9\lib\site-packages\pandas\core\internals\construction.py", line 527, in to_arrays
    if isinstance(data[0], (list, tuple)):
  File "C:\…\conda\conda\envs\py3_9\lib\site-packages\pandas\core\series.py", line 853, in __getitem__
    return self._get_value(key)
  File "C:\…\conda\conda\envs\py3_9\lib\site-packages\pandas\core\series.py", line 961, in _get_value
    loc = self.index.get_loc(label)
  File "C:\…\conda\conda\envs\py3_9\lib\site-packages\pandas\core\indexes\range.py", line 353, in get_loc
    raise KeyError(key) from err
KeyError: 0
"""
```
Output of pd.show_versions()
```
INSTALLED VERSIONS
commit           : 2cb96529396d93b46abab7bbc73a208e708c642e
python           : 3.9.4.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.17763
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_United States.1252

pandas           : 1.2.4
numpy            : 1.20.2
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 21.1.1
setuptools       : 49.6.0.post20210108
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.0.0
IPython          : 7.23.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.6.3
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
```
Issue Analytics
- State:
- Created: 2 years ago
- Comments: 7 (3 by maintainers)
@FloBay @ZurMaD I am not sure this is a bug. A `Series` is not a valid input to `from_records` (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_records.html). It fails on the second core because the split chunk's index no longer contains the label 0.
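The failure can be reproduced without any multiprocessing: `[]` on a `Series` is a label lookup, so probing element 0 of a chunk whose index starts elsewhere raises the same `KeyError`. A minimal sketch (values invented):

```python
import pandas as pd

# Mimic the chunk a second worker receives: np.array_split keeps the
# original RangeIndex, so the labels here are 7 and 8, not 0 and 1.
s = pd.Series([(1.0, 10.0), (2.0, 20.0)], index=range(7, 9))

# from_records probes data[0] to sniff the record type; on a Series
# that is a label lookup, and the label 0 is absent from this index.
try:
    s[0]
    raised = False
except KeyError:
    raised = True
print(raised)  # → True
```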
Well, the code fails here on the second core while executing

https://github.com/pandas-dev/pandas/blob/09f3bf8083d737610aa1001f2668425be518f8f0/pandas/core/internals/construction.py#L746-L797
https://github.com/pandas-dev/pandas/blob/09f3bf8083d737610aa1001f2668425be518f8f0/pandas/core/indexes/range.py#L379-L389

and the complete error is shown above. Editing
"/home/user/.local/lib/python3.9/site-packages/pandas/core/indexes/range.py"
to add `print('RANGE: ', self._range, 'NEW_KEY: ', new_key)` before `return self._range.index(new_key)` shows that `self._range` looks broken: each of the three worker processes reports a different range. With the other method (`pd.DataFrame`), pandas never runs the `self._range` code path.

It looks like iterating over the index after splitting the DataFrame breaks the code. A PR would be needed to handle this safely under multiprocessing by fixing up the ranges; some checks would have to be added in
/home/user/.local/lib/python3.9/site-packages/pandas/core/internals/construction.py
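Until something like that lands, a practical mitigation on the user side is to reset each chunk's index before the `from_records` call, so the probe for label 0 can succeed. A sketch using `iloc` slicing to make the preserved labels explicit (the chunk boundaries mimic a 20-row frame split three ways, as in the report):

```python
import pandas as pd

df = pd.DataFrame({'foo': range(20)})

# Slice out the chunk a second worker would receive; like np.array_split,
# iloc slicing keeps the rows' original index labels (7..13 here).
chunk = df.iloc[7:14]
print(list(chunk.index)[:3])  # → [7, 8, 9]; the label 0 is absent

# reset_index gives the chunk a fresh 0-based RangeIndex, so a label
# probe like data[0] inside from_records succeeds again.
safe_chunk = chunk.reset_index(drop=True)
print(list(safe_chunk.index)[:3])  # → [0, 1, 2]
```

The cost is losing the original row labels, which matters if the chunks are later reassembled with `pd.concat` and the global index is relied upon.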