BUG: pd.DataFrame.from_records raises KeyError: 0 when multiprocessing a DataFrame over multiple cores
See original GitHub issue

- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
Code Sample, a copy-pastable example
```python
import multiprocessing
from functools import partial
import time

import pandas as pd
import numpy as np
from tqdm.auto import tqdm

tqdm.pandas(desc="Progress bar")


def some_heavy_row_function(row, mult):
    a = row.sum()
    b = a * mult
    time.sleep(1)
    return (a, b)


def fake_func_1(df, **kwargs):
    # Apply some heavy row-wise function
    x_cols = ['A', 'B']
    x = df.progress_apply(some_heavy_row_function, mult=kwargs['mult'], axis=1)
    df[x_cols] = pd.DataFrame(data=x.tolist(), columns=x_cols, index=df.index)  # v1
    # Add some columns
    for c, v in kwargs.items():
        df[c] = v
    return df


def fake_func_2(df, **kwargs):
    # Apply some heavy row-wise function
    x_cols = ['A', 'B']
    x = df.progress_apply(some_heavy_row_function, mult=kwargs['mult'], axis=1)
    df[x_cols] = pd.DataFrame.from_records(data=x, columns=x_cols, index=df.index)  # v2
    # Add some columns
    for c, v in kwargs.items():
        df[c] = v
    return df


def create_fake_df(n_rows):
    return pd.DataFrame(np.random.rand(n_rows, 3), columns=['foo', 'bar', 'baz'])


def parallelize_dataframe(n_cores, df, func, **kwargs):
    with multiprocessing.Pool(processes=n_cores) as pool:
        df_splited = np.array_split(df, n_cores)
        df_processed = pool.map(partial(func, **kwargs), df_splited)
        df = pd.concat(df_processed)
    return df


if __name__ == '__main__':
    # Create an example data frame
    df = create_fake_df(n_rows=20)
    print(df.head())

    # "Fixed" function with three cores
    n_cores = 3
    df_1 = parallelize_dataframe(n_cores, df, fake_func_1, mult=10, x=1)
    print(df_1.head())

    # Bug function with one core works
    n_cores = 1
    df_2 = parallelize_dataframe(n_cores, df, fake_func_2, mult=10, x=1)
    print(df_2.head())

    # Bug function with >1 core crashes
    n_cores = 3
    df_3 = parallelize_dataframe(n_cores, df, fake_func_2, mult=10, x=1)
    print(df_3.head())
```
Problem description
I don't know why it works when I do the small workaround in function 1 (which should be roughly what `from_records` is doing internally), or why `from_records` works with 1 core but not with multiple cores. Maybe this is obvious to some; to me it looks like a bug, which is why I am reporting it. It may also help others trying similar things. This error occurs under Python 3.6 and 3.9; I tested pandas versions 1.0.2 and 1.2.4 respectively.
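For anyone hitting the same error, the workaround in function 1 can be isolated: materializing the `Series` returned by `apply` as a plain list of tuples before building the frame avoids the label lookup entirely. A minimal sketch (the values and the non-zero-based index are invented to mimic the chunk a worker receives, since `np.array_split` preserves the original row labels):

```python
import pandas as pd

# Stand-in for the apply() result inside a worker: a Series of tuples
# whose index does NOT start at 0, as happens for every chunk after the
# first one handed out by np.array_split.
x = pd.Series([(0.5, 5.0), (0.7, 7.0)], index=[7, 8])

# Workaround: hand from_records a plain list of tuples instead of the
# Series, so no label-based data[0] probe can fail.
df_ab = pd.DataFrame.from_records(data=x.tolist(), columns=['A', 'B'], index=x.index)
print(df_ab)
```

This is what the `pd.DataFrame(data=x.tolist(), ...)` line in `fake_func_1` effectively does as well.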
Output
|   | foo | bar | baz |
|---|---|---|---|
| 0 | 0.730996 | 0.206051 | 0.038810 |
| 1 | 0.369668 | 0.024069 | 0.383633 |
| 2 | 0.769936 | 0.758607 | 0.568493 |
| 3 | 0.304604 | 0.213722 | 0.612195 |
| 4 | 0.674473 | 0.646581 | 0.322751 |
Progress bar: 100%|████████████████████████████████| 6/6 [00:06<00:00, 1.00s/it]
Progress bar: 100%|████████████████████████████████| 7/7 [00:07<00:00, 1.00s/it]
Progress bar: 100%|████████████████████████████████| 7/7 [00:07<00:00, 1.00s/it]
|   | foo | bar | baz | A | B | mult | x |
|---|---|---|---|---|---|---|---|
| 0 | 0.730996 | 0.206051 | 0.038810 | 0.975858 | 9.758577 | 10 | 1 |
| 1 | 0.369668 | 0.024069 | 0.383633 | 0.777371 | 7.773708 | 10 | 1 |
| 2 | 0.769936 | 0.758607 | 0.568493 | 2.097036 | 20.970361 | 10 | 1 |
| 3 | 0.304604 | 0.213722 | 0.612195 | 1.130520 | 11.305202 | 10 | 1 |
| 4 | 0.674473 | 0.646581 | 0.322751 | 1.643804 | 16.438041 | 10 | 1 |
Progress bar: 100%|████████████████████████████████| 20/20 [00:20<00:00, 1.00s/it]
|   | foo | bar | baz | A | B | mult | x |
|---|---|---|---|---|---|---|---|
| 0 | 0.730996 | 0.206051 | 0.038810 | 0.975858 | 9.758577 | 10 | 1 |
| 1 | 0.369668 | 0.024069 | 0.383633 | 0.777371 | 7.773708 | 10 | 1 |
| 2 | 0.769936 | 0.758607 | 0.568493 | 2.097036 | 20.970361 | 10 | 1 |
| 3 | 0.304604 | 0.213722 | 0.612195 | 1.130520 | 11.305202 | 10 | 1 |
| 4 | 0.674473 | 0.646581 | 0.322751 | 1.643804 | 16.438041 | 10 | 1 |
Progress bar: 100%|████████████████████████████████| 6/6 [00:06<00:00, 1.00s/it]
Progress bar: 100%|████████████████████████████████| 7/7 [00:07<00:00, 1.00s/it]
Progress bar: 100%|████████████████████████████████| 7/7 [00:07<00:00, 1.00s/it]
```
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:\…\conda\conda\envs\py3_9\lib\site-packages\pandas\core\indexes\range.py", line 351, in get_loc
    return self._range.index(new_key)
ValueError: 0 is not in range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\…\conda\conda\envs\py3_9\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\…\conda\conda\envs\py3_9\lib\multiprocessing\pool.py", line 48, in mapstar
    return list(map(*args))
  File "paralell2.py", line 34, in fake_func_2
    df[x_cols] = pd.DataFrame.from_records(data=x, columns=x_cols, index=df.index)
  File "C:\…\conda\conda\envs\py3_9\lib\site-packages\pandas\core\frame.py", line 1855, in from_records
    arrays, arr_columns = to_arrays(data, columns, coerce_float=coerce_float)
  File "C:\…\conda\conda\envs\py3_9\lib\site-packages\pandas\core\internals\construction.py", line 527, in to_arrays
    if isinstance(data[0], (list, tuple)):
  File "C:\…\conda\conda\envs\py3_9\lib\site-packages\pandas\core\series.py", line 853, in __getitem__
    return self._get_value(key)
  File "C:\…\conda\conda\envs\py3_9\lib\site-packages\pandas\core\series.py", line 961, in _get_value
    loc = self.index.get_loc(label)
  File "C:\…\conda\conda\envs\py3_9\lib\site-packages\pandas\core\indexes\range.py", line 353, in get_loc
    raise KeyError(key) from err
KeyError: 0
"""
```
Output of pd.show_versions()
```
INSTALLED VERSIONS
commit           : 2cb96529396d93b46abab7bbc73a208e708c642e
python           : 3.9.4.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.17763
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_United States.1252

pandas           : 1.2.4
numpy            : 1.20.2
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 21.1.1
setuptools       : 49.6.0.post20210108
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.0.0
IPython          : 7.23.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.6.3
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
```
Issue Analytics
- State:
- Created: 2 years ago
- Comments: 7 (3 by maintainers)
@FloBay @ZurMaD I am not sure this is a bug. A `Series` is not a valid input to `from_records` (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_records.html). It fails on the second core because the split chunk's index no longer contains the label 0.
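The failure can be reproduced without any multiprocessing: `[]` on a `Series` is a label lookup, so probing element 0 of a chunk whose index starts elsewhere raises the same `KeyError`. A minimal sketch (values invented):

```python
import pandas as pd

# Mimic the chunk a second worker receives: np.array_split keeps the
# original RangeIndex, so the labels here are 7 and 8, not 0 and 1.
s = pd.Series([(1.0, 10.0), (2.0, 20.0)], index=range(7, 9))

# from_records probes data[0] to sniff the record type; on a Series
# that is a label lookup, and the label 0 is absent from this index.
try:
    s[0]
    raised = False
except KeyError:
    raised = True
print(raised)  # → True
```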
Well, the code fails here on the second core while executing

https://github.com/pandas-dev/pandas/blob/09f3bf8083d737610aa1001f2668425be518f8f0/pandas/core/internals/construction.py#L746-L797
https://github.com/pandas-dev/pandas/blob/09f3bf8083d737610aa1001f2668425be518f8f0/pandas/core/indexes/range.py#L379-L389

and the complete error is shown above. Editing
"/home/user/.local/lib/python3.9/site-packages/pandas/core/indexes/range.py"
to add `print('RANGE: ', self._range, 'NEW_KEY: ', new_key)` before `return self._range.index(new_key)` shows that `self._range` looks broken: each of the three worker processes reports a different range. With the other method (`pd.DataFrame`), pandas never runs the `self._range` code path.

It looks like iterating over the index after splitting the DataFrame breaks the code. A PR would be needed to handle this safely under multiprocessing by fixing up the ranges; some checks would have to be added in
/home/user/.local/lib/python3.9/site-packages/pandas/core/internals/construction.py
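Until something like that lands, a practical mitigation on the user side is to reset each chunk's index before the `from_records` call, so the probe for label 0 can succeed. A sketch using `iloc` slicing to make the preserved labels explicit (the chunk boundaries mimic a 20-row frame split three ways, as in the report):

```python
import pandas as pd

df = pd.DataFrame({'foo': range(20)})

# Slice out the chunk a second worker would receive; like np.array_split,
# iloc slicing keeps the rows' original index labels (7..13 here).
chunk = df.iloc[7:14]
print(list(chunk.index)[:3])  # → [7, 8, 9]; the label 0 is absent

# reset_index gives the chunk a fresh 0-based RangeIndex, so a label
# probe like data[0] inside from_records succeeds again.
safe_chunk = chunk.reset_index(drop=True)
print(list(safe_chunk.index)[:3])  # → [0, 1, 2]
```

The cost is losing the original row labels, which matters if the chunks are later reassembled with `pd.concat` and the global index is relied upon.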