BUG: pd.DataFrame.from_records raises key error 0 when multiprocessing a data frame over multiple cores

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

Code Sample, a copy-pastable example

import multiprocessing
from functools import partial
import time

import pandas as pd
import numpy as np
from tqdm.auto import tqdm

tqdm.pandas(desc="Progress bar")


def some_heavy_row_function(row, mult):
    a = row.sum()
    b = a*mult
    time.sleep(1)
    return (a,b)


def fake_func_1(df, **kwargs):
    # Apply some heavy row wise function
    x_cols = ['A','B']
    x = df.progress_apply(some_heavy_row_function, mult=kwargs['mult'], axis=1)
    df[x_cols] = pd.DataFrame(data=x.tolist(), columns=x_cols, index=df.index)  # v1
    # add the remaining keyword arguments as constant columns
    for c,v in kwargs.items():
        df[c] = v
    return df


def fake_func_2(df, **kwargs):
    # Apply some heavy row wise function
    x_cols = ['A','B']
    x = df.progress_apply(some_heavy_row_function, mult=kwargs['mult'], axis=1)
    df[x_cols] = pd.DataFrame.from_records(data=x, columns=x_cols, index=df.index)  # v2
    # add the remaining keyword arguments as constant columns
    for c,v in kwargs.items():
        df[c] = v
    return df


def create_fake_df(n_rows):
    return pd.DataFrame(np.random.rand(n_rows, 3), columns=['foo','bar','baz'])


def parallelize_dataframe(n_cores, df, func, **kwargs):
    with multiprocessing.Pool(processes=n_cores) as pool:
        df_splited = np.array_split(df, n_cores)
        df_processed = pool.map(partial(func, **kwargs), df_splited)
        df = pd.concat(df_processed)
    return df
        

if __name__ == '__main__':

    # Create an example data frame
    df = create_fake_df(n_rows = 20)
    print(df.head())
    
    # "Fixed" function with three cores 
    n_cores = 3
    df_1 = parallelize_dataframe(n_cores, df, fake_func_1, mult=10, x=1) 
    print(df_1.head())
    
    # bug function with one core works
    n_cores = 1
    df_2 = parallelize_dataframe(n_cores, df, fake_func_2, mult=10, x=1) 
    print(df_2.head())
    
    # bug function with >1 core crashes
    n_cores = 3
    df_3 = parallelize_dataframe(n_cores, df, fake_func_2, mult=10, x=1) 
    print(df_3.head())  

Problem description

I don’t know why the small workaround in fake_func_1 works (it should be roughly what from_records does), or why from_records works with one core but not with several. Maybe this is obvious to some, but to me it looks like a bug, which is why I am reporting it. It may also help others trying similar things. The error occurs under Python 3.6 and 3.9; I tested pandas versions 1.0.2 and 1.2.4, respectively.
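The difference between v1 and v2 can be sketched without multiprocessing. This is a minimal illustration, not the code from the issue: `row_func` and the four-row frame are made up for the example, and `df.iloc[2:]` stands in for a chunk handed to a worker. The point it tries to show is that the Series returned by `apply` keeps the chunk's row labels, while `x.tolist()` yields a plain list of tuples that carries no labels at all.

```python
import numpy as np
import pandas as pd


def row_func(row, mult):
    # Hypothetical stand-in for some_heavy_row_function, without the sleep.
    total = row.sum()
    return (total, total * mult)


df = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=["foo", "bar", "baz"])

# Simulate the second chunk of a split: its row labels are 2 and 3, not 0 and 1.
chunk = df.iloc[2:]

# x is a Series of tuples that carries the chunk's labels.
x = chunk.apply(row_func, mult=10, axis=1)
print(list(x.index))   # labels of the chunk; 0 is not among them

# v1: a plain list of tuples has no labels, so nothing is looked up by label 0.
fixed = pd.DataFrame(x.tolist(), columns=["A", "B"], index=chunk.index)
print(fixed)
```

Under this reading, v1 works on every chunk because the labels are reattached explicitly through `index=chunk.index` after the list has been built.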

Output

        foo       bar       baz
0  0.730996  0.206051  0.038810
1  0.369668  0.024069  0.383633
2  0.769936  0.758607  0.568493
3  0.304604  0.213722  0.612195
4  0.674473  0.646581  0.322751

Progress bar: 100%|████████████████████████████████| 6/6 [00:06<00:00, 1.00s/it]
Progress bar: 100%|████████████████████████████████| 7/7 [00:07<00:00, 1.00s/it]
Progress bar: 100%|████████████████████████████████| 7/7 [00:07<00:00, 1.00s/it]

        foo       bar       baz         A          B  mult  x
0  0.730996  0.206051  0.038810  0.975858   9.758577    10  1
1  0.369668  0.024069  0.383633  0.777371   7.773708    10  1
2  0.769936  0.758607  0.568493  2.097036  20.970361    10  1
3  0.304604  0.213722  0.612195  1.130520  11.305202    10  1
4  0.674473  0.646581  0.322751  1.643804  16.438041    10  1

Progress bar: 100%|████████████████████████████████| 20/20 [00:20<00:00, 1.00s/it]

        foo       bar       baz         A          B  mult  x
0  0.730996  0.206051  0.038810  0.975858   9.758577    10  1
1  0.369668  0.024069  0.383633  0.777371   7.773708    10  1
2  0.769936  0.758607  0.568493  2.097036  20.970361    10  1
3  0.304604  0.213722  0.612195  1.130520  11.305202    10  1
4  0.674473  0.646581  0.322751  1.643804  16.438041    10  1

Progress bar: 100%|████████████████████████████████| 6/6 [00:06<00:00, 1.00s/it]
Progress bar: 100%|████████████████████████████████| 7/7 [00:07<00:00, 1.00s/it]
Progress bar: 100%|████████████████████████████████| 7/7 [00:07<00:00, 1.00s/it]

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "C:.…\conda\conda\envs\py3_9\lib\site-packages\pandas\core\indexes\range.py", line 351, in get_loc
    return self._range.index(new_key)
ValueError: 0 is not in range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:.…\conda\conda\envs\py3_9\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:.…\conda\conda\envs\py3_9\lib\multiprocessing\pool.py", line 48, in mapstar
    return list(map(*args))
  File "paralell2.py", line 34, in fake_func_2
    df[x_cols] = pd.DataFrame.from_records(data=x, columns=x_cols, index=df.index)
  File "C:.…\conda\conda\envs\py3_9\lib\site-packages\pandas\core\frame.py", line 1855, in from_records
    arrays, arr_columns = to_arrays(data, columns, coerce_float=coerce_float)
  File "C:.…\conda\conda\envs\py3_9\lib\site-packages\pandas\core\internals\construction.py", line 527, in to_arrays
    if isinstance(data[0], (list, tuple)):
  File "C:.…\conda\conda\envs\py3_9\lib\site-packages\pandas\core\series.py", line 853, in __getitem__
    return self._get_value(key)
  File "C:.…\conda\conda\envs\py3_9\lib\site-packages\pandas\core\series.py", line 961, in _get_value
    loc = self.index.get_loc(label)
  File "C:.…\conda\conda\envs\py3_9\lib\site-packages\pandas\core\indexes\range.py", line 353, in get_loc
    raise KeyError(key) from err
KeyError: 0

Output of pd.show_versions()

INSTALLED VERSIONS

commit           : 2cb96529396d93b46abab7bbc73a208e708c642e
python           : 3.9.4.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.17763
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_United States.1252

pandas           : 1.2.4
numpy            : 1.20.2
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 21.1.1
setuptools       : 49.6.0.post20210108
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.0.0
IPython          : 7.23.1
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.6.3
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
mzeitlin11 commented, May 24, 2021

@FloBay @ZurMaD I am not sure this is a bug. A Series is not a valid input to from_records (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_records.html). I think this fails for the same reason that this does:

x = pd.Series([1], index=[1])
df = pd.DataFrame.from_records(x)

So it fails on the second core because the split array does not contain the index 0.
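A short sketch of the point made in this comment, assuming the traceback line `data[0]` is the relevant lookup: on a Series with an integer index, `x[0]` is a label lookup, so it fails whenever label 0 is absent. Converting the Series to a list of tuples first gives from_records an input it documents as supported. The variable names below are illustrative only.

```python
import pandas as pd

# One record whose only index label is 1, as in the snippet above.
x = pd.Series([(1, 2)], index=[1])

# x[0] looks up the *label* 0, which this Series does not have.
try:
    x[0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True
print("label lookup failed:", label_lookup_failed)

# Positional access does not depend on the labels.
first_record = x.iloc[0]
print(first_record)

# A list of tuples is a documented input for from_records; the labels are
# passed back in explicitly through the index argument.
records = pd.DataFrame.from_records(x.tolist(), columns=["A", "B"], index=x.index)
print(records)
```

Passing `x.tolist()` keeps the call to from_records while avoiding the label lookup on the Series; the labels survive because they are supplied separately.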

1 reaction
pablodz commented, May 23, 2021

Well, the code fails on the second core when trying to execute

> pd.DataFrame.from_records(data=x, columns=x_cols, index=df.index)

https://github.com/pandas-dev/pandas/blob/09f3bf8083d737610aa1001f2668425be518f8f0/pandas/core/internals/construction.py#L746-L797

https://github.com/pandas-dev/pandas/blob/09f3bf8083d737610aa1001f2668425be518f8f0/pandas/core/indexes/range.py#L379-L389

and the complete error is


Traceback (most recent call last):
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/indexes/range.py", line 351, in get_loc
    return self._range.index(new_key)
ValueError: 0 is not in range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/Downloads/try1.py", line 23, in fake_func_2
    df_working= pd.DataFrame.from_records(data=x, columns=x_cols, index=df.index) # v2
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/frame.py", line 1855, in from_records
    arrays, arr_columns = to_arrays(data, columns, coerce_float=coerce_float)
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 527, in to_arrays
    if isinstance(data[0], (list, tuple)):
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/series.py", line 853, in __getitem__
    return self._get_value(key)
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/series.py", line 961, in _get_value
    loc = self.index.get_loc(label)
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/indexes/range.py", line 353, in get_loc
    raise KeyError(key) from err
KeyError: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/indexes/range.py", line 351, in get_loc
    return self._range.index(new_key)
ValueError: 0 is not in range

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/frame.py", line 1855, in from_records
    arrays, arr_columns = to_arrays(data, columns, coerce_float=coerce_float)
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 527, in to_arrays
    if isinstance(data[0], (list, tuple)):
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/series.py", line 853, in __getitem__
    return self._get_value(key)
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/series.py", line 961, in _get_value
    loc = self.index.get_loc(label)
  File "/home/user/.local/lib/python3.9/site-packages/pandas/core/indexes/range.py", line 353, in get_loc
    raise KeyError(key) from err
KeyError: 0

I edited "/home/user/.local/lib/python3.9/site-packages/pandas/core/indexes/range.py" to add print('RANGE: ',self._range,'NEW_KEY: ',new_key) before return self._range.index(new_key). It looks like self._range is broken; the processes give me these ranges:

# Executing on multiprocessing, so order changes
RANGE: range(2, 3) NEW_KEY: 0 #ValueError: 0 is not in range
RANGE: range(3, 4) NEW_KEY: 0 #ValueError: 0 is not in range
RANGE: range(0, 2) NEW_KEY: 0
RANGE: range(0, 2) NEW_KEY: 0

With the other method, pd.DataFrame, pandas doesn’t run the self._range code.

It looks like iterating over the index after splitting the DataFrame breaks the code. A PR is needed to handle this in a multiprocessing-safe way that fixes the ranges; some ifs would need to be added in /home/user/.local/lib/python3.9/site-packages/pandas/core/internals/construction.py
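The ranges printed above are consistent with each chunk simply keeping its original row labels after the split, which is what the following sketch tries to show. `split_by_position` is a hypothetical helper written for this example; it splits by row position with `iloc` rather than calling np.array_split on the DataFrame directly, and the 20-row frame mirrors the one in the issue.

```python
import numpy as np
import pandas as pd


def split_by_position(df, n_chunks):
    """Hypothetical helper: cut df into n_chunks row blocks, keeping row labels."""
    positions = np.array_split(np.arange(len(df)), n_chunks)
    return [df.iloc[pos] for pos in positions]


df = pd.DataFrame(np.random.rand(20, 3), columns=["foo", "bar", "baz"])
chunks = split_by_position(df, 3)

# Only the first chunk contains the label 0; the others start at later labels.
for chunk in chunks:
    print(len(chunk), chunk.index.min(), chunk.index.max(), 0 in chunk.index)

# The labels of all chunks together are exactly the labels of the original frame.
all_labels = [label for chunk in chunks for label in chunk.index]
```

Seen this way, a lookup of label 0 can only succeed in the worker that received the first chunk, which matches the one-core run succeeding and the three-core run failing.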
