DataFrame.copy(), at least, should be threadsafe

DataFrame.copy() should be atomic and threadsafe: it should produce a consistent DataFrame even if another thread is deleting entries when .copy() is called, or calls a deletion method while the copy is in progress. (In other words, I guess .copy() should acquire a lock that prevents mutation during the copy.) That is, the following code, which crashes in 0.7.3, should succeed:


import pandas
import threading

df = pandas.DataFrame()

def mutateDf(df):
    while True:
        df[0] = pandas.Series([1,2,3])
        del df[0]

def readDf(df):
    while True:
        dfCopy = df.copy()
        if 0 in dfCopy and 1 in dfCopy[0]:
            a = dfCopy[0][1]

t1 = threading.Thread(target=mutateDf, args=(df,))
t2 = threading.Thread(target=readDf, args=(df,))

t1.start()
t2.start()

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "<ipython-input-5-8aef72c7f1b4>", line 4, in readDf
    if 0 in dfCopy and 1 in dfCopy[0]:
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.7.3-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 1458, in __getitem__
    return self._get_item_cache(key)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.7.3-py2.7-linux-x86_64.egg/pandas/core/generic.py", line 294, in _get_item_cache
    values = self._data.get(item)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.7.3-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 625, in get
    _, block = self._find_block(item)
TypeError: 'NoneType' object is not iterable
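
Until .copy() takes such a lock itself, the same effect can be approximated on the caller side. The following is only a sketch under that assumption: `LockedFrame` and its methods are hypothetical names invented for this example, not part of pandas, and it only helps if every thread goes through the wrapper instead of touching the DataFrame directly.

```python
import threading

import pandas as pd


class LockedFrame:
    """Hypothetical wrapper (not a pandas API): one lock guards all access."""

    def __init__(self, df):
        self._df = df
        self._lock = threading.Lock()

    def set_column(self, key, values):
        with self._lock:
            self._df[key] = values

    def delete_column(self, key):
        with self._lock:
            del self._df[key]

    def snapshot(self):
        # No writer can run while the lock is held, so the copy is consistent.
        with self._lock:
            return self._df.copy()


def mutate(frame, rounds):
    # Same add/delete pattern as the report, but bounded so it terminates.
    for _ in range(rounds):
        frame.set_column(0, pd.Series([1, 2, 3]))
        frame.delete_column(0)


def read(frame, rounds, errors):
    for _ in range(rounds):
        try:
            snap = frame.snapshot()
            if 0 in snap and 1 in snap[0]:
                _ = snap[0][1]
        except Exception as exc:  # record any failure instead of crashing
            errors.append(exc)


frame = LockedFrame(pd.DataFrame())
errors = []
writer = threading.Thread(target=mutate, args=(frame, 200))
reader = threading.Thread(target=read, args=(frame, 200, errors))
writer.start()
reader.start()
writer.join()
reader.join()
```

The reader only ever works on its private snapshot, so the lock is held just for the duration of the copy rather than for the whole query.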

Issue Analytics

  • State: open
  • Created 11 years ago
  • Comments: 13 (5 by maintainers)

Top GitHub Comments

2 reactions
stuz5000 commented, Aug 11, 2021

It is not “parallel” when running on the same Core - IMHO.

That is true for the Python interpreter only. Generally, threads of a single process can use multiple cores, and vectorized code (called from Python) can make use of multiple cores.

In my use case, I was hoping to use Pandas to hold a large static datatable (~8Gb) to answer optimized web requests

That is a nice IO use case. Web requests are IO because the thread has to wait a lot of time for the data.

In my case, Python was the database. I had static data and needed to aggregate and process some hundreds of thousands of records for each request. SQL doesn’t provide an efficient query language for large matrix operations (our queries took under a second in memory, but some minutes to run in SQL, even with careful indexing). This case is not unusual: using pandas or numpy to do what is too slow or cumbersome in SQL.

So, CPU bound, not IO bound, since there was no external database to wait on.

To resolve this problem, we switched to numpy for these queries, since pandas didn’t support running multiple queries safely.
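
As a rough illustration of that kind of setup (not the commenter's actual code), here is a minimal sketch that assumes the static table fits in a single NumPy array. The names `table` and `query` and the array shape are made up for the example. Marking the array read-only means an accidental write raises an error instead of silently racing with the readers.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# Stand-in for the static data table; the shape is arbitrary for this sketch.
rng = np.random.default_rng(0)
table = rng.random((10_000, 8))
table.setflags(write=False)  # any attempted write now raises ValueError


def query(col):
    # Read-only aggregation over one column of the shared array.
    return float(table[:, col].sum())


# Several "requests" served concurrently from the same in-memory array.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(query, range(table.shape[1])))

# The same queries run one after another, for comparison.
sequential = [query(c) for c in range(table.shape[1])]
```

Because no thread ever mutates `table`, the concurrent queries need no locking.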

0 reactions
buhtz commented, Aug 11, 2021

Thanks for your thoughts which help to dive more into Panda-thinking. 😉

I am aware of Python’s “GIL problem”. But in some cases it can be used as an advantage. E.g. in the context of non-thread-safe Pandas, I have to duplicate the data across processes and then do not have to think about race conditions anymore.

But am I right to say that threads always run on the same CPU core, no matter which language (C, Python) they come from?

There are lots of reasons to want threading that are unrelated to IO. In fact, in most languages except Python threading is the first choice for parallelism.

It is not “parallel” when running on the same Core - IMHO.

In my use case, I was hoping to use Pandas to hold a large static datatable (~8Gb) to answer optimized web requests

That is a nice IO use case. Web requests are IO because the thread has to wait a lot of time for the data.

Top Results From Across the Web

python pandas dataframe thread safe? - Stack Overflow
No, pandas is not thread safe. And its not thread safe in surprising ways. Can I delete from pandas dataframe while another thread...

pandas.DataFrame.copy — pandas 1.5.2 documentation
Since Index is immutable, the underlying data can be safely shared and a copy is not needed. Since pandas is not thread safe,...

1.8 Thread-safety — Pandas Doc - GitHub Pages
As of pandas 0.11, pandas is not 100% thread safe. The known issues relate to the DataFrame.copy method. If you are doing a...

The state of multiple threading in DataFrames.jl | juliabloggers ...
DataFrames.jl we will soon merge the #2823 PR that hopefully ... julia> Threads.nthreads() 1 julia> using DataFrames julia> using ...

Concurrent Programming Fundamentals— Thread Safety
thread-safety or thread-safe code in Java refers to code that can safely be ... If multiple thread call getCount() approximately same time each...