DataFrame.copy(), at least, should be threadsafe
DataFrame.copy() should happen atomically, i.e. be threadsafe: it should produce a consistent DataFrame even if the call to .copy() is made while another thread is deleting entries from the DataFrame, or if another thread calls a deletion method while .copy() is still working. In other words, I guess .copy() should acquire a lock that prevents mutation during the copy. That is, the following code, which crashes in 0.7.3, should succeed:
import pandas
import threading

df = pandas.DataFrame()

def mutateDf(df):
    # Repeatedly add and delete column 0 while the reader is copying.
    while True:
        df[0] = pandas.Series([1,2,3])
        del df[0]

def readDf(df):
    # Copy the frame and read from the copy; only the copy is touched after .copy().
    while True:
        dfCopy = df.copy()
        if 0 in dfCopy and 1 in dfCopy[0]:
            a = dfCopy[0][1]

t1 = threading.Thread(target=mutateDf, args=(df,))
t2 = threading.Thread(target=readDf, args=(df,))
t1.start()
t2.start()
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "<ipython-input-5-8aef72c7f1b4>", line 4, in readDf
    if 0 in dfCopy and 1 in dfCopy[0]:
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.7.3-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 1458, in __getitem__
    return self._get_item_cache(key)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.7.3-py2.7-linux-x86_64.egg/pandas/core/generic.py", line 294, in _get_item_cache
    values = self._data.get(item)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.7.3-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 625, in get
    _, block = self._find_block(item)
TypeError: 'NoneType' object is not iterable
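Until .copy() takes such a lock itself, the same effect can be approximated in user code. The following is a minimal sketch, assuming every thread reaches the frame only through a wrapper; LockedFrame and its method names are hypothetical and not part of pandas, and the loops are bounded so the script terminates.

```python
import threading

import pandas as pd


class LockedFrame:
    """Hypothetical wrapper (not a pandas API): one lock guards every access."""

    def __init__(self, df):
        self._df = df
        self._lock = threading.Lock()

    def set_column(self, key, values):
        with self._lock:
            self._df[key] = values

    def drop_column(self, key):
        with self._lock:
            del self._df[key]

    def snapshot(self):
        # The copy is taken while holding the lock, so no mutation can interleave.
        with self._lock:
            return self._df.copy()


def mutate(frame, rounds):
    for _ in range(rounds):
        frame.set_column(0, pd.Series([1, 2, 3]))
        frame.drop_column(0)


def read(frame, rounds, errors):
    for _ in range(rounds):
        try:
            snap = frame.snapshot()
            # Only the private snapshot is inspected after the lock is released.
            if 0 in snap and 1 in snap[0]:
                _ = snap[0][1]
        except Exception as exc:  # collect failures instead of killing the thread
            errors.append(exc)


frame = LockedFrame(pd.DataFrame())
errors = []
threads = [
    threading.Thread(target=mutate, args=(frame, 200)),
    threading.Thread(target=read, args=(frame, 200, errors)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

This only helps if no code path touches the underlying DataFrame without going through the wrapper; a single unguarded reference brings the race back.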
Issue Analytics
- State:
- Created 11 years ago
- Comments: 13 (5 by maintainers)
That is true for the Python interpreter only. Generally, threads of a single process can use multiple cores, and vectorized code (called from Python) can make use of multiple cores.
In my use case, I was hoping to use Pandas to hold a large static datatable
In my case, Python was the database. I had static data and needed to aggregate and process hundreds of thousands of records for each request. SQL doesn’t provide an efficient query language for large matrix operations (our queries took under a second in memory, but some minutes to run in SQL, even with careful indexing). This case is not unusual: people use pandas or numpy to do what is too slow or cumbersome in SQL.
So, CPU bound, not IO bound, since there was no external database to wait on.
To resolve this problem, we switched to numpy for these queries, since pandas didn’t support running multiple queries safely at the same time.
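For illustration only, here is a minimal sketch of what read-only numpy queries from several threads might look like. It is not the commenter's actual code: the table contents, the column_sum helper, and the thread pool size are all assumptions. Marking the array read-only makes accidental mutation raise an error instead of racing with readers.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

# Hypothetical static table: 4 records with 3 numeric fields each.
table = np.arange(12, dtype=np.float64).reshape(4, 3)
table.setflags(write=False)  # any later write attempt raises ValueError


def column_sum(col):
    # Pure read of shared data; nothing is mutated, so no lock is needed.
    return float(table[:, col].sum())


# Each "request" aggregates one column on a worker thread.
with ThreadPoolExecutor(max_workers=3) as pool:
    sums = list(pool.map(column_sum, range(table.shape[1])))
```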
Thanks for your thoughts, which help me dive deeper into Pandas thinking. 😉
I am aware of Python’s “GIL problem”. But in some cases it can be used as an advantage. E.g. with non-thread-safe Pandas, I have to duplicate the data across processes, and then I no longer have to think about race conditions.
But am I right to say that threads always run on the same CPU core, no matter which language (C, Python) they come from?
It is not “parallel” when running on the same core, IMHO.
That is a nice IO use case. Web requests are IO-bound because the thread spends most of its time waiting for the data.