Perf. improvement in 'write()': Numpy slices are memoryless while pandas ones are not.
Hi, I just noticed this this evening:
import sys
import pandas as pd
import numpy as np
# Numpy
#ar = np.arange(200_000_000)
ar = np.arange(1_000_000)
sys.getsizeof(ar)
#Out[5]: 8000104
sys.getsizeof(ar[:])
#Out[6]: 104
# Pandas
df = pd.DataFrame({'a':[ar]})
sys.getsizeof(df)
#Out[8]: 8000256
sys.getsizeof(df[:])
#Out[9]: 8000256
We make such slices when splitting new data into row groups for writing in write().
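For context, here is a minimal sketch of that kind of row-group splitting. The helper name `iter_row_groups` and the parameter `row_group_size` are hypothetical, chosen only for illustration; this is not the actual code in write().

```python
import numpy as np
import pandas as pd


def iter_row_groups(df, row_group_size):
    """Yield consecutive row slices of `df`, each at most `row_group_size` rows.

    Hypothetical helper for illustration only.
    """
    for start in range(0, len(df), row_group_size):
        # Positional slice; the last chunk may be shorter than row_group_size.
        yield df.iloc[start:start + row_group_size]


df = pd.DataFrame({"a": np.arange(10), "b": np.arange(10) * 2.0})
chunks = list(iter_row_groups(df, row_group_size=4))
chunk_lengths = [len(chunk) for chunk in chunks]
print(chunk_lengths)
```

Each chunk is produced by a plain positional slice, which is the operation whose memory cost is in question here.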
After reading some SO answers, it seems that numpy ‘simple’ slices are views that point directly into the memory of the original array.
I could not find information about the pandas case, but it seems a ‘deep copy’ is made, meaning that a new DataFrame is created.
This takes some more memory, and some time to fill the new memory slots, I guess (that is, the time to make the copy).
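One way to check this rather than guess is `np.shares_memory`, which tests whether two arrays overlap in memory. As far as I understand, `sys.getsizeof` cannot settle the question for pandas: for a DataFrame it reports `memory_usage(deep=True)`, which counts the bytes the frame refers to whether or not it owns them, whereas for a numpy view it counts only the small array header. The sketch below assumes a plain integer column (`{"a": ar}` rather than `{'a':[ar]}` as above) so that the underlying buffer can be compared directly.

```python
import numpy as np
import pandas as pd

ar = np.arange(1_000_000)

# Numpy: does the full slice point into the same buffer as the original?
numpy_slice_shares = bool(np.shares_memory(ar, ar[:]))

# Pandas: compare the buffer behind column "a" before and after slicing.
# Assumption: a single integer column, so to_numpy() does not need to convert.
df = pd.DataFrame({"a": ar})
sliced = df[:]
pandas_slice_shares = bool(
    np.shares_memory(df["a"].to_numpy(), sliced["a"].to_numpy())
)

print("numpy slice shares memory: ", numpy_slice_shares)
print("pandas slice shares memory:", pandas_slice_shares)
```

If the second check reports shared memory, the identical `sys.getsizeof` values above do not indicate a copy; they only reflect how pandas reports its size.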
A perf. improvement would be to switch current pandas slices to numpy slices.
I will not handle this in the current PR, write_row_groups() (its scope is already big enough, I think), but I can have a look in a following PR.
Bests,
Issue Analytics
- State:
- Created 2 years ago
- Comments:6 (3 by maintainers)
Pandas definitely does some extra checking and extra work (making index instances and block managers) compared to numpy, but if the memory is not copied, then it shouldn’t be significant. In many cases, perhaps the most common, a single dataframe becomes a single row-group.
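To get a rough feel for that per-slice overhead, a small timing sketch such as the one below can help. The array size, slice bounds, and repetition count are arbitrary choices for illustration, and the numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

ar = np.arange(1_000_000)
df = pd.DataFrame({"a": ar})

n = 1_000  # number of repetitions per measurement

# Time only the creation of the slice object, not any work on its contents.
numpy_time = timeit.timeit(lambda: ar[100:200_000], number=n)
pandas_time = timeit.timeit(lambda: df.iloc[100:200_000], number=n)

print(f"numpy slice:  {numpy_time / n * 1e6:.2f} us per slice")
print(f"pandas slice: {pandas_time / n * 1e6:.2f} us per slice")
```

Since a write typically produces only a handful of row groups, the total slicing cost is this per-slice figure multiplied by a small number.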
I am aligned with this. Closing the ticket.