Perf. improvement in 'write()': Numpy slices are memoryless while pandas ones are not.

Hi, I just realized this tonight:

import sys
import pandas as pd
import numpy as np

# Numpy
#ar = np.arange(200_000_000)
ar = np.arange(1_000_000)

sys.getsizeof(ar)
#Out[5]: 8000104
sys.getsizeof(ar[:])
#Out[6]: 104

# Pandas
df = pd.DataFrame({'a':[ar]})

sys.getsizeof(df)
#Out[8]: 8000256
sys.getsizeof(df[:])
#Out[9]: 8000256

We make such slices when splitting new data into row groups for writing in write(). After reading some SO answers, it seems that numpy ‘simple’ slices are views that point directly into the memory of the original array. I could not find information about the pandas case, but judging from the sizes above it seems a ‘deep copy’ is made, meaning a new DataFrame is created. That would take some more memory, and some time to fill the new memory slots, I guess (well, the time to make the copy).
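The sizes above may not settle whether a copy is made: sys.getsizeof on a numpy view counts only the array header because the view does not own its data, while a DataFrame reports its full memory usage whether or not the underlying buffer is shared. A minimal sketch of a more direct check with np.shares_memory, assuming a single int64 column (the variable names are illustrative and not taken from write()):

```python
import sys
import numpy as np
import pandas as pd

ar = np.arange(1_000_000)

# A basic numpy slice is a view: it shares the buffer and does not own it,
# which is why sys.getsizeof reports only the small array header for it.
view = ar[:]
numpy_shares = np.shares_memory(ar, view)
view_is_smaller = sys.getsizeof(view) < sys.getsizeof(ar)

# Assumption: a single int64 column, so to_numpy() exposes the underlying
# buffer without converting or copying it.
df = pd.DataFrame({"a": ar})
sliced = df[:]
pandas_shares = np.shares_memory(df["a"].to_numpy(), sliced["a"].to_numpy())

print(numpy_shares, view_is_smaller, pandas_shares)
```

If the last value printed is True, the row slice did not duplicate the column data, and the equal getsizeof results reflect how pandas reports size rather than a copy.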

A perf. improvement would be to switch current pandas slices to numpy slices.
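As a rough illustration of what slicing with numpy could look like, here is a sketch under stated assumptions. The helper iter_row_group_slices and its row_group_size parameter are hypothetical and are not the library's API; the sketch also assumes plain numeric columns for which to_numpy() returns a view.

```python
import numpy as np
import pandas as pd


def iter_row_group_slices(df, row_group_size):
    """Hypothetical helper: yield one {column name: ndarray view} dict per row group."""
    # Extract each column's array once, then take basic slices of it.
    columns = {name: df[name].to_numpy() for name in df.columns}
    for start in range(0, len(df), row_group_size):
        stop = min(start + row_group_size, len(df))
        yield {name: values[start:stop] for name, values in columns.items()}


df = pd.DataFrame({"a": np.arange(10), "b": np.arange(10, 20)})
chunks = list(iter_row_group_slices(df, row_group_size=4))
chunk_lengths = [len(chunk["a"]) for chunk in chunks]
print(chunk_lengths)
```

Each chunk holds basic slices of the column arrays, so no index or block manager objects are built per row group.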

I will not handle this in the current write_row_groups() PR (its scope is already big enough, I think), but I can have a look in a follow-up PR. Best,

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Dec 9, 2021

Pandas definitely does some extra checking and extra work (making index instances and block managers) compared to numpy, but if the memory is not copied, then it shouldn’t be significant. In many cases, perhaps the most common, a single dataframe becomes a single row-group.
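The per-slice overhead described here can be estimated with timeit. This is only a sketch: the column size, slice bounds, and repetition count are arbitrary choices, and the numbers will vary by machine and pandas version.

```python
import timeit
import numpy as np
import pandas as pd

ar = np.arange(1_000_000)
df = pd.DataFrame({"a": ar})

n = 1_000  # repetitions per measurement (arbitrary)

# Time a basic numpy slice against an equivalent positional row slice in pandas.
numpy_seconds = timeit.timeit(lambda: ar[1_000:2_000], number=n)
pandas_seconds = timeit.timeit(lambda: df.iloc[1_000:2_000], number=n)

print(f"numpy slice:  {numpy_seconds / n * 1e6:.2f} us per call")
print(f"pandas slice: {pandas_seconds / n * 1e6:.2f} us per call")
```

Whatever the per-call difference turns out to be, it is paid once per row group, so it only adds up when a write is split into many row groups.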

0 reactions
yohplala commented, Dec 12, 2021

I am aligned with this. Closing the ticket.
