Perf. improvement in 'write()': Numpy slices are memoryless while pandas ones are not.

Hi, I just noticed this this evening:

import sys
import pandas as pd
import numpy as np

# Numpy
#ar = np.arange(200_000_000)
ar = np.arange(1_000_000)

sys.getsizeof(ar)
#Out[5]: 8000104
sys.getsizeof(ar[:])
#Out[6]: 104

# Pandas
df = pd.DataFrame({'a':[ar]})

sys.getsizeof(df)
#Out[8]: 8000256
sys.getsizeof(df[:])
#Out[9]: 8000256

We do such slices when splitting new data into row groups for writing in write(). After reading some SO answers, it would seem that numpy ‘simple’ slices are views that point directly at the memory of the original array. I could not find information about the pandas case, but judging by the sizes above it seems a ‘deep copy’ is made, meaning a new DataFrame is created with its own data. If so, this takes some extra memory, plus the time to fill the new memory slots (well, the time to make a copy), I guess.
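One way to check the numpy side of this rather than infer it from `sys.getsizeof` is to inspect the slice itself. This is a minimal sketch; the variable name `view` is only illustrative. It relies on `ndarray.base`, `ndarray.flags.owndata` and `np.shares_memory`, which report whether an array borrows its buffer from another array.

```python
import sys
import numpy as np

ar = np.arange(1_000_000)
view = ar[:]  # basic slice: a new array object over the same buffer

# The slice does not own its data; it borrows the buffer of `ar`.
print(view.base is ar)             # the array the view was taken from
print(view.flags.owndata)          # False for a view
print(np.shares_memory(ar, view))  # True when the buffers overlap

# sys.getsizeof only counts the data buffer for arrays that own it,
# which is why the slice looks tiny even though it spans every element.
print(sys.getsizeof(ar), sys.getsizeof(view), view.nbytes)

# Writing through the view changes the original array.
view[0] = -1
print(ar[0])
```

So the small `sys.getsizeof` value for `ar[:]` reflects that the slice is a view, not that the data became smaller.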

A perf. improvement would be to switch current pandas slices to numpy slices.
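To make the proposal concrete, here is a rough sketch of what splitting by numpy slices could look like. It is not the actual `write()` code: the helper `column_slices` and its `offsets` argument are hypothetical names introduced only for illustration. It also assumes plain numeric columns, for which `Series.to_numpy()` can normally return the underlying array without copying; other dtypes may behave differently.

```python
import numpy as np
import pandas as pd

def column_slices(df, offsets):
    """Hypothetical helper: yield one {column name: ndarray slice} dict per row group.

    `offsets` holds the starting row of each row group.
    """
    bounds = list(offsets) + [len(df)]
    # Pull out each column's array once, then take basic slices of it.
    columns = {name: df[name].to_numpy() for name in df.columns}
    for start, stop in zip(bounds[:-1], bounds[1:]):
        yield {name: values[start:stop] for name, values in columns.items()}

df = pd.DataFrame({"a": np.arange(10), "b": np.arange(10.0)})
groups = list(column_slices(df, [0, 4, 8]))

for group in groups:
    print({name: values.tolist() for name, values in group.items()})
```

Each yielded array is a basic slice of the column array, so no per-row-group DataFrame, index, or block manager is built.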

I will not manage this in current PR write_row_groups() (its scope is already big enough I think), but can have a look in a next PR. Bests,


Issue Analytics

Created 2 years ago
Comments: 6 (3 by maintainers)

Top GitHub Comments

martindurant commented, Dec 9, 2021

Pandas definitely does some extra checking and extra work (making index instances and block managers) compared to numpy, but if the memory is not copied, then it shouldn’t be significant. In many cases, perhaps the most common, a single dataframe becomes a single row-group.
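Whether the memory is copied can be checked directly instead of being read off `sys.getsizeof`. This is a minimal sketch, assuming a single numeric column built from the array itself (not wrapped in a list as in the snippet from the issue). As far as I know, `sys.getsizeof` on a DataFrame goes through `memory_usage(deep=True)`, so it reports the size of the data the frame refers to whether or not that data is shared with another frame.

```python
import sys
import numpy as np
import pandas as pd

ar = np.arange(1_000_000)
df = pd.DataFrame({"a": ar})
sliced = df[:]  # row slice, as used when splitting into row groups

# Compare the buffers behind the column before and after slicing.
original_values = df["a"].to_numpy()
sliced_values = sliced["a"].to_numpy()
print(np.shares_memory(original_values, sliced_values))

# The reported size still covers the full column, shared or not,
# so equal sizes do not by themselves indicate a copy.
print(sys.getsizeof(df), sys.getsizeof(sliced), sliced_values.nbytes)
```

If `np.shares_memory` prints `True`, the slice did not copy the column data, and the remaining cost is the extra bookkeeping described above.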

yohplala commented, Dec 12, 2021

I agree with this. Closing the ticket.
