Importing pandas changes return value of sys.getsizeof?

Code Sample

a = "a"
import sys
print(sys.getsizeof(a))
import pandas
print(sys.getsizeof(a))

Problem description

When I run the above code sample, I get output like this:

Python 3.5.5 |Anaconda, Inc.| (default, May 13 2018, 21:12:35) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> a = "a"
>>> import sys
>>> print(sys.getsizeof(a))
50
>>> import pandas
>>> print(sys.getsizeof(a))
58

It looks like the import pandas statement changes something about the built-in strings or the sys.getsizeof function, so that single-character strings are reported as 8 bytes larger.

Longer strings (e.g. "aa" and "aaa") do not appear to be affected; we’ve tried with strings up to 100 characters long.

I’ve seen this with a small mix of Python and pandas versions, e.g. Python 3.4.3 with pandas 0.22.0 and Python 3.7.0 with pandas 0.23.1.

Running on 64-bit Ubuntu 16.04 LTS with all latest updates.

Expected Output

sys.getsizeof(a)

should return the same value before and after pandas is imported.

Output of `pd.show_versions()`

This output is from the same interpreter that I used to get my Problem description output above.

>>> pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-130-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.23.1
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.14.5
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

State:
Created 5 years ago
Comments: 8 (5 by maintainers)

Top GitHub Comments

danieljacobs1 commented, Jul 31, 2018

I’ve had a little spare time and written a trace to investigate further. I think @TomAugspurger meant to link to https://docs.python.org/3/library/sys.html#sys.settrace - here’s my trace script that creates a breakpoint when the value of sys.getsizeof("a") changes:

#! python
import sys


def stop(frame, event, arg):
    """ Stop when sys.getsizeof("a") == 58 and the event is a call.
    """
    if event == "call" and sys.getsizeof("a") == 58:
        import pdb
        pdb.set_trace()

    return stop


def main():
    """ Print sizes of the string "a" before and after importing Pandas.
    """
    sys.settrace(stop)
    a = "a"
    print("Before importing pandas:", sys.getsizeof(a))
    import pandas
    print("After importing pandas:", sys.getsizeof(a))


if __name__ == "__main__":
    main()
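
For runs where dropping into pdb is inconvenient, the same idea can be sketched non-interactively. The helper below is a hypothetical variant, not part of the original script: `find_first_change`, `action` and `probe` are illustrative names. It runs a callable under `sys.settrace` and records the first trace event at which a probe value differs from its starting value.

```python
import sys


def find_first_change(action, probe):
    """Run action() under a trace hook and report where probe() first changes.

    Returns (filename, lineno, event) for the first trace event at which
    probe() no longer equals its initial value, or None if it never changes.
    Both names are illustrative; this is a sketch, not a pandas or numpy API.
    """
    baseline = probe()
    found = []

    def tracer(frame, event, arg):
        # CPython suspends tracing while the trace function itself runs,
        # so calling probe() here does not recurse into tracer.
        if not found and probe() != baseline:
            found.append((frame.f_code.co_filename, frame.f_lineno, event))
            sys.settrace(None)  # stop tracing once the change is located
            return None
        return tracer

    sys.settrace(tracer)
    try:
        action()
    finally:
        sys.settrace(None)  # always remove the hook, even on errors
    return found[0] if found else None
```

It could be called as `find_first_change(lambda: __import__("pandas"), lambda: sys.getsizeof("a"))`, assuming pandas is installed; tracing a large import this way is slow. The reported location is the first event after the change, so the statement responsible is usually the one executed just before it.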

My trace script stopped at the following place.

307         return rand(*size) <= p
308  
309  
310     RANDS_CHARS = np.array(list(string.ascii_letters + string.digits),
311                            dtype=(np.str_, 1))
312  -> RANDU_CHARS = np.array(list(u("").join(map(unichr, lrange(1488, 1488 + 26))) +
313                                 string.digits), dtype=(np.unicode_, 1))
314  
315  
316     def rands_array(nchars, size, dtype='O'):
317         """Generate an array of byte strings."""
(Pdb) c

That is line 312 of the pandas/util/testing.py module. Looking at lines 310 and 311, I discovered the true culprit: numpy! It looks like whichever single-letter string is the first item of the list given to np.array gets the “magic” extra bytes. Here’s a REPL session that demonstrates the issue with “b” instead of “a”:

Python 3.5.5 |Anaconda, Inc.| (default, May 13 2018, 21:12:35)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys, string
>>> sys.getsizeof('b')
50
>>> sys.getsizeof('a')
50
>>> import numpy as np
>>> sys.getsizeof('a')
50
>>> sys.getsizeof('b')
50
>>> np.array(['b'], dtype=(np.str_, 1))
array(['b'], dtype='<U1')
>>> sys.getsizeof('a')
50
>>> sys.getsizeof('b')
58
>>> sys.getsizeof('c')
50
>>>

It always appeared to happen with "a" because of the numpy array RANDS_CHARS that pandas defines which has "a" at the start.
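
To see which single-character strings are affected by a given operation, the manual REPL check can be turned into a small helper. This is a minimal sketch under stated assumptions: `snapshot` and `changed_sizes` are hypothetical names, and it relies on the behaviour observed in the session above, namely that a one-character string reached by iterating over a string is the same object that earlier code touched.

```python
import string
import sys

# Characters to probe; letters and digits mirror the RANDS_CHARS contents.
PROBES = string.ascii_letters + string.digits


def snapshot(chars=PROBES):
    """Map each single-character string to its current sys.getsizeof value."""
    return {c: sys.getsizeof(c) for c in chars}


def changed_sizes(action, chars=PROBES):
    """Run action() and return {char: (before, after)} for sizes that changed."""
    before = snapshot(chars)
    action()
    after = snapshot(chars)
    return {c: (before[c], after[c]) for c in chars if before[c] != after[c]}
```

With numpy installed, `changed_sizes(lambda: np.array(['b'], dtype=(np.str_, 1)))` would be the equivalent of the session above; whether it reports anything depends on the Python and numpy versions in use.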

mroeschke commented, Jun 20, 2021

Looks like this is no longer an issue on master. I suppose it could use a test.

In [1]: a = "a"
   ...: import sys
   ...: print(sys.getsizeof(a))
   ...: import pandas
   ...: print(sys.getsizeof(a))
50
50
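
A regression test along those lines might look roughly like the sketch below. It is not the test pandas actually uses; `sizes_around_import` and `CHECK` are hypothetical names. The check runs in a fresh interpreter via `subprocess`, on the assumption that the size change is process-wide state that an earlier import in the test session could already have triggered.

```python
import subprocess
import sys

# Script executed in a fresh interpreter; the module name arrives in sys.argv[1].
CHECK = """
import importlib, sys
before = sys.getsizeof("a")
importlib.import_module(sys.argv[1])
after = sys.getsizeof("a")
print(before, after)
"""


def sizes_around_import(module_name):
    """Return (before, after) sizes of "a" around importing module_name."""
    result = subprocess.run(
        [sys.executable, "-c", CHECK, module_name],
        capture_output=True, text=True, check=True,
    )
    # Use the last output line in case the import itself prints something.
    before, after = result.stdout.strip().splitlines()[-1].split()
    return int(before), int(after)
```

A test for this issue would then assert that the two values returned by `sizes_around_import("pandas")` are equal, assuming pandas is installed in the environment running the test.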
