Importing pandas changes return value of sys.getsizeof?

See original GitHub issue

Code Sample

a = "a"
import sys
print(sys.getsizeof(a))
import pandas
print(sys.getsizeof(a))

Problem description

When I run the above code sample, I get output like this:

Python 3.5.5 |Anaconda, Inc.| (default, May 13 2018, 21:12:35) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> a = "a"
>>> import sys
>>> print(sys.getsizeof(a))
50
>>> import pandas
>>> print(sys.getsizeof(a))
58

It looks like the import pandas statement changes something about the built-in strings or the sys.getsizeof function, making it report single-character strings as 8 bytes longer.

Longer strings (e.g. "aa" and "aaa") do not appear to be affected; we’ve tried strings up to 100 characters long.
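A sweep like the one described above could be sketched as follows. The helper name `changed_lengths` and its `action` parameter are hypothetical, introduced only for illustration; the sample strings are kept in a list so that the same objects are measured before and after.

```python
import sys


def changed_lengths(action, max_len=100):
    """Return the string lengths whose sys.getsizeof value changes
    after calling action().

    Hypothetical helper for illustration; action is any zero-argument
    callable, for example one that imports a module.
    """
    # Keep the sample strings alive so the same objects are measured twice.
    samples = ["a" * n for n in range(1, max_len + 1)]
    before = [sys.getsizeof(s) for s in samples]
    action()
    after = [sys.getsizeof(s) for s in samples]
    return [len(s) for s, b, a in zip(samples, before, after) if b != a]


if __name__ == "__main__":
    # A no-op action should leave every size unchanged.
    print(changed_lengths(lambda: None))
```

To mirror the report, `action` could be something like `lambda: importlib.import_module("pandas")`; whether any lengths are listed will depend on the Python, numpy and pandas versions in use.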

I’ve seen this with a small mix of Python and pandas versions, e.g. Python 3.4.3 with pandas 0.22.0 and Python 3.7.0 with pandas 0.23.1.

Running on 64-bit Ubuntu 16.04 LTS with all latest updates.

Expected Output

sys.getsizeof(a)

should return the same value before and after pandas is imported.

Output of pd.show_versions()

This output is from the same interpreter that I used to get my Problem description output above.

>>> pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-130-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.23.1
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.14.5
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
danieljacobs1 commented, Jul 31, 2018

I’ve had a little spare time and written a trace to investigate further. I think @TomAugspurger meant to link to https://docs.python.org/3/library/sys.html#sys.settrace - here’s my trace script that creates a breakpoint when the value of sys.getsizeof("a") changes:

#! python
import sys


def stop(frame, event, arg):
    """ Stop when sys.getsizeof("a") == 58 and the event is a call.
    """
    if event == "call" and sys.getsizeof("a") == 58:
        import pdb
        pdb.set_trace()

    return stop


def main():
    """ Print sizes of the string "a" before and after importing Pandas.
    """
    sys.settrace(stop)
    a = "a"
    print("Before importing pandas:", sys.getsizeof(a))
    import pandas
    print("After importing pandas:", sys.getsizeof(a))


if __name__ == "__main__":
    main()

This stopped at the following place.

307         return rand(*size) <= p
308  
309  
310     RANDS_CHARS = np.array(list(string.ascii_letters + string.digits),
311                            dtype=(np.str_, 1))
312  -> RANDU_CHARS = np.array(list(u("").join(map(unichr, lrange(1488, 1488 + 26))) +
313                                 string.digits), dtype=(np.unicode_, 1))
314  
315  
316     def rands_array(nchars, size, dtype='O'):
317         """Generate an array of byte strings."""
(Pdb) c

which is line 312 of the pandas/util/testing.py module. Looking at lines 310 and 311, I discovered the true culprit: numpy! It looks like whichever single-letter string is the first item of the list given to np.array gets the “magic” extra bytes. Here’s a REPL session that demonstrates the issue with “b” instead of “a”:

Python 3.5.5 |Anaconda, Inc.| (default, May 13 2018, 21:12:35)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys, string
>>> sys.getsizeof('b')
50
>>> sys.getsizeof('a')
50
>>> import numpy as np
>>> sys.getsizeof('a')
50
>>> sys.getsizeof('b')
50
>>> np.array(['b'], dtype=(np.str_, 1))
array(['b'], dtype='<U1')
>>> sys.getsizeof('a')
50
>>> sys.getsizeof('b')
58
>>> sys.getsizeof('c')
50
>>>
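One plausible reading of the session above is that every "b" in the process is the same object, so extra state attached to it in one place shows up wherever "b" is measured. The sketch below checks that assumption. It relies on a CPython implementation detail (single-character strings in the Latin-1 range being shared), not on a language guarantee, and the helper name is hypothetical.

```python
def is_shared_single_char(ch):
    """Return True if two independently produced copies of the
    one-character string ch are the very same object.

    Hypothetical helper; the result reflects a CPython implementation
    detail and may differ on other interpreters.
    """
    from_chr = chr(ord(ch))       # built from the code point
    from_index = (ch + "x")[0]    # built by indexing a longer string
    return from_chr is from_index


if __name__ == "__main__":
    print(is_shared_single_char("b"))
```

If the two copies are one object, then anything that enlarges it once would be visible to every later sys.getsizeof call on that character, which would fit the observation that 'a' and 'c' keep their original size.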

It always appeared to happen with "a" because of the numpy array RANDS_CHARS that pandas defines which has "a" at the start.
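The trace script in this comment hard-codes the value 58 and drops into pdb. For other interpreters, where the sizes may differ, a variant could compare against a baseline taken at start-up and simply record where the change was first noticed. This is only a sketch: `first_event_after_change`, `probe` and `target` are hypothetical names, not part of pandas or numpy.

```python
import sys


def first_event_after_change(probe, target):
    """Run target() under sys.settrace and return (filename, lineno, event)
    for the first trace event at which probe() differs from its initial
    value, or None if it never changes.

    Hypothetical helper for illustration only.
    """
    baseline = probe()
    hit = []

    def tracer(frame, event, arg):
        # Record only the first event at which the probed value differs.
        if not hit and probe() != baseline:
            hit.append((frame.f_code.co_filename, frame.f_lineno, event))
        # Stop tracing new events in this scope once a hit is recorded.
        return None if hit else tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        target()
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return hit[0] if hit else None


if __name__ == "__main__":
    state = {"value": 0}

    def demo():
        state["value"] = 1
        return state["value"]

    print(first_event_after_change(lambda: state["value"], demo))
```

Because the check runs when an event is delivered, the reported location is the first event after the change, so the responsible statement is usually the one just before it. That matches the experience above, where the debugger stopped at line 312 but the cause was on lines 310 and 311.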

0 reactions
mroeschke commented, Jun 20, 2021

Looks like this is no longer an issue on master. I suppose this could use a test.

In [1]: a = "a"
   ...: import sys
   ...: print(sys.getsizeof(a))
   ...: import pandas
   ...: print(sys.getsizeof(a))
50
50
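A regression check along these lines could run the comparison in a fresh interpreter, so that modules already imported by the test runner do not mask the effect. This is a minimal sketch and not the test pandas actually uses; `sizes_around_import` is a hypothetical helper, and the module name is passed in so the same check works for pandas, numpy or anything else.

```python
import subprocess
import sys

# Code executed in the child interpreter; {module!r} is filled in below.
_SNIPPET = """
import importlib, sys
a = "a"
before = sys.getsizeof(a)
importlib.import_module({module!r})
after = sys.getsizeof(a)
print(before, after)
"""


def sizes_around_import(module_name):
    """Return (before, after) values of sys.getsizeof("a") measured in a
    fresh interpreter around importing module_name.

    Hypothetical helper for illustration only.
    """
    result = subprocess.run(
        [sys.executable, "-c", _SNIPPET.format(module=module_name)],
        capture_output=True,
        text=True,
        check=True,
    )
    before, after = map(int, result.stdout.split())
    return before, after


if __name__ == "__main__":
    # "math" is used here only because it is always available.
    print(sizes_around_import("math"))
```

A test would then assert that the two values are equal for the module under scrutiny, for example `before, after = sizes_around_import("pandas")` followed by `assert before == after`.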