Importing pandas changes return value of sys.getsizeof?
Code Sample
a = "a"
import sys
print(sys.getsizeof(a))
import pandas
print(sys.getsizeof(a))
Problem description
When I run the above code sample, I get output like this:
Python 3.5.5 |Anaconda, Inc.| (default, May 13 2018, 21:12:35)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> a = "a"
>>> import sys
>>> print(sys.getsizeof(a))
50
>>> import pandas
>>> print(sys.getsizeof(a))
58
It looks like the import pandas statement changes something about the built-in strings or the sys.getsizeof function to make it think single-character strings are 8 bytes longer. Longer strings (e.g. "aa" and "aaa" etc.) do not appear to be affected - we’ve tried with strings up to 100 characters long.
I’ve seen this with a small mix of Python and pandas versions, e.g. Python 3.4.3/Pandas 0.22.0 and Python 3.7.0/Pandas 0.23.1.
Running on 64-bit Ubuntu 16.04 LTS with all latest updates.
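For context on why an 8-byte jump looks odd: on CPython 3.3 and later, a pure-ASCII str is stored as a fixed header plus one byte per character, so sys.getsizeof normally grows by exactly 1 per extra character. The sketch below only measures that growth; it is a CPython-specific illustration, not part of the original report, and the names lengths, sizes and per_char are made up for it.

```python
import sys

# Measure sys.getsizeof for ASCII strings of increasing length.
# Multi-character strings are used on purpose, since the report says
# only the single-character string was affected.
lengths = [2, 3, 4, 5, 100]
sizes = {n: sys.getsizeof("a" * n) for n in lengths}

# Bytes added per extra character between the shortest and longest sample.
per_char = (sizes[100] - sizes[2]) / (100 - 2)

for n in lengths:
    print(n, sizes[n])
print("bytes per extra ASCII character:", per_char)
```

If the per-character growth is 1, a jump of 8 bytes for a one-character string cannot come from the string's contents and has to come from something extra attached to the object.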
Expected Output
sys.getsizeof(a) should return the same value before and after pandas is imported.
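The expectation above can be written as a small check. This is a minimal sketch, not something from the issue: the helper names sizes and changed_by_import are hypothetical, and it has to run in a fresh interpreter, because a module that is already imported will not be executed again.

```python
import importlib
import sys


def sizes(samples):
    """Return {string: sys.getsizeof(string)} for each sample string."""
    return {s: sys.getsizeof(s) for s in samples}


def changed_by_import(module_name, samples):
    """Import module_name and return the samples whose reported size changed.

    Returns a dict {string: (size_before, size_after)}, which is empty when
    nothing changed, or None when the module is not installed.
    """
    before = sizes(samples)
    try:
        importlib.import_module(module_name)
    except ImportError:
        return None  # module not available; nothing to compare
    after = sizes(samples)
    return {s: (before[s], after[s]) for s in samples if before[s] != after[s]}
```

Calling changed_by_import("pandas", ["a", "aa", "aaa", "a" * 100]) on an affected setup would be expected to report only "a"; an empty dict means the sizes were stable across the import.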
Output of pd.show_versions()
This output is from the same interpreter that I used to get my Problem description output above.
INSTALLED VERSIONS
commit: None
python: 3.5.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-130-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.23.1
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.14.5
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Issue Analytics
- Created 5 years ago
- Comments: 8 (5 by maintainers)
Top GitHub Comments
I’ve had a little spare time and written a trace to investigate further. I think @TomAugspurger meant to link to https://docs.python.org/3/library/sys.html#sys.settrace - here’s my trace script that creates a breakpoint when the value of sys.getsizeof("a") changes:

This stopped at the following place:

which is line 312 of the pandas/util/testing.py module. Looking at lines 310 and 311 I discovered the true culprit: numpy! It looks like whatever single-letter string is the first item of the list given to np.array gets the “magic” extra bytes. Here’s a REPL session that demonstrates the issue with “b” instead of “a”:

It always appeared to happen with "a" because of the numpy array RANDS_CHARS that pandas defines, which has "a" at the start.

Looks like this is no longer an issue on master. Suppose could use a test.
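The trace script mentioned in the comment above is not reproduced in this copy. As a rough idea of the approach, here is a minimal sketch built on sys.settrace. It is not the commenter's script: find_first_change, probe and target are hypothetical names, and it records the location instead of dropping into a debugger.

```python
import sys


def find_first_change(probe, target):
    """Run target() under a trace function and report where probe() first changes.

    probe  -- zero-argument callable whose return value is being watched
    target -- zero-argument callable to execute while tracing
    Returns (filename, line_number, old_value, new_value), or None if the
    value never changed.
    """
    baseline = probe()
    found = []  # filled once, on the first observed change

    def tracer(frame, event, arg):
        if found:
            return None  # change already located; ignore further events
        if event in ("line", "return"):
            current = probe()
            if current != baseline:
                # "line" events fire before the line runs, so the change was
                # caused by code executed just before this location.
                found.append(
                    (frame.f_code.co_filename, frame.f_lineno, baseline, current)
                )
                return None
        return tracer

    sys.settrace(tracer)
    try:
        target()
    finally:
        sys.settrace(None)  # always remove the trace function
    return found[0] if found else None


# Small self-contained demonstration with a list instead of pandas.
state = []


def demo_target():
    x = 1
    state.append(x)  # this line changes the probed value
    return x


print(find_first_change(lambda: len(state), demo_target))
```

For the case in this issue, the probe would be lambda: sys.getsizeof("a") and the target a function whose body is import pandas, run in a fresh interpreter. Because the reported line is the one about to execute, the cause should be looked for on the lines just before it, which matches the comment's remark about looking at lines 310 and 311 after stopping at 312. Tracing every line of a large import is slow, so this is a diagnostic aid only.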