can't read large stata file

I am trying to read a panel dataset of Russian individuals in Stata format. The dataset can be freely obtained from the RLMS site.

import pandas as pd
ind_dta = pd.read_stata('USER_RLMS-HSE_IND_1994_2017_v2_eng.dta')

This results in a memory error, which seems strange: the machine has 16 GB of memory and the file is less than 4 GB.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 4.0.0
pip: 18.1
setuptools: 40.6.2
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Top GitHub Comments

bashtage commented, Mar 19, 2019

I have taken a look and the major issue is in replacing missing values. This data set has many columns with missing values: ~2600 out of 2700. These are mostly integer columns, often byte (1 byte), which don’t require much storage. Converting these requires casting the values to doubles, which requires 8 bytes/entry. This effectively blows up the dataset by a factor of 4ish (some columns have larger types), which makes it impractically big. I suppose the correct solution would be to use an extension type that supports the correct bit width and a missing value. This needs the extension type API to stabilize.
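
To make the size argument concrete, here is a minimal sketch (not taken from the pandas source) that compares the storage of a 1-byte integer column with the same column cast to doubles. The row count is made up for illustration. A pure byte column grows by a factor of 8; the overall factor quoted above is smaller because some columns already use wider types.

```python
import numpy as np

# Hypothetical row count, chosen only for illustration.
n_rows = 1_000_000

# A Stata "byte" column maps to a 1-byte integer.
as_bytes = np.zeros(n_rows, dtype=np.int8)

# Replacing missing values with NaN forces a cast to float64 (8 bytes/entry).
as_doubles = as_bytes.astype(np.float64)

ratio = as_doubles.nbytes // as_bytes.nbytes
print(as_bytes.nbytes, as_doubles.nbytes, ratio)  # 1000000 8000000 8
```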

A side problem that is probably worth fixing is that the conversion of missing values is very slow. I did a quick hack that reduced the conversion time by a factor of about 1000.

For now, you can use the lower-level StataReader and not convert missing values (you will need to handle them yourself). That will get you past at least one problem.
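
One possible way to handle missing values yourself, sketched under assumptions: it supposes a byte column whose raw integer codes are still present, that Stata reserves 101 to 127 as missing-value codes for the byte type, and that a pandas version with nullable integer dtypes (0.24 or later, newer than the 0.23.4 in the report) is available. The sample values are made up. The nullable `Int8` dtype is the kind of extension type mentioned above: it keeps the 1-byte width and adds a separate mask for missing entries.

```python
import numpy as np
import pandas as pd

# Assumption: Stata's byte type holds valid data in -127..100 and uses
# 101..127 as missing-value codes.
STATA_BYTE_MAX_VALID = 100

# Made-up raw codes, as they might look if missing values were left unconverted.
raw = np.array([1, 2, 101, 5, 127], dtype=np.int8)

# Flag the missing codes and build a nullable 1-byte integer column from
# the values plus the mask, instead of casting to float64.
is_missing = raw > STATA_BYTE_MAX_VALID
nullable = pd.Series(pd.arrays.IntegerArray(raw, is_missing))

print(nullable.dtype)         # Int8
print(nullable.isna().sum())  # 2
```

This costs roughly 2 bytes per entry (value plus mask) rather than 8.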

bashtage commented, Mar 19, 2019

The other issue is that the labels are not unique. That is, 2 values in Stata are getting the same label. Pandas categoricals don’t support this. A workaround:

from pandas.io.stata import StataReader
file_name = r'C:\temp\USER_RLMS-HSE_IND_1994_2017_v2_eng.dta'
sr = StataReader(file_name, convert_missing=False, chunksize=1000, convert_categoricals=False)
labels = sr.value_labels()  # To use later
for block in sr:
    temp = block
    break

You will then have to apply labels yourself, if you need them.
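
A minimal sketch of applying labels by hand, assuming you have a mapping from codes to label strings for one column. The mapping and the codes below are made up for illustration and are not taken from the RLMS file. Mapping to plain strings avoids the categorical restriction, so two codes can share one label.

```python
import pandas as pd

# Hypothetical value labels for one column; codes 2 and 3 share a label,
# which a pandas categorical would not accept.
value_labels = {1: "Yes", 2: "No", 3: "No"}

# Made-up integer codes for that column.
codes = pd.Series([1, 2, 3, 2])

# Series.map looks each code up in the dict; codes without an entry become NaN.
labelled = codes.map(value_labels)

print(labelled.tolist())  # ['Yes', 'No', 'No', 'No']
```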