can't read large stata file
See original GitHub issueI am trying to read the panel dataset of Russian individuals in stata format. The dataset can be freely obtained at the rlms site.
ind_dta = pd.read_stata('USER_RLMS-HSE_IND_1994_2017_v2_eng.dta')
This results in memory error and that seems strange. Machine has 16gb of memory, the file is less than 4gb.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None python: 3.6.6.final.0 python-bits: 64 OS: Linux OS-release: 4.15.0-46-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8
pandas: 0.23.4 pytest: 4.0.0 pip: 18.1 setuptools: 40.6.2 Cython: 0.29 numpy: 1.15.4 scipy: 1.1.0 pyarrow: None xarray: None IPython: 7.1.1 sphinx: 1.8.2 patsy: 0.5.1 dateutil: 2.7.5 pytz: 2018.7 blosc: None bottleneck: 1.2.1 tables: 3.4.4 numexpr: 2.6.8 feather: None matplotlib: 3.0.1 openpyxl: 2.5.9 xlrd: 1.1.0 xlwt: 1.3.0 xlsxwriter: 1.1.2 lxml: 4.2.5 bs4: 4.6.3 html5lib: 1.0.1 sqlalchemy: 1.2.14 pymysql: None psycopg2: None jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None
Issue Analytics
- State:
- Created 5 years ago
- Comments:8 (5 by maintainers)
I have taken a look and the major isue is in replacing missing values. This data set has many columns with missing columns. ~2600 out of 2700. These are mostly integer columns, often byte (1 byte) which don’t require much storage. Converting these requires casting the values to doubles which requires 8 bytes/entry. This effectively blows up the dataset by a factor if 4ish (some columns have larger types), which makes it impractically big. I suppose the correct solution would be to use an extension type that supports the correct bit width and a missing value. This needs the extension type API to stabilize.
A side problem that is probably worth fixing is that the conversion of missing values is very slow. I did a quick hack that reduced the conversion time by a factor of about 1000.
For now, you can use the lower level StataReader and not convert missing values (you will need to handle them your self). That will get you past at least one problem.
The other issue is that the labels are not unique. That is, 2 values in stata are getting the same lable. Pandas categoricals don’t support this. A work around:
You will then have to apply labels yourself, if you need them.