can't read large stata file


I am trying to read a panel dataset of Russian individuals in Stata format. The dataset can be freely obtained from the RLMS site.

ind_dta = pd.read_stata('USER_RLMS-HSE_IND_1994_2017_v2_eng.dta')

This results in a MemoryError, which seems strange: the machine has 16 GB of memory and the file is less than 4 GB.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 4.0.0
pip: 18.1
setuptools: 40.6.2
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

bashtage commented, Mar 19, 2019

I have taken a look, and the major issue is in replacing missing values. This data set has many columns with missing values: ~2600 out of 2700. These are mostly integer columns, often byte (1 byte), which don't require much storage. Converting these requires casting the values to doubles, which require 8 bytes per entry. This effectively blows up the dataset by a factor of 4 or so (some columns have larger types), which makes it impractically big. I suppose the correct solution would be to use an extension type that supports the correct bit width and a missing value. This needs the extension type API to stabilize.
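To make the arithmetic concrete, here is a minimal sketch of the per-column cost. It assumes a single synthetic column of one million byte-sized integers (not the RLMS data) and compares the raw size with the size after a cast to float64. The last line uses the nullable "Int8" extension dtype that later pandas releases provide; whether read_stata can return that dtype is not claimed here.

```python
import numpy as np
import pandas as pd

n_rows = 1_000_000  # hypothetical column length, not taken from the issue

# A Stata "byte" column needs one byte per entry.
as_int8 = np.ones(n_rows, dtype=np.int8)

# Making room for NaN means casting to float64: eight bytes per entry.
as_float64 = as_int8.astype(np.float64)

# Nullable extension dtype: integer data plus a separate boolean mask.
as_nullable = pd.array(as_int8, dtype="Int8")

print("int8:   ", as_int8.nbytes)
print("float64:", as_float64.nbytes)
print("Int8:   ", as_nullable.nbytes)
```

A single byte column grows eightfold when cast to float64; the overall factor for a whole file is lower when some columns already use wider types.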

A side problem that is probably worth fixing is that the conversion of missing values is very slow. I did a quick hack that reduced the conversion time by a factor of about 1000.

For now, you can use the lower-level StataReader and not convert missing values (you will need to handle them yourself). That will get you past at least one problem.

bashtage commented, Mar 19, 2019

The other issue is that the labels are not unique. That is, two values in Stata are getting the same label. Pandas categoricals don't support this. A workaround:

from pandas.io.stata import StataReader

file_name = r'C:\temp\USER_RLMS-HSE_IND_1994_2017_v2_eng.dta'
# Read in blocks of 1000 rows and leave categorical codes unlabelled.
sr = StataReader(file_name, convert_missing=False, chunksize=1000, convert_categoricals=False)
labels = sr.value_labels()  # To use later
# Grab only the first block to inspect it.
for block in sr:
    temp = block
    break
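The loop above stops after the first block. Below is a rough sketch of consuming every block while keeping only what is needed from each, so the full dataset never sits in memory at once. The small demo file, the column names (idind, year, status), and the filter are all made up for illustration; only read_stata with chunksize, columns, and convert_categoricals is taken from pandas itself.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in data; the real file from the issue is not used here.
frame = pd.DataFrame({
    "idind": list(range(1, 11)),
    "year": [1994, 1995] * 5,
    "status": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.dta")
    frame.to_stata(path, write_index=False)

    pieces = []
    # With chunksize set, read_stata returns a reader that yields DataFrames.
    with pd.read_stata(
        path,
        chunksize=4,
        columns=["idind", "status"],  # load only the columns you need
        convert_categoricals=False,
    ) as reader:
        for block in reader:
            # Reduce each block before keeping it.
            pieces.append(block[block["status"] == 1])

subset = pd.concat(pieces, ignore_index=True)
print(subset)
```

Selecting columns and filtering per block are two independent savings; either can be dropped if the whole block is needed.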

You will then have to apply labels yourself, if you need them.
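One possible way to apply the labels by hand is sketched below. It assumes the dictionary returned by value_labels() maps a label-set name to a {value: label} dictionary, and that each label set is named after its column unless told otherwise. The helper apply_value_labels, its label_names argument, and the sample data are hypothetical. Because the result is plain strings rather than a categorical, two codes sharing one label is not a problem.

```python
import pandas as pd


def apply_value_labels(frame, labels, label_names=None):
    """Return a copy of frame with coded columns replaced by label text.

    labels: {label_set_name: {value: label}}
    label_names: optional {column: label_set_name}; defaults to the column name.
    """
    label_names = label_names or {}
    out = frame.copy()
    for column in out.columns:
        mapping = labels.get(label_names.get(column, column))
        if mapping is None:
            continue  # no label set for this column
        # Codes without a label are left as they are.
        out[column] = out[column].map(lambda v, m=mapping: m.get(v, v))
    return out


# Hypothetical label set in which codes 2 and 3 share a label.
labels = {"status": {1: "Employed", 2: "Not employed", 3: "Not employed"}}
block = pd.DataFrame({"status": [1, 2, 3, 1], "age": [30, 41, 52, 63]})

labelled = apply_value_labels(block, labels)
print(labelled)
```

If a column's label set has a different name in the file, pass that pairing through label_names.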
