Trouble writing to_stata with a GzipFile

Problem description

When a Stata dataset is written to a GzipFile, the written dataset is all zero/blank. Ideally, pandas would write the correct data to the GzipFile Stata output; if that’s not an easy change, it could raise an error when the user tries to write to a GzipFile.

Expected Output

I expected to read back the same data I tried to write, or to get an error when writing.

Here’s the table I tried to write to the GzipFile (df in the code):

a b c
1 1.5 “z”

Here’s the table that gets read back (df_from_gzip in the code):

a b c
0 0.0 “”

I think this is an error in writing, rather than in reading back, because Stata reads the same all-zeros table.

Code Sample


import pandas as pd
import gzip
import subprocess


df = pd.DataFrame({
    'a': [1],
    'b': [1.5],
    'c': ["z"]})

# Use GzipFile to write a compressed version:
with gzip.GzipFile("test_gz.dta.gz", mode = "wb") as f:
    df.to_stata(f, write_index = False)

# Use the system gunzip to extract (using GzipFile fails; see attempt below)
subprocess.run(["gunzip", "--keep", "test_gz.dta.gz"])
df_from_gzip = pd.read_stata("test_gz.dta")

print(df)
print(df_from_gzip)

Other fun facts

  • bz2.BZ2File and lzma.LZMAFile refuse to write dta files, with the error “UnsupportedOperation: Seeking is only supported on files open for reading”
  • Everything works for feather files.
  • This isn’t an issue with read_stata; opening the files in Stata itself gives the same results.
  • Variable types are retained.
  • Value labels for categorical variables are written correctly.
  • The number of rows is correct, even for larger examples.
  • Reading a system-compressed Stata file is fine.

import bz2
import lzma


# Try to read the compressed file created before -- fails with the message
# "Not a gzipped file (b'\x01\x00')". I'm not sure why, but it's not central
# to this issue.
with gzip.GzipFile("test_gz.dta.gz") as f:
    df2 = pd.read_stata(f)

    
# Writing feather files to these compressed connections works:
with gzip.GzipFile("test_gz.feather.gz", mode = "wb") as f:
    df.to_feather(f)
with bz2.BZ2File("test_bz.feather.bz2", mode = "wb") as f:
    df.to_feather(f)
with lzma.LZMAFile("test_xz.feather.xz", mode = "wb") as f:
    df.to_feather(f)
        

# Next, writing stata files with other compressors fails because the
# file isn't open for reading.
with bz2.BZ2File("test_bz.dta.bz", mode = "wb") as f:
    df.to_stata(f)  # this raises an error
with lzma.LZMAFile("test_xz.dta.xz", mode = "wb") as f:
    df.to_stata(f)  # this also raises an error


# But reading a system-compressed Stata file works:
df.to_stata("test.dta", write_index = False)
subprocess.run(["gzip", "test.dta"])
with gzip.GzipFile("test.dta.gz") as f:
    # Reduce the element-wise comparison explicitly; all(DataFrame) only checks column labels
    assert (pd.read_stata(f) == df).all().all()
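
The seek error quoted in the first bullet can be reproduced with the standard library alone, without pandas. The sketch below is only an illustration of how the three compressed file classes treat seeking when open for writing; it does not claim that seeking is the cause of the all-zero output. The helper names backward_seek_error and gzip_forward_seek_payload are made up for this example.

```python
import bz2
import gzip
import io
import lzma


def backward_seek_error(open_stream):
    """Write a few bytes to a write-mode stream, then try to seek back to 0.

    Returns the exception class name, or None if the seek succeeded.
    """
    with open_stream(io.BytesIO()) as stream:
        stream.write(b"abc")
        try:
            stream.seek(0)
        except (OSError, ValueError) as exc:
            return type(exc).__name__
    return None


def gzip_forward_seek_payload():
    """Seek forward past the current offset in a write-mode GzipFile
    and return the decompressed bytes that end up in the stream."""
    sink = io.BytesIO()
    with gzip.GzipFile(fileobj=sink, mode="wb") as gz:
        gz.write(b"ab")
        gz.seek(5)  # skip three bytes that were never written
        gz.write(b"c")
    return gzip.decompress(sink.getvalue())


results = {
    "gzip": backward_seek_error(lambda s: gzip.GzipFile(fileobj=s, mode="wb")),
    "bz2": backward_seek_error(lambda s: bz2.BZ2File(s, mode="wb")),
    "lzma": backward_seek_error(lambda s: lzma.LZMAFile(s, mode="wb")),
}
print(results)
print(gzip_forward_seek_payload())
```

On CPython, BZ2File and LZMAFile reject any seek while open for writing, whereas a write-mode GzipFile rejects only backward seeks and fills a forward seek with zero bytes. Any writer that relies on seeking in its output will therefore behave differently across these three classes.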

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-20-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.3
setuptools: 39.1.0
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: 0.9.0
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.5
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

TomAugspurger commented, May 15, 2018

Thanks! Could you narrow down your example to a minimal example? It’s hard to see exactly what the problem is with that long of an input. http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

bashtage commented, May 22, 2018

Could use something like:

import gzip
import tempfile

with gzip.GzipFile('test.dta.gz', 'wb') as gz, tempfile.NamedTemporaryFile(delete=False) as ntf:
    df.to_stata(ntf)
    ntf.flush()  # make sure buffered bytes reach disk before re-reading
    with open(ntf.name, 'rb') as ntf2:
        gz.write(ntf2.read())

with current Pandas.

The patch fixes this issue so that a standard gzip can be used. It should be in 0.23.1.
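
A variant of the same workaround that avoids the temporary file is to let the writer fill an in-memory buffer and then compress the finished bytes in one pass. This is a sketch, not the patch itself: write_gzipped and its write_fn callback are hypothetical names, and the pandas usage in the trailing comment assumes that to_stata accepts a binary buffer and leaves it open in the installed version.

```python
import gzip
import io


def write_gzipped(path, write_fn):
    """Run write_fn against an in-memory buffer, then gzip the bytes to path.

    write_fn is a hypothetical callback that takes a binary file-like
    object and writes the uncompressed payload to it.
    """
    buffer = io.BytesIO()
    write_fn(buffer)  # the writer may seek freely inside a BytesIO
    with gzip.open(path, "wb") as gz:
        gz.write(buffer.getvalue())  # one sequential write into the gzip stream


# Possible use with pandas (assumption: to_stata accepts a binary buffer):
# write_gzipped("test.dta.gz", lambda buf: df.to_stata(buf, write_index=False))
```

The whole uncompressed file is held in memory, so this suits datasets that fit comfortably in RAM; for larger ones the temporary-file version above is the safer choice.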
