BUG: read_stata ignores columns parameter and dtypes of empty dta files

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# create an empty DataFrame with int64 and float64 dtypes
df = pd.DataFrame(data={"a": range(3), "b": [1.0, 2.0, 3.0]}).head(0)

# write to Stata .dta file
df.to_stata('empty.dta', write_index=False, version=117)

# read one column of empty .dta file
df2 = pd.read_stata('empty.dta', columns=["a"])

# show dtypes of df2
df2.dtypes

Issue Description

A Stata .dta file with zero rows still carries type information, but pd.read_stata returns every column with object dtype when the file is empty. It also ignores the columns parameter and reads all of the columns.

Expected Behavior

In the above example, df2 should contain only column a, with its integer dtype. Instead, df2.dtypes returns both columns as object:

In [2]: df2.dtypes
Out[2]:
a    object
b    object
dtype: object

Installed Versions

Apologies, pd.show_versions() fails for some reason. I’ve included the traceback below; the pandas version is 1.4.1.

In [5]: pd.__version__
Out[5]: '1.4.1'

In [3]: pd.show_versions()
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [3], in <module>
----> 1 pd.show_versions()

File ~/mambaforge/envs/py310/lib/python3.10/site-packages/pandas/util/_print_versions.py:109, in show_versions(as_json)
     94 """
     95 Provide useful information, important for bug reports.
     96
   (...)
    106     * If True, outputs info in JSON format to the console.
    107 """
    108 sys_info = _get_sys_info()
--> 109 deps = _get_dependency_info()
    111 if as_json:
    112     j = {"system": sys_info, "dependencies": deps}

File ~/mambaforge/envs/py310/lib/python3.10/site-packages/pandas/util/_print_versions.py:88, in _get_dependency_info()
     86 result: dict[str, JSONSerializable] = {}
     87 for modname in deps:
---> 88     mod = import_optional_dependency(modname, errors="ignore")
     89     result[modname] = get_version(mod) if mod else None
     90 return result

File ~/mambaforge/envs/py310/lib/python3.10/site-packages/pandas/compat/_optional.py:126, in import_optional_dependency(name, extra, errors, min_version)
    121 msg = (
    122     f"Missing optional dependency '{install_name}'. {extra} "
    123     f"Use pip or conda to install {install_name}."
    124 )
    125 try:
--> 126     module = importlib.import_module(name)
    127 except ImportError:
    128     if errors == "raise":

File ~/mambaforge/envs/py310/lib/python3.10/importlib/__init__.py:126, in import_module(name, package)
    124             break
    125         level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)

File <frozen importlib._bootstrap>:1050, in _gcd_import(name, package, level)

File <frozen importlib._bootstrap>:1027, in _find_and_load(name, import_)

File <frozen importlib._bootstrap>:1006, in _find_and_load_unlocked(name, import_)

File <frozen importlib._bootstrap>:688, in _load_unlocked(spec)

File <frozen importlib._bootstrap_external>:883, in exec_module(self, module)

File <frozen importlib._bootstrap>:241, in _call_with_frames_removed(f, *args, **kwds)

File ~/mambaforge/envs/py310/lib/python3.10/site-packages/setuptools/__init__.py:8, in <module>
      5 import os
      6 import re
----> 8 import _distutils_hack.override  # noqa: F401
     10 import distutils.core
     11 from distutils.errors import DistutilsOptionError

File ~/mambaforge/envs/py310/lib/python3.10/site-packages/_distutils_hack/override.py:1, in <module>
----> 1 __import__('_distutils_hack').do_override()

File ~/mambaforge/envs/py310/lib/python3.10/site-packages/_distutils_hack/__init__.py:72, in do_override()
     70 if enabled():
     71     warn_distutils_present()
---> 72     ensure_local_distutils()

File ~/mambaforge/envs/py310/lib/python3.10/site-packages/_distutils_hack/__init__.py:59, in ensure_local_distutils()
     57 # check that submodules load as expected
     58 core = importlib.import_module('distutils.core')
---> 59 assert '_distutils' in core.__file__, core.__file__
     60 assert 'setuptools._distutils.log' not in sys.modules

Issue Analytics

Created 2 years ago
Comments: 6 (4 by maintainers)

Top GitHub Comments

Pydare commented, Mar 8, 2022

A stata .dta file with zero rows still has type information, but when you try to read an empty .dta file using pd.read_stata all of the columns have object type.

I think this occurs for other file types as well. I tried it for .csv and .xlsx file types and the same thing occurred.

The second part of this bug (the ignored columns parameter) did not occur with these file types, so I am guessing it is an issue with the read_stata method itself.
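To make the CSV comparison concrete, here is a minimal sketch using an in-memory, header-only CSV. The names empty_csv, df_all and df_a are made up for illustration. Note that a CSV file stores no type information, so there is nothing to lose in that case; the point is only that column selection still works on an empty input.

```python
import io

import pandas as pd

# Header-only CSV: two columns, zero rows (illustrative data).
empty_csv = "a,b\n"

# With no rows to infer from, the columns do not get a numeric dtype.
df_all = pd.read_csv(io.StringIO(empty_csv))

# usecols, the read_csv counterpart of columns, is still honoured
# even though the input has no rows.
df_a = pd.read_csv(io.StringIO(empty_csv), usecols=["a"])

print(df_all.dtypes)
print(list(df_a.columns))
```

On the pandas 1.4 series discussed in this thread, I would expect the first print to show object for both columns.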

sterlinm commented, Jul 8, 2022

@simonjayhawkins You’re right, I mixed it up because I was highlighting two separate issues with reading empty files:

pd.read_stata ignores the columns parameter when reading an empty file.
pd.read_stata loses dtype information when reading an empty file.

Here’s an updated example:

Reproducible Example

import numpy as np
import pandas as pd
from pandas.io.stata import StataReader

# create a DataFrame with int32 and float64 dtypes
df = pd.DataFrame(data={"a": range(3), "b": [1.0, 2.0, 3.0]})
# assign the column directly; setting through .loc[:, 'a'] can keep the old int64 dtype on newer pandas
df['a'] = df['a'].astype('int32')
df_empty = df.head(0)

# write the empty and non-empty DataFrames to .dta files
df.to_stata('nonempty.dta', write_index=False, version=117)
df_empty.to_stata('empty.dta', write_index=False, version=117)

# column variables
expected_cols = pd.Index(['a'])
all_cols = df.columns

# reading one column of non-empty .dta file works
assert pd.read_stata('nonempty.dta', columns=["a"]).columns.equals(expected_cols)

# reading one column of empty .dta file does not work: all columns come back
assert pd.read_stata('empty.dta', columns=["a"]).columns.equals(all_cols)
assert pd.read_stata('empty.dta', columns=["xyz"]).columns.equals(all_cols)  # should raise error

# reading non-empty .dta file retains correct dtypes
assert pd.read_stata('nonempty.dta').dtypes.equals(df.dtypes)

# reading empty .dta file makes all the columns object columns
assert (pd.read_stata('empty.dta').dtypes == 'object').all()

# we can confirm that the empty .dta file does retain the type information
expected_dtyplist = [np.dtype('int32'), np.dtype('float64')]
assert StataReader('nonempty.dta').dtyplist == expected_dtyplist
assert StataReader('empty.dta').dtyplist == expected_dtyplist

Expected Behavior

In the above example pd.read_stata('empty.dta').dtypes should return:

In [2]: pd.read_stata('empty.dta').dtypes
Out[2]:
a      int32
b    float64
dtype: object
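Until the reader handles empty files correctly, one possible workaround is to re-apply the column selection and dtypes after reading. The sketch below is not part of pandas: restore_schema is a hypothetical helper, and it assumes the caller already knows the expected dtypes. The broken frame stands in for what pd.read_stata('empty.dta') is reported to return in this thread.

```python
import pandas as pd


def restore_schema(df, schema, columns=None):
    """Hypothetical helper: keep only the wanted columns and cast them
    to the dtypes listed in `schema` (a dict of column name -> dtype)."""
    wanted = list(schema) if columns is None else list(columns)
    # Fail loudly on unknown names instead of silently returning everything.
    missing = [c for c in wanted if c not in df.columns or c not in schema]
    if missing:
        raise ValueError(f"columns not found: {missing}")
    return df[wanted].astype({c: schema[c] for c in wanted})


# Stand-in for the reported result of reading the empty file:
# every column present, every column object dtype, zero rows.
broken = pd.DataFrame({
    "a": pd.Series(dtype="object"),
    "b": pd.Series(dtype="object"),
})

# Assumed schema, matching the dtypes used in the example above.
schema = {"a": "int32", "b": "float64"}

fixed = restore_schema(broken, schema, columns=["a"])
print(fixed.dtypes)
```

Passing a name such as "xyz" raises ValueError here, which mirrors the "should raise error" expectation in the example above.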
