BUG: read_stata ignores columns parameter and dtypes of empty dta files
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
# create an empty DataFrame with int64 and float64 dtypes
df = pd.DataFrame(data={"a": range(3), "b": [1.0, 2.0, 3.0]}).head(0)
# write to Stata .dta file
df.to_stata('empty.dta', write_index=False, version=117)
# read one column of empty .dta file
df2 = pd.read_stata('empty.dta', columns=["a"])
# show dtypes of df2
df2.dtypes
Issue Description
A Stata .dta file with zero rows still has type information, but when you read an empty .dta file using pd.read_stata, all of the columns have object dtype. It also ignores the columns parameter and reads all of the columns.
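A possible workaround until this is fixed, as a minimal sketch: read the file, then select the columns and cast the dtypes by hand. The helper name read_stata_subset and its dtypes argument are hypothetical and not part of pandas. The caller has to supply the dtypes, because this sketch does not recover the types stored in the file.

```python
import os
import tempfile

import pandas as pd


def read_stata_subset(path, columns, dtypes):
    """Hypothetical helper: read a .dta file, then enforce columns and dtypes.

    `dtypes` maps column name -> dtype and must be supplied by the caller;
    nothing here recovers the types stored in the file itself.
    """
    df = pd.read_stata(path)
    # Select explicitly, in case `columns=` was ignored for an empty file.
    df = df[list(columns)]
    # Cast explicitly, in case the empty frame came back as object dtype.
    return df.astype({col: dtypes[col] for col in columns})


# Same empty frame as in the reproducible example.
df = pd.DataFrame(data={"a": range(3), "b": [1.0, 2.0, 3.0]}).head(0)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "empty.dta")
    df.to_stata(path, write_index=False, version=117)
    df2 = read_stata_subset(path, columns=["a"], dtypes={"a": "int64"})

print(df2.dtypes)
```

Selecting and casting after the read should behave the same whether or not the installed pandas has the bug, so the helper is safe to leave in place after upgrading.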
Expected Behavior
In the above example, df2.dtypes should return:
In [2]: df2.dtypes
Out[2]:
a object
b object
dtype: object
Installed Versions
Apologies, pd.show_versions() fails for some reason. I’ve included the traceback, but the pandas version is 1.4.1.
In [5]: pd.__version__
Out[5]: '1.4.1'
In [3]: pd.show_versions()
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Input In [3], in <module>
----> 1 pd.show_versions()
File ~/mambaforge/envs/py310/lib/python3.10/site-packages/pandas/util/_print_versions.py:109, in show_versions(as_json)
94 """
95 Provide useful information, important for bug reports.
96
(...)
106 * If True, outputs info in JSON format to the console.
107 """
108 sys_info = _get_sys_info()
--> 109 deps = _get_dependency_info()
111 if as_json:
112 j = {"system": sys_info, "dependencies": deps}
File ~/mambaforge/envs/py310/lib/python3.10/site-packages/pandas/util/_print_versions.py:88, in _get_dependency_info()
86 result: dict[str, JSONSerializable] = {}
87 for modname in deps:
---> 88 mod = import_optional_dependency(modname, errors="ignore")
89 result[modname] = get_version(mod) if mod else None
90 return result
File ~/mambaforge/envs/py310/lib/python3.10/site-packages/pandas/compat/_optional.py:126, in import_optional_dependency(name, extra, errors, min_version)
121 msg = (
122 f"Missing optional dependency '{install_name}'. {extra} "
123 f"Use pip or conda to install {install_name}."
124 )
125 try:
--> 126 module = importlib.import_module(name)
127 except ImportError:
128 if errors == "raise":
File ~/mambaforge/envs/py310/lib/python3.10/importlib/__init__.py:126, in import_module(name, package)
124 break
125 level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)
File <frozen importlib._bootstrap>:1050, in _gcd_import(name, package, level)
File <frozen importlib._bootstrap>:1027, in _find_and_load(name, import_)
File <frozen importlib._bootstrap>:1006, in _find_and_load_unlocked(name, import_)
File <frozen importlib._bootstrap>:688, in _load_unlocked(spec)
File <frozen importlib._bootstrap_external>:883, in exec_module(self, module)
File <frozen importlib._bootstrap>:241, in _call_with_frames_removed(f, *args, **kwds)
File ~/mambaforge/envs/py310/lib/python3.10/site-packages/setuptools/__init__.py:8, in <module>
5 import os
6 import re
----> 8 import _distutils_hack.override # noqa: F401
10 import distutils.core
11 from distutils.errors import DistutilsOptionError
File ~/mambaforge/envs/py310/lib/python3.10/site-packages/_distutils_hack/override.py:1, in <module>
----> 1 __import__('_distutils_hack').do_override()
File ~/mambaforge/envs/py310/lib/python3.10/site-packages/_distutils_hack/__init__.py:72, in do_override()
70 if enabled():
71 warn_distutils_present()
---> 72 ensure_local_distutils()
File ~/mambaforge/envs/py310/lib/python3.10/site-packages/_distutils_hack/__init__.py:59, in ensure_local_distutils()
57 # check that submodules load as expected
58 core = importlib.import_module('distutils.core')
---> 59 assert '_distutils' in core.__file__, core.__file__
60 assert 'setuptools._distutils.log' not in sys.modules
Issue Analytics
- State:
- Created 2 years ago
- Comments: 6 (4 by maintainers)
Top GitHub Comments
I think this occurs for other file types as well. I tried it with .csv and .xlsx files and the same thing occurred.
The second part of this bug did not occur for those file types, so I am guessing it’s an issue with the method.
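For comparison, here is a small sketch of the CSV case using an in-memory header-only file. It assumes "the second part" refers to the columns parameter, for which the read_csv counterpart is usecols. Unlike a .dta file, a CSV stores no type information, so the dtypes have to come from the caller.

```python
import io

import pandas as pd

# A CSV with a header row and no data rows (stand-in for an empty file on disk).
header_only = "a,b\n"

# With no dtype given, pandas has nothing to infer the types from.
plain = pd.read_csv(io.StringIO(header_only))
print(plain.dtypes)

# usecols restricts the columns even though there are no rows,
# and an explicit dtype mapping is applied to the empty column.
subset = pd.read_csv(io.StringIO(header_only), usecols=["a"], dtype={"a": "int64"})
print(subset.dtypes)
```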
@simonjayhawkins You’re right, I mixed it up because I was highlighting two separate issues with reading empty files:
- pd.read_stata ignores the columns parameter when reading an empty file.
- pd.read_stata loses dtype information when reading an empty file.
Here’s an updated example:
Reproducible Example
Expected Behavior
In the above example, pd.read_stata('empty.dta').dtypes should return: