
BUG: `read_stata` always uses 'utf8'


Code Sample, a copy-pastable example if possible

import pandas
data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576)
for chunk in data:
    pass # do something with chunk (never reached)

This raises UnicodeDecodeError: 'utf8' codec can't decode byte 0x?? in position ?: invalid start byte. OK, so the file isn’t UTF-8, even though the StataReader doesn’t claim any Unicode support in the first place. I then tried opening it with a latin-1 encoding:

import pandas
data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576, encoding='latin-1')
for chunk in data:
    pass # do something with chunk (never reached)

This raises the same exception at exactly the same place (still utf-8).
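The error itself can be reproduced without a Stata file. The following is a minimal sketch, assuming Python 3 and a made-up sample string (the issue was reported on Python 2.7, where the message is worded slightly differently). It only shows that Latin-1 bytes with a non-ASCII character are not valid UTF-8; it does not exercise pandas at all.

```python
# Hypothetical sample text; any non-ASCII Latin-1 character would do.
text = "café"
raw = text.encode("latin-1")  # the single byte 0xE9 stands for "é"

# Decoding the same bytes as UTF-8 fails, because 0xE9 on its own
# is not a complete UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding with the encoding the bytes were written in succeeds.
latin1_result = raw.decode("latin-1")
```

This is why a hard-coded UTF-8 decode inside the reader would fail regardless of the encoding argument passed by the caller.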

Problem description

read_stata does not appear to honour the encoding argument. I think this line introduced the bug. The StataReader does not handle any encoding other than ascii or latin-1.

Changing line 1338 of the pandas.io.stata module from:

        return s.decode('utf-8')

to:

        return s.decode('latin-1')

seemed to make everything work, and I could read the data from the given file. Even better, changing it to the following:

        return s.decode(self._encoding or self._default_encoding)

also seems to have made it work.
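The idea behind that last change can be sketched in isolation. The class below is hypothetical and is not the pandas implementation: the class name, the constructor, and the null-byte stripping are assumptions made for illustration. Only the `self._encoding or self._default_encoding` expression comes from the report above.

```python
class StringDecoderSketch:
    """Hypothetical stand-in for the reader's string decoding step."""

    # Assumed default, matching the latin-1 fallback described in the issue.
    _default_encoding = "latin-1"

    def __init__(self, encoding=None):
        # None means "the caller did not ask for a specific encoding".
        self._encoding = encoding

    def decode(self, s: bytes) -> str:
        # Assumption: fixed-width string fields are padded with null bytes,
        # so everything from the first null onwards is dropped.
        s = s.partition(b"\0")[0]
        # Honour the caller's encoding, falling back to the default.
        return s.decode(self._encoding or self._default_encoding)


default_decoder = StringDecoderSketch()
utf8_decoder = StringDecoderSketch(encoding="utf-8")

from_latin1 = default_decoder.decode(b"caf\xe9\x00\x00")
from_utf8 = utf8_decoder.decode("café".encode("utf-8") + b"\x00")
```

With this shape, the encoding is decided once by the caller instead of being fixed at the decode site.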

I believe, though, that to make this work with Unicode too, you’d have to add the following encodings to VALID_ENCODINGS: utf-8, utf8, iso10646.
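One alternative to listing every alias by hand is to normalise the name before checking it. The following is only a sketch of that design choice, not what pandas does: `SUPPORTED_CANONICAL` and `is_supported_encoding` are hypothetical names, and the set of accepted encodings is an assumption based on the ones discussed above. It relies on the standard library's `codecs.lookup`, which resolves aliases such as `utf8` or `L1` to a canonical codec name.

```python
import codecs

# Hypothetical allow-list, expressed as canonical codec names as
# reported by codecs.lookup(...).name.
SUPPORTED_CANONICAL = {"ascii", "iso8859-1", "utf-8"}


def is_supported_encoding(name: str) -> bool:
    """Return True if `name` is an alias of one of the supported codecs."""
    try:
        canonical = codecs.lookup(name).name
    except LookupError:
        # Unknown codec names are rejected rather than raising.
        return False
    return canonical in SUPPORTED_CANONICAL


checks = {name: is_supported_encoding(name)
          for name in ("utf-8", "utf8", "latin-1", "L1", "cp1252", "no-such-codec")}
```

Whether every alias named in the issue resolves this way depends on the Python version's alias table, so this is worth verifying before relying on it.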

Expected Output

The file should be read and parsed correctly.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: ro_RO.UTF-8
LANG: ro_RO.UTF-8
LOCALE: None.None

pandas: 0.24.0.dev0+41.gb2eec25
pytest: 3.2.3
pip: 9.0.3
setuptools: 36.6.0
Cython: 0.28.2
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.6.3
patsy: None
dateutil: 2.7.3
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.9
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 3.8.0
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


Top GitHub Comments

1 reaction

hudcap commented, Feb 24, 2019

I am still having issues with this. I’m using a Stata file in format 118, and I’m getting the same UnicodeDecodeError. When I edit the stata.py file to use latin-1, as suggested by @adrian-castravete, everything works.

0 reactions

leolovethewayyoulie commented, Mar 11, 2020

import pandas as pd
pd.read_stata("data.dta")

Haha, thank you so much, dude! Since I installed the newest version, it works. It still shows the warning, but I guess that’s alright. Thank you so much ❤️❤️❤️❤️❤️
