BUG: `read_stata` always uses 'utf8'

Code Sample, a copy-pastable example if possible

import pandas
data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576)
for chunk in data:
    pass # do something with chunk (never reached)

This raises UnicodeDecodeError: 'utf8' codec can't decode byte 0x?? in position ?: invalid start byte. So the file isn't a UTF-8 one. Even though the StataReader doesn't claim any Unicode support, I then tried to open the file with a latin-1 encoding:

import pandas
data = pandas.read_stata(file_with_latin1_encoding, chunksize=1048576, encoding='latin-1')
for chunk in data:
    pass # do something with chunk (never reached)

This raises the same exception at exactly the same place (the decoder is still utf-8).
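The decoding mismatch itself can be reproduced without pandas or a Stata file. The following is only a minimal sketch: the sample bytes are made up for illustration and are not taken from the file in this report.

```python
# Hypothetical sample: "© 2018" encoded as latin-1.
# The byte 0xA9 is a UTF-8 continuation byte, so it cannot start a character.
raw = b"\xa9 2018"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # Same exception class that read_stata surfaces in this report.
    utf8_ok = False

# latin-1 maps every byte to a code point, so this decode cannot fail.
text = raw.decode("latin-1")
```

If the reader decodes with utf-8 regardless of the `encoding` argument, any such byte in a string column would be enough to trigger the error described above.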

Problem description

This is a problem because read_stata appears not to honour the encoding argument. I think this line introduced a bug. The StataReader isn't meant to handle any encoding other than ascii or latin-1.

Changing line 1338 of the pandas.io.stata module from:

        return s.decode('utf-8')

to:

        return s.decode('latin-1')

This seemed to make everything work, and I could read the data from the given file. Even better, changing it to the following:

        return s.decode(self._encoding or self._default_encoding)

also seems to have made it work.
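The fallback in that last variant can be illustrated in isolation. This is a sketch only: `ToyReader` is a hypothetical stand-in written for this example, not the pandas `StataReader`, and the `latin-1` default is an assumption based on the description above.

```python
class ToyReader(object):
    """Hypothetical stand-in for illustration; not the pandas StataReader."""

    # Assumed default, mirroring the latin-1 behaviour described in the issue.
    _default_encoding = "latin-1"

    def __init__(self, encoding=None):
        # None means "caller did not ask for a specific encoding".
        self._encoding = encoding

    def _decode(self, s):
        # Use the caller's encoding when given, otherwise fall back to the default.
        return s.decode(self._encoding or self._default_encoding)


default_text = ToyReader()._decode(b"\xa9")
utf8_text = ToyReader(encoding="utf-8")._decode(b"\xc2\xa9")
```

Both calls yield the same character, because each byte string is decoded with the encoding it was written in rather than a hardcoded one.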

I believe, though, that to make this work with Unicode too, you'd have to add the following encodings to VALID_ENCODINGS: utf-8, utf8, iso10646.
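One possible alternative to listing every spelling by hand is to normalize the name through Python's codec registry first. The sketch below is an assumption about how such a check could look, not what pandas does; `canonical_encoding`, `is_supported` and `ALLOWED` are hypothetical names, and whether a particular alias resolves depends on the Python installation.

```python
import codecs


def canonical_encoding(name):
    """Return Python's canonical codec name for an alias, or None if unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


# Hypothetical allow-list keyed on canonical names rather than on every alias.
ALLOWED = {"ascii", "iso8859-1", "utf-8"}


def is_supported(name):
    # "latin-1", "latin1" and "L1" all resolve to the same canonical codec name.
    return canonical_encoding(name) in ALLOWED
```

With this approach `is_supported("utf8")` and `is_supported("UTF-8")` give the same answer, so the allow-list stays short.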

Expected Output

The file should be correctly read and parsed

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: ro_RO.UTF-8
LANG: ro_RO.UTF-8
LOCALE: None.None

pandas: 0.24.0.dev0+41.gb2eec25
pytest: 3.2.3
pip: 9.0.3
setuptools: 36.6.0
Cython: 0.28.2
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.6.3
patsy: None
dateutil: 2.7.3
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.4.9
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 3.8.0
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 25 (12 by maintainers)

Top GitHub Comments

1 reaction
hudcap commented, Feb 24, 2019

I am still having issues with this. I’m using a 118 Stata file, and I’m getting the same UnicodeDecodeError. When I edit the stata.py file to use latin-1 as per @adrian-castravete, everything works.

0 reactions
leolovethewayyoulie commented, Mar 11, 2020
import pandas as pd
pd.read_stata("data.dta")

Haha, thank you so much dude. Since I installed the newest version, it works, although it still shows the warning, but I guess that's alright @@ Thank you so much ❤️❤️❤️❤️❤️
