Empty cells make pandas use float, even if read_csv(dtype={'FOO': str}) is used
Code Sample, a copy-pastable example if possible
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import pandas as pd
csv_path = 'test.csv'
df = pd.read_csv(csv_path, delimiter=';', quotechar='"',
                 decimal=',', encoding="ISO-8859-1", dtype={'FOO': str})
df.FOO = df.FOO.map(lambda n: n.zfill(6))
print(df)
test.csv:
FOO;BAR
01,23;4,56
1,23;45,6
;987
Problem description
When I use dtype={'FOO': str}, I expect pandas to treat the column as a string. This seems to work, but when an empty cell is present, pandas seems to switch to float.
Expected Output
FOO BAR
0 001,23 4.56
1 001,23 45.60
2 000000 987.00
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 3.2.2
pip: 9.0.1
setuptools: 20.7.0
Cython: None
numpy: 1.13.3
scipy: 0.19.0
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0b10
sqlalchemy: 1.1.14
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
Issue Analytics
- State:
- Created 6 years ago
- Reactions: 3
- Comments: 10 (6 by maintainers)
Top GitHub Comments
I don’t directly find another related issue, apart from https://github.com/pandas-dev/pandas/issues/1450. You can apply the approach from that issue here as well: add na_values=[], keep_default_na=False to read_csv if you want to prevent empty strings from being parsed as NaN.

@MartinThoma If you look at the values of the column, you will see that pandas correctly preserved the data as strings (as you specified with dtype={'FOO': str}). The only ‘gotcha’ is that empty strings are still seen as missing values (and thus converted to NaN) rather than kept as empty strings. Because NaN is a float, the lambda in your example fails on that row, which makes it look as if the column had switched to float.

So your solution of filling the missing values with an empty string (df.FOO.fillna(value="")) is actually fine.
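For reference, here is a minimal sketch of the two workarounds discussed in the comments, applied to the sample data from the report. It makes some assumptions: the CSV is read from an in-memory io.StringIO buffer instead of test.csv, the encoding argument is dropped because the sample is plain ASCII, and the names CSV_TEXT, read_sample, filled and kept are made up for illustration. It also uses .str.zfill(6) in place of the original map/lambda.

```python
import io

import pandas as pd

# Same content as test.csv in the report (in-memory stand-in, an assumption).
CSV_TEXT = "FOO;BAR\n01,23;4,56\n1,23;45,6\n;987\n"


def read_sample(**extra_kwargs):
    """Read the sample CSV with the options from the report plus any extras."""
    return pd.read_csv(
        io.StringIO(CSV_TEXT),
        delimiter=';',
        quotechar='"',
        decimal=',',
        dtype={'FOO': str},
        **extra_kwargs,
    )


# Default behaviour: the empty FOO cell is parsed as a missing value.
df_default = read_sample()
print(df_default['FOO'].isna().tolist())

# Workaround 1: fill the missing value with an empty string, then pad.
filled = df_default['FOO'].fillna("").str.zfill(6)
print(filled.tolist())

# Workaround 2: tell read_csv not to treat empty strings as NaN at all.
df_kept = read_sample(na_values=[], keep_default_na=False)
kept = df_kept['FOO'].str.zfill(6)
print(kept.tolist())
```

Both routes should give the zero-padded strings from the expected output. Note that keep_default_na=False applies to every column, so it also turns off the default NaN markers for columns such as BAR; fillna on the single column is the narrower change.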