read_csv fails with `TypeError: object cannot be converted to an IntegerDtype` yet succeeds when reading chunks
Code Sample, a copy-pastable example if possible
Download this file upload.txt
# Your code here
import pandas as pd
from enum import Enum, IntEnum, auto
import argparse

# I attached the file in the github issue
filename = "upload.txt"

# this field is coded on 64 bits so 'UInt64' looks perfect.
column = "tcp.options.mptcp.sendkey"

with open(filename) as fd:
    print("READ CHUNK BY CHUNK")
    res = pd.read_csv(
        fd,
        comment='#',
        sep='|',
        dtype={column: 'UInt64'},
        usecols=[column],
        chunksize=1
    )
    for chunk in res:
        # print("chunk %d" % i)
        print(chunk)

    fd.seek(0)  # rewind
    print("READ THE WHOLE FILE AT ONCE ")
    res = pd.read_csv(
        fd,
        comment='#',
        sep='|',
        usecols=[column],
        dtype={"tcp.options.mptcp.sendkey": 'UInt64'}
    )
    print(res)
If I read in chunks, read_csv succeeds; if I try to read the whole column at once, I get:
Traceback (most recent call last):
File "test2.py", line 34, in <module>
dtype={"tcp.options.mptcp.sendkey": 'UInt64' }
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 435, in _read
data = parser.read(nrows)
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 1139, in read
ret = self._engine.read(nrows)
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 1995, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 900, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 992, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1124, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1155, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas/_libs/parsers.pyx", line 1235, in pandas._libs.parsers.TextReader._convert_with_dtype
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 308, in _from_sequence_of_strings
return cls._from_sequence(scalars, dtype, copy)
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 303, in _from_sequence
return integer_array(scalars, dtype=dtype, copy=copy)
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 111, in integer_array
values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 188, in coerce_to_array
values.dtype))
TypeError: object cannot be converted to an IntegerDtype
Expected Output
I would like the call to read_csv to succeed without having to read in chunks (which seems to have other side effects as well).
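Until the single-call read works, one possible stopgap is to read the column as strings and build the nullable integer column afterwards. This is only a sketch under stated assumptions: the inline sample data stands in for upload.txt (which is not reproduced here), the helper name read_key_column is made up for illustration, and pd.NA assumes pandas 1.0 or later rather than the 0.24.1 shown in the traceback.

```python
import io

import pandas as pd

COLUMN = "tcp.options.mptcp.sendkey"

# Hypothetical stand-in for upload.txt: '|'-separated, with one empty key.
SAMPLE = (
    "frame.number|tcp.options.mptcp.sendkey\n"
    "1|4444444444444444444\n"
    "2|\n"
    "3|18446744073709551615\n"
)


def read_key_column(buf, column):
    """Read `column` as text, then convert it to a nullable UInt64 Series."""
    raw = pd.read_csv(
        buf,
        comment="#",
        sep="|",
        usecols=[column],
        dtype={column: str},  # keep the digits as text, no float round trip
    )
    # Convert with Python ints so 64-bit values keep full precision;
    # empty fields become pd.NA.
    values = [pd.NA if pd.isna(v) else int(v) for v in raw[column]]
    return pd.Series(pd.array(values, dtype="UInt64"), name=column)


keys = read_key_column(io.StringIO(SAMPLE), COLUMN)
print(keys)
```

Going through str and Python int avoids an intermediate float64 column, which cannot represent every 64-bit integer exactly.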
Output of pd.show_versions()
pandas: 0+unknown
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.16.0
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
Issue Analytics
- Created 5 years ago
- Comments:20 (10 by maintainers)
Sorry that I have no time to properly debug this, but I hope I can contribute a little bit of knowledge.
I'm running into the same problem as OP when I read one of the sheets of an .xlsx file (pandas 0.24.2). There are NaN values, but from pandas 0.24 onward that should work when doing .astype(pd.Int16Dtype()), right?
This gave the same problem as OP:
It's ugly, but this seemed to work for me:
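The original snippet is not shown here. As a hypothetical sketch of one workaround of this kind (not necessarily the one used above), and assuming the column comes back as object dtype with missing cells, the values can be converted to float64 first and then cast to the nullable integer dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical column as it might come back from a spreadsheet sheet:
# object dtype, whole numbers with one missing cell.
raw = pd.Series([1, np.nan, 3], dtype=object)

# Step 1: object -> float64; the missing cell stays NaN.
as_float = pd.to_numeric(raw)

# Step 2: float64 -> nullable Int16; NaN becomes a missing value.
as_int16 = as_float.astype(pd.Int16Dtype())

print(as_int16)
```

The float detour is harmless for 16-bit values, but float64 only represents integers exactly up to 2**53, so it is not a safe route for full 64-bit keys such as the one in the original report.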
@alexreg you or anyone else is welcome to submit a PR with a patch, and the core team can review it.