read_csv fails with `TypeError: object cannot be converted to an IntegerDtype` yet succeeds when reading chunks

Code Sample, a copy-pastable example if possible

Download this file upload.txt

import pandas as pd

# I attached the file in the github issue
filename = "upload.txt"
# this field is coded on 64 bits so 'UInt64' looks perfect.
column = "tcp.options.mptcp.sendkey"

with open(filename) as fd:

    print("READ CHUNK BY CHUNK")

    res = pd.read_csv(
            fd,
            comment='#',
            sep='|',
            dtype={column: 'UInt64' },
            usecols=[column],
            chunksize=1
    )
    for chunk in res:
        print(chunk)

    fd.seek(0) # rewind

    print("READ THE WHOLE FILE AT ONCE ")
    res = pd.read_csv(
            fd,
            comment='#',
            sep='|',
            usecols=[column],
            dtype={"tcp.options.mptcp.sendkey": 'UInt64' }
    )
    print(res)





If I read in chunks, read_csv succeeds; if I try to read the whole column at once, I get:

Traceback (most recent call last):
  File "test2.py", line 34, in <module>
    dtype={"tcp.options.mptcp.sendkey": 'UInt64' }
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 435, in _read
    data = parser.read(nrows)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 1139, in read
    ret = self._engine.read(nrows)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/io/parsers.py", line 1995, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 900, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 992, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1124, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1155, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1235, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 308, in _from_sequence_of_strings
    return cls._from_sequence(scalars, dtype, copy)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 303, in _from_sequence
    return integer_array(scalars, dtype=dtype, copy=copy)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 111, in integer_array
    values, mask = coerce_to_array(values, dtype=dtype, copy=copy)
  File "/nix/store/mhiszrb8cpicjkzgraq796asj2sxpjch-python3.7-pandas-0.24.1/lib/python3.7/site-packages/pandas/core/arrays/integer.py", line 188, in coerce_to_array
    values.dtype))
TypeError: object cannot be converted to an IntegerDtype


Expected Output

I would like the call to read_csv to succeed without having to read in chunks (which seems to have other side effects as well).
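
One possible way to get the whole column in a single call is to read it as text and build the nullable array by hand. This is only a sketch under assumptions: the inline sample data is a hypothetical stand-in for upload.txt (whose contents are not shown here), the extra `tcp.stream` column is invented for illustration, and the keys are assumed to be decimal integers with empty fields for missing values. It does not fix the underlying parser behaviour.

```python
import io

import numpy as np
import pandas as pd

column = "tcp.options.mptcp.sendkey"

# Hypothetical stand-in for upload.txt: '|'-separated, one empty key.
sample = io.StringIO(
    "tcp.stream|tcp.options.mptcp.sendkey\n"
    "0|\n"
    "1|18446744073709551615\n"
    "2|4444\n"
)

# Read the column as plain strings so no integer conversion happens in the parser.
raw = pd.read_csv(
    sample,
    comment="#",
    sep="|",
    usecols=[column],
    dtype={column: str},
)[column]

# Build the values and the missing-value mask explicitly.
# Passing dtype=np.uint64 avoids a detour through float64, which could not
# hold every 64-bit key exactly.
mask = raw.isna().to_numpy()
values = np.array(
    [0 if missing else int(text) for text, missing in zip(raw, mask)],
    dtype=np.uint64,
)
sendkey = pd.Series(pd.arrays.IntegerArray(values, mask), index=raw.index, name=column)

print(sendkey.dtype)
print(sendkey)
```

The resulting Series has dtype `UInt64` and keeps the empty field as a missing value, without reading the file in chunks.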

Output of pd.show_versions()

I am using v0.23.4 with a patch from master to fix some other bug.

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.19.0
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8

pandas: 0+unknown
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.16.0
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 20 (10 by maintainers)

Top GitHub Comments

NumesSanguis commented, Jun 25, 2020 (2 reactions)

Sorry that I have no time to properly debug this, but I hope I can contribute a little bit of knowledge.

I’m running into the same problem as OP when I read one of the sheets of an .xlsx file (pandas 0.24.2). There are NaN values, but from pandas 0.24 onward that should work when doing .astype(pd.Int16Dtype()), right?

This gave the same problem as OP:

df_sheet.age = df_sheet.age.astype(pd.Int16Dtype())

It is ugly, but this seemed to work for me:

df_sheet.age = df_sheet.age.astype('float')  # first convert to float before int
df_sheet.age = df_sheet.age.astype(pd.Int16Dtype())
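
The two-step cast above can be sketched on its own as follows. The `ages` Series is made-up data standing in for `df_sheet.age`, and whether the direct cast from an object column fails depends on the pandas version, so treat this as an illustration of the workaround and not a statement about current behaviour. The last lines show why the float detour is unsuitable for 64-bit keys like the one in the original report: float64 cannot represent every 64-bit integer exactly.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df_sheet.age: an object column with a missing value.
ages = pd.Series([31, np.nan, 45], dtype=object, name="age")

# Step 1: object -> float64 (NaN stays NaN).
as_float = ages.astype("float")

# Step 2: float64 -> nullable Int16 (NaN becomes a missing value).
as_int = as_float.astype(pd.Int16Dtype())
print(as_int)

# Caveat for large values: a round trip through float changes the number.
big = 2**64 - 1
lossy = int(float(big))
print(big, lossy, big == lossy)
```

For small integers such as ages the detour is harmless; for values near the top of the 64-bit range it silently changes the value.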

jreback commented, Oct 9, 2021 (1 reaction)

@alexreg you or anyone else is welcome to submit a PR with a patch, and the core team can review it.
