read_json ignores dictionary as dtype

Code Sample, a copy-pastable example if possible

import pandas as pd

# Requested dtype for each column of the JSON Lines file
dtypes = {
    'created': 'int64',
    'eventType': 'category',
    'severity': 'category',
}

df = pd.read_json('dataset.json', lines=True, dtype=dtypes)
df.info()

This results in:

created          int64
eventType        object
severity         object

Using .astype() on the loaded DataFrame instead converts the types correctly:

df.astype(dtypes).info()
created          int64
eventType        category
severity         category

Problem description

read_json should apply the data types given in the dtype dictionary while loading the DataFrame from disk.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.5.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
machine          : AMD64
processor        : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None

pandas           : 0.25.3
numpy            : 1.17.4
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 41.2.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : 0.4.0
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10.3
IPython          : 7.11.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.1.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.16.0
pytables         : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : None
tables           : None
xarray           : None
xlrd             : 1.2.0
xlwt             : None
xlsxwriter       : None

Issue Analytics

State:
Created 3 years ago
Comments: 5 (1 by maintainers)

Top GitHub Comments

1 reaction

dtrizna commented, Apr 8, 2020

Try it out with the following JSON file:

> type test.json
{"created": 1585669938386, "eventType": "TEST", "severity": "INFO"}
{"created": 1585669938387, "eventType": "TEST2", "severity": "INFO"}

When this dtype keyword argument is passed to read_json, pandas simply ignores it (note that the data types are “object”, not “category” as specified in the dtypes dictionary):

>>> import pandas as pd
>>> dtypes = {
...     'created': 'int64',
...     'eventType' : 'category',
...     'severity' : 'category'
...     }
>>> a = pd.read_json('test.json', lines=True, dtype=dtypes)
>>> a.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
created      2 non-null int64
eventType    2 non-null object
severity     2 non-null object
dtypes: int64(1), object(2)
memory usage: 176.0+ bytes

If we pass the same dtypes dictionary to the DataFrame’s astype method, the setting is applied (note the correct data types):

>>> a.astype(dtypes).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
created      2 non-null int64
eventType    2 non-null category
severity     2 non-null category
dtypes: category(2), int64(1)
memory usage: 332.0 bytes
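
The two-row sample is too small to show the saving: the category frame even reports more bytes, and the "+" on the object frame means the string storage was not counted. A rough sketch on synthetic data may make the difference clearer. The row count and values here are made up for illustration and are not taken from the issue:

```python
import pandas as pd

# Hypothetical data: 100,000 rows holding only two distinct strings.
severity = pd.Series(["INFO", "ERROR"] * 50_000, name="severity")

as_object = severity.astype("object")
as_category = severity.astype("category")

# deep=True also counts the Python string objects, not just the pointers.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

For a low-cardinality column like this, the category version should come out considerably smaller, since each row stores a small integer code instead of a reference to a string.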

This causes problems with large datasets, where reading the data with the correct types from the start would reduce RAM usage drastically.
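
Until read_json honours the dictionary, one possible workaround is to read the file in chunks and convert each chunk with astype, so that the object-typed strings are only held for one chunk at a time. This is a sketch, not a fix confirmed in the issue: read_json_with_dtypes is a hypothetical helper name, the chunk size is arbitrary, and an in-memory buffer stands in for test.json to keep the example self-contained.

```python
import io

import pandas as pd

dtypes = {"created": "int64", "eventType": "category", "severity": "category"}

# Stand-in for test.json; a file path can be passed instead.
sample = io.StringIO(
    '{"created": 1585669938386, "eventType": "TEST", "severity": "INFO"}\n'
    '{"created": 1585669938387, "eventType": "TEST2", "severity": "INFO"}\n'
)


def read_json_with_dtypes(source, dtypes, chunksize=1):
    """Hypothetical helper: read JSON Lines in chunks and cast each chunk."""
    reader = pd.read_json(source, lines=True, chunksize=chunksize)
    chunks = [chunk.astype(dtypes) for chunk in reader]
    combined = pd.concat(chunks, ignore_index=True)
    # Chunks whose category sets differ are concatenated as object,
    # so the cast is applied once more to the combined frame.
    return combined.astype(dtypes)


df = read_json_with_dtypes(sample, dtypes)
df.info()
```

In practice chunksize would be much larger than 1; the tiny value here only forces the two sample rows into separate chunks. Whether this lowers peak memory enough depends on the data, because the final concat still has to hold all chunks at once.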

0 reactions

jake9wi commented, May 19, 2021

Have there been any updates on this? I am still experiencing this issue with pandas 1.2.4.