read_json ignores dictionary as dtype

Code Sample, a copy-pastable example if possible

import pandas as pd

dtypes = {
    'created': 'int64',
    'eventType' : 'category',
    'severity' : 'category'
    }

df = pd.read_json('dataset.json', lines=True, dtype=dtypes)
df.info()

This results in:

created          int64
eventType        object
severity         object

Using .astype() instead converts types correctly:

df.astype(dtypes).info()
created          int64
eventType        category
severity         category
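Until read_json applies the dictionary itself, the astype() workaround can be wrapped in a small helper. The following is only a sketch: the function name read_json_lines_typed is hypothetical (not part of pandas), and the in-memory sample stands in for dataset.json.

```python
import io

import pandas as pd

DTYPES = {"created": "int64", "eventType": "category", "severity": "category"}


def read_json_lines_typed(source, dtypes):
    """Hypothetical helper: read line-delimited JSON, then enforce dtypes."""
    df = pd.read_json(source, lines=True)
    # Cast only the columns that are actually present in the file.
    present = {col: dt for col, dt in dtypes.items() if col in df.columns}
    return df.astype(present)


# In-memory stand-in for a JSON Lines file (an assumption for illustration).
sample = io.StringIO(
    '{"created": 1585669938386, "eventType": "TEST", "severity": "INFO"}\n'
    '{"created": 1585669938387, "eventType": "TEST2", "severity": "INFO"}\n'
)
typed = read_json_lines_typed(sample, DTYPES)
print(typed.dtypes)
```

Note that this still materializes the columns with their inferred types first, so peak memory during loading is not reduced; only the resulting DataFrame is smaller.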

Problem description

read_json should apply the data types given in the dtype dictionary while loading the DataFrame from disk.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.5.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
machine          : AMD64
processor        : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None

pandas           : 0.25.3
numpy            : 1.17.4
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 41.2.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : 0.4.0
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.10.3
IPython          : 7.11.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.1.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.16.0
pytables         : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : None
tables           : None
xarray           : None
xlrd             : 1.2.0
xlwt             : None
xlsxwriter       : None

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

1 reaction
dtrizna commented, Apr 8, 2020

Try it out with the following JSON file:

> type test.json
{"created": 1585669938386, "eventType": "TEST", "severity": "INFO"}
{"created": 1585669938387, "eventType": "TEST2", "severity": "INFO"}

When this dtype keyword argument is passed to read_json, pandas simply ignores it (note that the data types are “object”, not “category” as specified in the dtypes dictionary).

>>> import pandas as pd
>>> dtypes = {
...     'created': 'int64',
...     'eventType' : 'category',
...     'severity' : 'category'
...     }
>>> a = pd.read_json('test.json', lines=True, dtype=dtypes)
>>> a.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
created      2 non-null int64
eventType    2 non-null object
severity     2 non-null object
dtypes: int64(1), object(2)
memory usage: 176.0+ bytes

If we use the same dtypes dictionary with the DataFrame’s astype method, the setting is applied (note the correct data types):

>>> a.astype(dtypes).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
created      2 non-null int64
eventType    2 non-null category
severity     2 non-null category
dtypes: category(2), int64(1)
memory usage: 332.0 bytes

This causes problems with large datasets, where reading data with the correct types drastically decreases RAM usage.
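The memory difference can be illustrated with a rough sketch. The data below is synthetic (an assumption for illustration, not the reporter's dataset), with a few distinct values repeated many times, which is the case where “category” helps most.

```python
import pandas as pd

n = 100_000

# Synthetic low-cardinality columns, stored explicitly as Python objects.
df = pd.DataFrame(
    {
        "eventType": ["TEST", "TEST2"] * (n // 2),
        "severity": ["INFO"] * n,
    },
    dtype=object,
)

# deep=True counts the memory held by the string objects themselves.
object_bytes = int(df.memory_usage(deep=True).sum())
category_bytes = int(df.astype("category").memory_usage(deep=True).sum())

print("object:  ", object_bytes, "bytes")
print("category:", category_bytes, "bytes")
```

The exact figures depend on the pandas and Python versions, so none are quoted here; the point is only that the categorical frame stores small integer codes plus one copy of each distinct string.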

0 reactions
jake9wi commented, May 19, 2021

Have there been any updates on this? I am experiencing this issue with pandas 1.2.4.
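A quick way to check whether a particular pandas installation is affected is sketched below. The function name is hypothetical, and the in-memory sample mirrors the test.json file from the earlier comment.

```python
import io

import pandas as pd


def read_json_honors_category_dtype():
    """Hypothetical check: True if read_json applies a 'category' dtype directly."""
    sample = io.StringIO(
        '{"created": 1585669938386, "eventType": "TEST", "severity": "INFO"}\n'
        '{"created": 1585669938387, "eventType": "TEST2", "severity": "INFO"}\n'
    )
    df = pd.read_json(sample, lines=True, dtype={"eventType": "category"})
    return str(df["eventType"].dtype) == "category"


honored = read_json_honors_category_dtype()
print(pd.__version__, honored)
```

If this prints False, the astype() workaround is still needed on that version.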
