read_json reads large integers as strings incorrectly if dtype not explicitly mentioned

See original GitHub issue

Code Sample (Original Problem)

import pandas as pd
from io import StringIO

json_content = """
{
    "1": {
        "tid": "9999999999999998"
    },
    "2": {
        "tid": "9999999999999999"
    },
    "3": {
        "tid": "10000000000000001"
    },
    "4": {
        "tid": "10000000000000002"
    }
}
"""
# Wrap the literal in StringIO: newer pandas versions deprecate passing raw JSON strings.
df = pd.read_json(StringIO(json_content),
                  orient='index',      # read as transposed
                  convert_axes=False,  # don't convert keys to dates
                  )
print(df.info())
print(df)

Problem description

I’m using pandas to load JSON data and found some strange behaviour in the read_json function. In the code above, integers stored as strings aren’t read correctly, even though there should be no overflow: the values are well within the int64 range.

The values are read correctly when dtype=int is specified explicitly, but I don’t understand why. What changes when we specify the dtype?

Corresponding SO discussion here:

Current Output

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
tid    4 non-null int64
dtypes: int64(1)
memory usage: 64.0+ bytes
None
                 tid
1   9999999999999998
2  10000000000000000
3  10000000000000000
4  10000000000000002
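
The rounded values in this output are what a round trip through a 64-bit float would produce: float64 has a 53-bit significand, so above 2**53 (9007199254740992) only every second integer is representable. The sketch below reproduces the same numbers with plain Python floats. That read_json passes the strings through float64 before int64 when no dtype is given is an assumption here; it is consistent with the output above but not confirmed by the text.

```python
# Assumption: without an explicit dtype, the string values pass through float64
# before being cast to int64. Plain Python floats are IEEE 754 doubles, so they
# show the same rounding.
tids = ["9999999999999998", "9999999999999999",
        "10000000000000001", "10000000000000002"]


def via_float(text):
    """Parse a decimal string through a 64-bit float, then back to int."""
    return int(float(text))


round_tripped = [via_float(t) for t in tids]
exact = [int(t) for t in tids]

for original, rounded in zip(tids, round_tripped):
    print(original, "->", rounded)

# Every value here exceeds 2**53, beyond which float64 cannot hold every integer.
above_limit = all(value > 2**53 for value in exact)
```

The two even values survive the round trip unchanged, while the two odd values both land on 10000000000000000, matching rows 2 and 3 above.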

Expected Output

The tid values should have been stored exactly as given.

None
                 tid
1   9999999999999998
2   9999999999999999
3  10000000000000001
4  10000000000000002
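
One way to arrive at these expected values is to skip type inference and convert each string with Python's arbitrary-precision int. The following is a minimal sketch using only the standard library; the helper name parse_tids is hypothetical and not part of pandas.

```python
import json

json_content = """
{
    "1": {"tid": "9999999999999998"},
    "2": {"tid": "9999999999999999"},
    "3": {"tid": "10000000000000001"},
    "4": {"tid": "10000000000000002"}
}
"""


def parse_tids(text):
    """Hypothetical helper: map each record key to its tid as an exact int."""
    records = json.loads(text)
    # int() parses the decimal string directly, with no float intermediate.
    return {key: int(record["tid"]) for key, record in records.items()}


tids = parse_tids(json_content)
for key, tid in tids.items():
    print(key, tid)
```

All four values fit in int64, whose maximum is 9223372036854775807, so the resulting dict could then be handed to a DataFrame constructor without loss.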

A minimal pytest example

from io import StringIO

import pytest
from pandas import DataFrame, read_json

@pytest.mark.parametrize('dtype', ['int'])
def test_large_ints_from_json_strings(dtype):
    # GH 20608
    # dtype is parametrized but not passed to read_json,
    # so the default type inference is what gets tested.
    df1 = DataFrame([9999999999999999, 10000000000000001], columns=['tid'])
    df_temp = df1.copy().astype(str)
    df2 = read_json(StringIO(df_temp.to_json()))
    assert (df1 == df2).all().all()  # currently False

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IN
LOCALE: en_IN.ISO8859-1

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.0
scipy: None
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

State:
Created 5 years ago
Comments: 17 (7 by maintainers)

Top GitHub Comments

Udayraj123 commented, Apr 4, 2018

Sure! I’m new to the pandas source code, but I can try 😃

ranieri-negri commented, Jul 15, 2022
