read_json reads large integers as strings incorrectly if dtype not explicitly mentioned

Code Sample (Original Problem)

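from io import StringIO

import pandas as pd

json_content = """
{
    "1": {
        "tid": "9999999999999998"
    },
    "2": {
        "tid": "9999999999999999"
    },
    "3": {
        "tid": "10000000000000001"
    },
    "4": {
        "tid": "10000000000000002"
    }
}
"""
# Wrap the literal in StringIO: passing a raw JSON string is deprecated in recent pandas versions.
df = pd.read_json(StringIO(json_content),
                  orient='index',      # read as transposed: keys "1".."4" become the rows
                  convert_axes=False,  # don't convert keys to dates
                  )
print(df.info())
print(df)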

Problem description

I’m using pandas to load JSON data and ran into some strange behaviour in the read_json function. In the code above, integers stored as strings are not read back correctly, even though there should be no overflow: the values are well within the int64 range.

The values are read correctly when I explicitly pass dtype=int, but I don’t understand why. What changes when we specify the dtype?
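
The workaround mentioned above can be sketched as follows. This is only an illustration, not a fix for the underlying behaviour: it assumes a pandas version in which read_json accepts a file-like object and a per-column dtype mapping, and it spells the type as "int64" rather than int so that the width does not depend on the platform. The shortened json_content is made up for the example.

```python
from io import StringIO

import pandas as pd

# Two of the values from the report, still stored as JSON strings.
json_content = """
{
    "1": {"tid": "9999999999999999"},
    "2": {"tid": "10000000000000001"}
}
"""

df = pd.read_json(
    StringIO(json_content),
    orient="index",
    convert_axes=False,
    dtype={"tid": "int64"},  # assumption: request the target type for this column explicitly
)
print(df["tid"].tolist())
```

With the dtype given, the strings are expected to be converted to integers directly, which matches the observation that dtype=int avoids the problem.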

Corresponding SO discussion here:

Current Output

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
tid    4 non-null int64
dtypes: int64(1)
memory usage: 64.0+ bytes
None
                 tid
1   9999999999999998
2  10000000000000000
3  10000000000000000
4  10000000000000002
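
The two wrong values are consistent with the strings passing through a 64-bit float before becoming integers. That is an assumption about what read_json does when dtype is left unset, not something confirmed in this report, but the arithmetic itself can be checked in plain Python: above 2**53 a float64 can represent only every second integer, so the odd values get rounded to a neighbour.

```python
values = ["9999999999999998", "9999999999999999",
          "10000000000000001", "10000000000000002"]

# Parse each string as a float first, then truncate to int (lossy above 2**53).
via_float = [int(float(v)) for v in values]
# Parse each string directly as an int (exact for any size).
direct = [int(v) for v in values]

for text, lossy, exact in zip(values, via_float, direct):
    status = "ok" if lossy == exact else "changed"
    print(text, lossy, exact, status)
```

The float route turns both 9999999999999999 and 10000000000000001 into 10000000000000000, the same values shown in the output above, while the even neighbours survive unchanged.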

Expected Output

The tid values should be stored exactly as given.

None
                 tid
1   9999999999999998
2   9999999999999999
3  10000000000000001
4  10000000000000002

A minimal pytest example

from io import StringIO

from pandas import DataFrame, read_json


def test_large_ints_from_json_strings():
    # GH 20608
    df1 = DataFrame([9999999999999999, 10000000000000001], columns=['tid'])
    # Serialize the integers as JSON strings, then read them back.
    df_temp = df1.astype(str)
    df2 = read_json(StringIO(df_temp.to_json()))
    assert (df1 == df2).all().all()  # currently fails

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IN
LOCALE: en_IN.ISO8859-1

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.0
scipy: None
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 17 (7 by maintainers)

Top GitHub Comments

1 reaction
Udayraj123 commented, Apr 4, 2018

Sure! I’m new to the pandas source code, but I can try 😃

0 reactions
ranieri-negri commented, Jul 15, 2022

up
