read_json reads large integers as strings incorrectly if dtype not explicitly mentioned
Code Sample (Original Problem)
import pandas as pd

json_content = """
{
    "1": {
        "tid": "9999999999999998"
    },
    "2": {
        "tid": "9999999999999999"
    },
    "3": {
        "tid": "10000000000000001"
    },
    "4": {
        "tid": "10000000000000002"
    }
}
"""
df = pd.read_json(
    json_content,
    orient='index',      # read as transposed
    convert_axes=False,  # don't convert keys to dates
)
print(df.info())
print(df)
Problem description
I’m using pandas to load JSON data and found some strange behaviour in the read_json function.
In the code above, the integers stored as strings aren’t read correctly, even though there shouldn’t be any overflow: the values are well within the int64 range.
They are read correctly when I explicitly pass the argument dtype=int, but I don’t understand why. What changes when we specify the dtype?
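As a possible stopgap that avoids read_json's type inference altogether, the string ids can be converted with Python's arbitrary-precision int before any DataFrame is built. This is a minimal sketch using only the standard library json module; parse_tids is a hypothetical helper name, and the sketch says nothing about what pandas does internally.

```python
import json

json_content = """
{
    "1": {"tid": "9999999999999998"},
    "2": {"tid": "9999999999999999"},
    "3": {"tid": "10000000000000001"},
    "4": {"tid": "10000000000000002"}
}
"""


def parse_tids(text):
    """Hypothetical helper: map each top-level key to its tid as an exact int."""
    raw = json.loads(text)
    # int() on a digit string is exact; no floating-point step is involved.
    return {key: int(record["tid"]) for key, record in raw.items()}


tids = parse_tids(json_content)
for key, tid in tids.items():
    print(key, tid)
```

The resulting dict of exact integers can then be passed to whatever structure is needed downstream.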
Corresponding SO discussion here:
Current Output
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
tid 4 non-null int64
dtypes: int64(1)
memory usage: 64.0+ bytes
None
tid
1 9999999999999998
2 10000000000000000
3 10000000000000000
4 10000000000000002
Expected Output
The tids should have been stored correctly.
None
tid
1 9999999999999998
2 9999999999999999
3 10000000000000001
4 10000000000000002
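The gap between the current and expected output is consistent with the values passing through a 64-bit float at some point, although nothing in this report confirms that this is what read_json actually does. The sketch below uses plain Python only: it checks that each id fits in int64 but exceeds 2**53 (the limit up to which float64 represents every integer exactly), then converts each string via float and back. The names INT64_MAX, FLOAT64_EXACT_LIMIT and via_float are illustrative.

```python
INT64_MAX = 2**63 - 1
FLOAT64_EXACT_LIMIT = 2**53  # every integer up to this value is exact in float64

tids = [
    "9999999999999998",
    "9999999999999999",
    "10000000000000001",
    "10000000000000002",
]


def via_float(text):
    """Convert a digit string to int by way of a float64, as a lossy parser might."""
    return int(float(text))


for text in tids:
    exact = int(text)
    fits_int64 = exact <= INT64_MAX
    beyond_exact_floats = exact > FLOAT64_EXACT_LIMIT
    print(text, fits_int64, beyond_exact_floats, via_float(text))
```

The two odd ids both come back as 10000000000000000, while the two even ids survive unchanged, which is the same pattern shown under Current Output.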
A minimal pytest example
import pytest
from pandas import DataFrame, read_json


@pytest.mark.parametrize('dtype', ['int'])
def test_large_ints_from_json_strings(dtype):
    # GH 20608
    df1 = DataFrame([9999999999999999, 10000000000000001], columns=['tid'])
    # Serialize the ids as strings, then read them back
    df_temp = df1.copy().astype(str)
    df2 = read_json(df_temp.to_json())
    assert (df1 == df2).all().all()  # currently False
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-37-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IN
LOCALE: en_IN.ISO8859-1

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.14.0
scipy: None
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Issue Analytics
- State:
- Created 5 years ago
- Comments:17 (7 by maintainers)
Sure! I’m new to the pandas source code, but I can try 😃