JSON table orient not roundtripping extension types

Code Sample, a copy-pastable example if possible

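import pandas

# Print the installed pandas version information
print(pandas._version.get_versions())

# Two columns that use the "string" extension dtype
df = pandas.DataFrame(
    {
        1: ['v1', 'v2'],
        2: ['v3', 'v4']
    }, dtype='string'
)

print(df.info())

# Round-trip through JSON for each orient and inspect the resulting dtypes
for orient in ['table', 'split', 'records', 'values']:
    rdf = pandas.read_json(df.to_json(orient=orient), orient=orient)
    print(f'======{orient}======')
    print(rdf.info())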

Problem description

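The string dtype is not preserved in a round trip through JSON serialization: every orient returns object columns, so DataFrames containing string columns cannot be reused transparently. With orient='table' the values are lost as well (0 non-null).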

Expected Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   1       2 non-null      string
 1   2       2 non-null      string
dtypes: string(2)
memory usage: 160.0 bytes
None

Actual output

======table======
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   1       0 non-null      object
 1   2       0 non-null      object
dtypes: object(2)
memory usage: 48.0+ bytes
None
======split======
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   1       2 non-null      object
 1   2       2 non-null      object
dtypes: object(2)
memory usage: 48.0+ bytes
None
======records======
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   1       2 non-null      object
 1   2       2 non-null      object
dtypes: object(2)
memory usage: 160.0+ bytes
None
======values======
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       2 non-null      object
 1   1       2 non-null      object
dtypes: object(2)
memory usage: 160.0+ bytes
None

Output of `pd.show_versions()`

{'dirty': False, 'error': None, 'full-revisionid': '29d6b0232aab9576afa896ff5bab0b994760495a', 'version': '1.0.1'}
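
One possible workaround, which is not proposed in the issue itself, is to record the dtypes before serializing and re-apply them after reading. The sketch below assumes string column names and the "split" orient to keep the round trip simple; `saved_dtypes` and `restored` are illustrative names, and the JSON text is wrapped in `io.StringIO` because passing a literal JSON string to `read_json` is deprecated in newer pandas versions.

```python
import io

import pandas as pd

df = pd.DataFrame({"a": ["v1", "v2"], "b": ["v3", "v4"]}, dtype="string")

# Remember the dtypes before serializing; the "split" payload
# does not carry them.
saved_dtypes = {col: str(dtype) for col, dtype in df.dtypes.items()}

# Round trip through JSON text.
text = df.to_json(orient="split")
rdf = pd.read_json(io.StringIO(text), orient="split")

# Re-apply the remembered dtypes to recover the "string" extension dtype.
restored = rdf.astype(saved_dtypes)
print(restored.dtypes)
```

This keeps the dtype information outside the JSON payload, so it only helps when the writer and the reader can share `saved_dtypes` by some other route.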


Top GitHub Comments

WillAyd commented, Feb 19, 2020

I think this has been a pretty enlightening discussion. I for one didn’t realize that we just wrote “string” in to_json for object dtypes. Makes sense from a historical perspective, but actually there are a lot of gaps in that design like this:

# Writes string for the dtype, but doesn't write strings in JSON
>>> pd.Series([1], dtype=object).to_json(orient="table")
'{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"values","type":"string"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"values":1}]}'

And this

# Same as above, but doesn't preserve key anyway
>>> pd.Series([{1: 2}]).to_json(orient="table")
'{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"values","type":"string"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"values":{"1":2}}]}'

It would be tough to manage backwards compatibility, but I would support using string appropriately on the writing side, and perhaps separate metadata for object dtypes, which would probably solve a lot of round-tripping issues.
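
The mismatch described in this comment can be checked programmatically by parsing the "table" payload and comparing the declared field type with the value actually written. This is only a small sketch; `declared_type` and `written_value` are illustrative names, not part of pandas.

```python
import json

import pandas as pd

# An object-dtype Series that holds an integer, not a string.
s = pd.Series([1], dtype=object)

payload = json.loads(s.to_json(orient="table"))

# Declared type of the "values" field in the embedded schema.
declared = {f["name"]: f["type"] for f in payload["schema"]["fields"]}
declared_type = declared["values"]

# The JSON value actually written for the first row.
written_value = payload["data"][0]["values"]

print(declared_type, type(written_value).__name__)
```

A reader that trusts the schema alone would treat the column as text even though the data row holds a number, which is the gap the comment points at.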

Dr-Irv commented, Dec 1, 2021

Maybe you can create a new issue specifically about supporting “date” for JSON table schema? (because this issue is still a bit more general about other extension types as well. Unless you would like to tackle the general issue for extension types, instead of first focusing on datetime.date?)

I agree with this. I have some comments/questions in response to https://github.com/pandas-dev/pandas/issues/20612#issuecomment-983768348 that would be better handled in a separate issue about dates.
