JSON table orient not roundtripping extension types

Code Sample, a copy-pastable example if possible

import pandas
print(pandas._version.get_versions())
df = pandas.DataFrame(
    {
        1: ['v1','v2'],
        2: ['v3','v4']
    },dtype='string'
)

print(df.info())
for orient in ['table','split','records','values']:
    rdf = pandas.read_json(df.to_json(orient=orient), orient=orient)
    print(f'======{orient}======')
    print(rdf.info())

Problem description

The string dtype is not preserved when a DataFrame is round-tripped through JSON (the columns come back as object), so DataFrames containing string columns cannot be reused transparently.

Expected Output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   1       2 non-null      string
 1   2       2 non-null      string
dtypes: string(2)
memory usage: 160.0 bytes
None

Actual output

======table======
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   1       0 non-null      object
 1   2       0 non-null      object
dtypes: object(2)
memory usage: 48.0+ bytes
None
======split======
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   1       2 non-null      object
 1   2       2 non-null      object
dtypes: object(2)
memory usage: 48.0+ bytes
None
======records======
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   1       2 non-null      object
 1   2       2 non-null      object
dtypes: object(2)
memory usage: 160.0+ bytes
None
======values======
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       2 non-null      object
 1   1       2 non-null      object
dtypes: object(2)
memory usage: 160.0+ bytes
None

Output of pd.show_versions()

{'dirty': False, 'error': None, 'full-revisionid': '29d6b0232aab9576afa896ff5bab0b994760495a', 'version': '1.0.1'}
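
Until the dtype survives the round trip, one possible workaround is to re-cast the columns after reading. The sketch below is not taken from the issue thread: it assumes string column names ("a" and "b" are illustrative) rather than the integer names used in the report, wraps the JSON text in io.StringIO so it also runs on newer pandas versions, and uses only DataFrame.to_json, pandas.read_json, and DataFrame.astype.

```python
import io

import pandas as pd

# Illustrative frame; the column names "a" and "b" are assumptions, not from the report.
df = pd.DataFrame({"a": ["v1", "v2"], "b": ["v3", "v4"]}, dtype="string")

# Serialize, then read back. Without further help the string dtype is lost here.
payload = df.to_json(orient="split")
restored = pd.read_json(io.StringIO(payload), orient="split")

# Workaround: explicitly cast every column back to the string extension dtype.
restored = restored.astype("string")

print(restored.dtypes)
```

This only helps when the caller already knows which columns were strings; it does not recover the dtype from the JSON itself.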

Issue Analytics

  • State: open
  • Created 4 years ago
  • Reactions: 1
  • Comments: 23 (16 by maintainers)

Top GitHub Comments

1 reaction
WillAyd commented, Feb 19, 2020

I think this has been a pretty enlightening discussion. I for one didn’t realize that we just wrote “string” in to_json for object dtypes. Makes sense from a historical perspective, but actually there are a lot of gaps in that design like this:

# Writes string for the dtype, but doesn't write strings in JSON
>>> pd.Series([1], dtype=object).to_json(orient="table")
'{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"values","type":"string"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"values":1}]}'

And this

# Same as above, but doesn't preserve key anyway
>>> pd.Series([{1: 2}]).to_json(orient="table")
'{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"values","type":"string"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"values":{"1":2}}]}'

It would be tough to manage backwards compatibility, but I would support writing "string" only where it is appropriate and perhaps adding separate metadata for object dtypes, which would probably solve a lot of round-tripping issues.
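
The gap described above can be checked without pandas by parsing the first JSON payload quoted in this comment and comparing each value against the type its schema field declares. This is only a sketch: the PY_TYPES mapping is a hypothetical helper covering just the two types that appear in that payload, not part of any pandas API.

```python
import json

# JSON emitted by pd.Series([1], dtype=object).to_json(orient="table"), as quoted above.
payload = (
    '{"schema":{"fields":[{"name":"index","type":"integer"},'
    '{"name":"values","type":"string"}],"primaryKey":["index"],'
    '"pandas_version":"0.20.0"},"data":[{"index":0,"values":1}]}'
)
doc = json.loads(payload)

# Field name -> type declared in the table schema.
declared = {field["name"]: field["type"] for field in doc["schema"]["fields"]}

# Hypothetical mapping from schema type names to Python types (only those used here).
PY_TYPES = {"string": str, "integer": int}

# Collect every value whose JSON type disagrees with its declared schema type.
mismatches = [
    (name, declared[name], type(value).__name__)
    for row in doc["data"]
    for name, value in row.items()
    if not isinstance(value, PY_TYPES[declared[name]])
]
print(mismatches)  # [('values', 'string', 'int')]
```

The "values" field is declared as "string" while the data holds the integer 1, which is why a reader cannot rely on the schema alone to restore the original dtype.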

0 reactions
Dr-Irv commented, Dec 1, 2021

Maybe you can create a new issue specifically about supporting “date” for JSON table schema? (because this issue is still a bit more general about other extension types as well. Unless you would like to tackle the general issue for extension types, instead of first focusing on datetime.date?)

I agree with this. I have some comments/questions in response to https://github.com/pandas-dev/pandas/issues/20612#issuecomment-983768348 that would be better handled in a separate issue about dates.
