JSON table orient not roundtripping extension types
Code Sample, a copy-pastable example if possible
import pandas

print(pandas._version.get_versions())

df = pandas.DataFrame(
    {
        1: ['v1', 'v2'],
        2: ['v3', 'v4'],
    },
    dtype='string',
)
print(df.info())

# Round-trip the frame through each JSON orient and inspect the resulting dtypes
for orient in ['table', 'split', 'records', 'values']:
    rdf = pandas.read_json(df.to_json(orient=orient), orient=orient)
    print(f'======{orient}======')
    print(rdf.info())
Problem description
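The string dtype is not preserved when a DataFrame is round-tripped through JSON, so DataFrames containing string columns cannot be reused transparently after deserialization. Every orient comes back as object, and with orient='table' the values themselves are lost (0 non-null).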
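Until the dtype survives the round trip, one possible workaround is to re-cast the columns after reading. The snippet below is only a sketch: the helper name roundtrip_as_string is made up for illustration, it assumes every column should be a string column, and it uses string column labels and io.StringIO so it also runs on newer pandas releases than the 1.0.1 reported here.

```python
from io import StringIO

import pandas as pd


def roundtrip_as_string(df, orient="records"):
    """Hypothetical helper: round-trip df through JSON, then re-cast to 'string'."""
    payload = df.to_json(orient=orient)
    # Wrap the payload in StringIO; newer pandas versions warn on literal JSON strings
    restored = pd.read_json(StringIO(payload), orient=orient)
    # The dtype information is lost in the JSON, so restore it explicitly
    return restored.astype("string")


df = pd.DataFrame({"a": ["v1", "v2"], "b": ["v3", "v4"]}, dtype="string")
rdf = roundtrip_as_string(df)
print(rdf.dtypes)
```

This only papers over the problem on the reading side; it does not help when the reader has no prior knowledge of which columns were strings.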
Expected Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 1 2 non-null string
1 2 2 non-null string
dtypes: string(2)
memory usage: 160.0 bytes
None
Actual output
======table======
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 1 0 non-null object
1 2 0 non-null object
dtypes: object(2)
memory usage: 48.0+ bytes
None
======split======
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 1 2 non-null object
1 2 2 non-null object
dtypes: object(2)
memory usage: 48.0+ bytes
None
======records======
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 1 2 non-null object
1 2 2 non-null object
dtypes: object(2)
memory usage: 160.0+ bytes
None
======values======
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 2 non-null object
1 1 2 non-null object
dtypes: object(2)
memory usage: 160.0+ bytes
None
Output of pd.show_versions()
{'dirty': False, 'error': None, 'full-revisionid': '29d6b0232aab9576afa896ff5bab0b994760495a', 'version': '1.0.1'}
Issue Analytics
- State:
- Created 4 years ago
- Reactions:1
- Comments:23 (16 by maintainers)
Top GitHub Comments
I think this has been a pretty enlightening discussion. I for one didn’t realize that we just wrote “string” in to_json for object dtypes. Makes sense from a historical perspective, but actually there are a lot of gaps in that design like this:
And this
Would be tough to manage backwards compat but would support using string on the writing side appropriately and perhaps separate metadata for object dtypes, which would probably solve a lot of issues in round tripping

I agree with this. I have some comments/questions in response to https://github.com/pandas-dev/pandas/issues/20612#issuecomment-983768348 that would be better handled in a separate issue about dates.
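For anyone wanting to check the observation that object columns are written as "string" in the table schema, here is a small sketch using pandas.io.json.build_table_schema. The frame and variable names are illustrative only, and what is emitted for the extension string dtype differs between pandas versions, so that part is printed rather than asserted.

```python
import pandas as pd
from pandas.io.json import build_table_schema

# An object-dtype column holding Python strings
obj_df = pd.DataFrame({"a": ["v1", "v2"]}, dtype=object)
schema = build_table_schema(obj_df, index=False)

# Map each field name to the type recorded in the schema
field_types = {field["name"]: field["type"] for field in schema["fields"]}
print(field_types)

# The same data with the extension string dtype; output is version-dependent
ext_df = obj_df.astype("string")
print(build_table_schema(ext_df, index=False)["fields"])
```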