Kartothek metadata does not distinguish int64 and Int64
In [1]: from functools import partial
...:
...: import pandas as pd
...: import storefact
...: from kartothek.core.common_metadata import empty_dataframe_from_schema
...: from kartothek.io.dask.dataframe import read_dataset_as_ddf
...: from kartothek.io.eager import read_table, store_dataframes_as_dataset
...:
...: store = partial(storefact.get_store_from_url, "hfs:///tmp")
...:
...: df = pd.DataFrame(
...: {
...: "i": [0, 1],
...: "I": [0, pd.NA],
...: "o": ["a", None],
...: "s": ["b", pd.NA],
...: "b": [True, False],
...: "B": [True, pd.NA],
...: }
...: )
...: df["I"] = df["I"].astype("Int64")
...: df["s"] = df["s"].astype("string")
...: df["B"] = df["B"].astype("boolean")
...:
...: df.dtypes
Out[1]:
i int64
I Int64
o object
s string
b bool
B boolean
dtype: object
In [2]: dm = store_dataframes_as_dataset(
...: dfs=[df], dataset_uuid="dataset", store=store, overwrite=True
...: )
...:
...: df_meta = empty_dataframe_from_schema(schema=dm.table_meta["table"])
...: df_meta.dtypes
Out[2]:
B boolean
I int64
b bool
i int64
o object
s string
dtype: object
In [3]: ddf = read_dataset_as_ddf(dataset_uuid="dataset", store=store)
...: ddf._meta.dtypes
Out[3]:
B boolean
I int64
b bool
i int64
o object
s string
dtype: object
In [4]: ddf.compute().dtypes
Out[4]:
B boolean
I Int64
b bool
i int64
o object
s string
dtype: object
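The disagreement between Out[3] (the dask meta) and Out[4] (the computed frame) can also be detected programmatically. The following is a minimal sketch using only pandas; `dtype_mismatches` is a hypothetical helper, not part of kartothek or dask, and the two small frames are stand-ins for `ddf._meta` and `ddf.compute()`:

```python
import pandas as pd


def dtype_mismatches(meta: pd.DataFrame, actual: pd.DataFrame) -> dict:
    """Map column name -> (meta dtype, actual dtype) for columns whose dtypes differ."""
    return {
        col: (str(meta[col].dtype), str(actual[col].dtype))
        for col in actual.columns
        # Compare the string names so numpy and extension dtypes are handled alike.
        if col in meta.columns and str(meta[col].dtype) != str(actual[col].dtype)
    }


# Stand-in for ddf._meta: an empty frame carrying the dtypes reported in Out[3].
meta = pd.DataFrame(
    {
        "I": pd.Series([], dtype="int64"),
        "i": pd.Series([], dtype="int64"),
    }
)

# Stand-in for ddf.compute(): the dtypes reported in Out[4].
actual = pd.DataFrame(
    {
        "I": pd.array([0, pd.NA], dtype="Int64"),
        "i": pd.Series([0, 1], dtype="int64"),
    }
)

print(dtype_mismatches(meta, actual))  # {'I': ('int64', 'Int64')}
```

With the real objects from the session above, the call would be `dtype_mismatches(ddf._meta, ddf.compute())`.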
Copy pastable below:
from functools import partial
import pandas as pd
import storefact
from kartothek.core.common_metadata import empty_dataframe_from_schema
from kartothek.io.dask.dataframe import read_dataset_as_ddf
from kartothek.io.eager import read_table, store_dataframes_as_dataset
store = partial(storefact.get_store_from_url, "hfs:///tmp")
df = pd.DataFrame(
{
"i": [0, 1],
"I": [0, pd.NA],
"o": ["a", None],
"s": ["b", pd.NA],
"b": [True, False],
"B": [True, pd.NA],
}
)
df["I"] = df["I"].astype("Int64")
df["s"] = df["s"].astype("string")
df["B"] = df["B"].astype("boolean")
df.dtypes
dm = store_dataframes_as_dataset(
dfs=[df], dataset_uuid="dataset", store=store, overwrite=True
)
df_meta = empty_dataframe_from_schema(schema=dm.table_meta["table"])
df_meta.dtypes
ddf = read_dataset_as_ddf(dataset_uuid="dataset", store=store)
ddf._meta.dtypes
ddf.compute().dtypes
The dtype is incorrectly stored in the kartothek metadata:
In [12]: store().get("dataset/table/_common_metadata")
Out[12]: b'PAR1\x15\x04\x19|5\x00\x18\x06schema\x15\x0c\x00\x15\x00%\x02\x18\x01B\x00\x15\x04%\x02\x18\x01I\x00\x15\x00%\x02\x18\x01b\x00\x15\x04%\x02\x18\x01i\x00\x15\x0c%\x02\x18\x01o%\x00L\x1c\x00\x00\x00\x15\x0c%\x02\x18\x01s%\x00L\x1c\x00\x00\x00\x16\x00\x19\x0c\x19,\x18\x06pandas\x18\x99\x07{"column_indexes": [{"field_name": null, "metadata": {"encoding": "UTF-8"}, "name": null, "numpy_type": "object", "pandas_type": "unicode"}], "columns": [{"field_name": "B", "metadata": null, "name": "B", "numpy_type": "boolean", "pandas_type": "bool"}, {"field_name": "I", "metadata": null, "name": "I", "numpy_type": "int64", "pandas_type": "int64"}, {"field_name": "b", "metadata": null, "name": "b", "numpy_type": "bool", "pandas_type": "bool"}, {"field_name": "i", "metadata": null, "name": "i", "numpy_type": "int64", "pandas_type": "int64"}, {"field_name": "o", "metadata": null, "name": "o", "numpy_type": "object", "pandas_type": "unicode"}, {"field_name": "s", "metadata": null, "name": "s", "numpy_type": "string", "pandas_type": "unicode"}], "creator": {"library": "pyarrow", "version": "1.0.1"}, "index_columns": [{"kind": "range", "name": null, "start": 0, "step": 1, "stop": 2}], "pandas_version": 
"1.1.3"}\x00\x18\x0cARROW:schema\x18\xe0\r/////yAFAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAANADAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAAJkDAAB7ImNvbHVtbl9pbmRleGVzIjogW3siZmllbGRfbmFtZSI6IG51bGwsICJtZXRhZGF0YSI6IHsiZW5jb2RpbmciOiAiVVRGLTgifSwgIm5hbWUiOiBudWxsLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSJ9XSwgImNvbHVtbnMiOiBbeyJmaWVsZF9uYW1lIjogIkIiLCAibWV0YWRhdGEiOiBudWxsLCAibmFtZSI6ICJCIiwgIm51bXB5X3R5cGUiOiAiYm9vbGVhbiIsICJwYW5kYXNfdHlwZSI6ICJib29sIn0sIHsiZmllbGRfbmFtZSI6ICJJIiwgIm1ldGFkYXRhIjogbnVsbCwgIm5hbWUiOiAiSSIsICJudW1weV90eXBlIjogImludDY0IiwgInBhbmRhc190eXBlIjogImludDY0In0sIHsiZmllbGRfbmFtZSI6ICJiIiwgIm1ldGFkYXRhIjogbnVsbCwgIm5hbWUiOiAiYiIsICJudW1weV90eXBlIjogImJvb2wiLCAicGFuZGFzX3R5cGUiOiAiYm9vbCJ9LCB7ImZpZWxkX25hbWUiOiAiaSIsICJtZXRhZGF0YSI6IG51bGwsICJuYW1lIjogImkiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCJ9LCB7ImZpZWxkX25hbWUiOiAibyIsICJtZXRhZGF0YSI6IG51bGwsICJuYW1lIjogIm8iLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSJ9LCB7ImZpZWxkX25hbWUiOiAicyIsICJtZXRhZGF0YSI6IG51bGwsICJuYW1lIjogInMiLCAibnVtcHlfdHlwZSI6ICJzdHJpbmciLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSJ9XSwgImNyZWF0b3IiOiB7ImxpYnJhcnkiOiAicHlhcnJvdyIsICJ2ZXJzaW9uIjogIjEuMC4xIn0sICJpbmRleF9jb2x1bW5zIjogW3sia2luZCI6ICJyYW5nZSIsICJuYW1lIjogbnVsbCwgInN0YXJ0IjogMCwgInN0ZXAiOiAxLCAic3RvcCI6IDJ9XSwgInBhbmRhc192ZXJzaW9uIjogIjEuMS4zIn0AAAAGAAAA9AAAAKwAAACEAAAAVAAAACwAAAAEAAAANP///wAAAQUUAAAADAAAAAQAAAAAAAAAJP///wEAAABzAAAAWP///wAAAQUUAAAADAAAAAQAAAAAAAAASP///wEAAABvAAAAfP///wAAAQIcAAAADAAAAAQAAAAAAAAAsP///wAAAAFAAAAAAQAAAGkAAACo////AAABBhQAAAAMAAAABAAAAAAAAACY////AQAAAGIAAADM////AAABAiQAAAAUAAAABAAAAAAAAAAIAAwACAAHAAgAAAAAAAABQAAAAAEAAABJAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEGGAAAABAAAAAEAAAAAAAAAAQABAAEAAAAAQAAAEIAAAAAAAAA\x00\x18"parquet-cpp version 1.5.1-SNAPSHOT\x19l\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x00#\x0b\x00\x00PAR1'
Issue Analytics
- State:
- Created 3 years ago
- Comments: 6 (2 by maintainers)
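The pandas metadata embedded in Out[12] can be checked against the dtypes from Out[1] without kartothek installed. A minimal sketch using only the standard library: the `columns` entries are copied from Out[12] and trimmed to the relevant keys, and `find_dtype_mismatches` is a hypothetical helper, not a kartothek function.

```python
import json

# "columns" entries copied from the pandas metadata in Out[12],
# trimmed to the keys that matter here.
stored_columns = json.loads(
    """
    [
      {"name": "B", "numpy_type": "boolean", "pandas_type": "bool"},
      {"name": "I", "numpy_type": "int64", "pandas_type": "int64"},
      {"name": "b", "numpy_type": "bool", "pandas_type": "bool"},
      {"name": "i", "numpy_type": "int64", "pandas_type": "int64"},
      {"name": "o", "numpy_type": "object", "pandas_type": "unicode"},
      {"name": "s", "numpy_type": "string", "pandas_type": "unicode"}
    ]
    """
)

# dtypes of the original DataFrame, copied from Out[1].
original_dtypes = {
    "i": "int64",
    "I": "Int64",
    "o": "object",
    "s": "string",
    "b": "bool",
    "B": "boolean",
}


def find_dtype_mismatches(columns, expected):
    """Return {column: (expected dtype, stored numpy_type)} where the two differ."""
    return {
        col["name"]: (expected[col["name"]], col["numpy_type"])
        for col in columns
        if col["numpy_type"] != expected[col["name"]]
    }


mismatches = find_dtype_mismatches(stored_columns, original_dtypes)
print(mismatches)  # {'I': ('Int64', 'int64')}
```

Only `I` is reported: the nullable `boolean` and `string` dtypes survive in the stored metadata, while `Int64` is recorded as `int64`.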
We do have tests for categoricals, but not systematically as part of the “all types dataframes”.
We’re not using these in production but have tests (that need skipping). Is there a specific reason you are not testing for categoricals?