Kartothek metadata does not distinguish int64 and Int64

In [1]: from functools import partial
   ...: 
   ...: import pandas as pd
   ...: import storefact
   ...: from kartothek.core.common_metadata import empty_dataframe_from_schema
   ...: from kartothek.io.dask.dataframe import read_dataset_as_ddf
   ...: from kartothek.io.eager import read_table, store_dataframes_as_dataset
   ...: 
   ...: store = partial(storefact.get_store_from_url, "hfs:///tmp")
   ...: 
   ...: df = pd.DataFrame(
   ...:     {
   ...:         "i": [0, 1],
   ...:         "I": [0, pd.NA],
   ...:         "o": ["a", None],
   ...:         "s": ["b", pd.NA],
   ...:         "b": [True, False],
   ...:         "B": [True, pd.NA],
   ...:     }
   ...: )
   ...: df["I"] = df["I"].astype("Int64")
   ...: df["s"] = df["s"].astype("string")
   ...: df["B"] = df["B"].astype("boolean")
   ...: 
   ...: df.dtypes
Out[1]: 
i      int64
I      Int64
o     object
s     string
b       bool
B    boolean
dtype: object

In [2]: dm = store_dataframes_as_dataset(
   ...:     dfs=[df], dataset_uuid="dataset", store=store, overwrite=True
   ...: )
   ...: 
   ...: df_meta = empty_dataframe_from_schema(schema=dm.table_meta["table"])
   ...: df_meta.dtypes
Out[2]: 
B    boolean
I      int64
b       bool
i      int64
o     object
s     string
dtype: object

In [3]: ddf = read_dataset_as_ddf(dataset_uuid="dataset", store=store)
   ...: ddf._meta.dtypes
Out[3]: 
B    boolean
I      int64
b       bool
i      int64
o     object
s     string
dtype: object

In [4]: ddf.compute().dtypes
Out[4]: 
B    boolean
I      Int64
b       bool
i      int64
o     object
s     string
dtype: object

Copy-pastable version:

from functools import partial

import pandas as pd
import storefact
from kartothek.core.common_metadata import empty_dataframe_from_schema
from kartothek.io.dask.dataframe import read_dataset_as_ddf
from kartothek.io.eager import read_table, store_dataframes_as_dataset

store = partial(storefact.get_store_from_url, "hfs:///tmp")

df = pd.DataFrame(
    {
        "i": [0, 1],
        "I": [0, pd.NA],
        "o": ["a", None],
        "s": ["b", pd.NA],
        "b": [True, False],
        "B": [True, pd.NA],
    }
)
df["I"] = df["I"].astype("Int64")
df["s"] = df["s"].astype("string")
df["B"] = df["B"].astype("boolean")

df.dtypes

dm = store_dataframes_as_dataset(
    dfs=[df], dataset_uuid="dataset", store=store, overwrite=True
)

df_meta = empty_dataframe_from_schema(schema=dm.table_meta["table"])
df_meta.dtypes

ddf = read_dataset_as_ddf(dataset_uuid="dataset", store=store)
ddf._meta.dtypes

ddf.compute().dtypes
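
The disagreement between `ddf._meta.dtypes` and `ddf.compute().dtypes` in the session above can also be detected programmatically. The following is a minimal sketch, not part of kartothek: `dtype_mismatches` is a hypothetical helper, and the two frames are hand-built stand-ins for `ddf._meta` and `ddf.compute()` so that the snippet runs with pandas alone.

```python
import pandas as pd


def dtype_mismatches(meta: pd.DataFrame, actual: pd.DataFrame) -> dict:
    """Return {column: (meta dtype, actual dtype)} for columns whose dtype names differ."""
    mismatches = {}
    for col in meta.columns:
        meta_dtype = str(meta[col].dtype)
        actual_dtype = str(actual[col].dtype)
        # Compare dtype names as strings so that "int64" (NumPy) and
        # "Int64" (nullable extension dtype) are treated as different.
        if meta_dtype != actual_dtype:
            mismatches[col] = (meta_dtype, actual_dtype)
    return mismatches


# Hand-built stand-ins (assumption): `meta` mimics ddf._meta from Out[3],
# `actual` mimics ddf.compute() from Out[4], restricted to the integer columns.
meta = pd.DataFrame(
    {"I": pd.Series([], dtype="int64"), "i": pd.Series([], dtype="int64")}
)
actual = pd.DataFrame(
    {"I": pd.Series([0, pd.NA], dtype="Int64"), "i": pd.Series([0, 1], dtype="int64")}
)

result = dtype_mismatches(meta, actual)
print(result)  # {'I': ('int64', 'Int64')}
```

With the real objects from the reproduction, the call would be `dtype_mismatches(ddf._meta, ddf.compute())`.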

The dtype is stored incorrectly in the kartothek metadata. Column I is recorded with numpy_type int64 rather than Int64:

In [12]: store().get("dataset/table/_common_metadata")
Out[12]: b'PAR1\x15\x04\x19|5\x00\x18\x06schema\x15\x0c\x00\x15\x00%\x02\x18\x01B\x00\x15\x04%\x02\x18\x01I\x00\x15\x00%\x02\x18\x01b\x00\x15\x04%\x02\x18\x01i\x00\x15\x0c%\x02\x18\x01o%\x00L\x1c\x00\x00\x00\x15\x0c%\x02\x18\x01s%\x00L\x1c\x00\x00\x00\x16\x00\x19\x0c\x19,\x18\x06pandas\x18\x99\x07{"column_indexes": [{"field_name": null, "metadata": {"encoding": "UTF-8"}, "name": null, "numpy_type": "object", "pandas_type": "unicode"}], "columns": [{"field_name": "B", "metadata": null, "name": "B", "numpy_type": "boolean", "pandas_type": "bool"}, {"field_name": "I", "metadata": null, "name": "I", "numpy_type": "int64", "pandas_type": "int64"}, {"field_name": "b", "metadata": null, "name": "b", "numpy_type": "bool", "pandas_type": "bool"}, {"field_name": "i", "metadata": null, "name": "i", "numpy_type": "int64", "pandas_type": "int64"}, {"field_name": "o", "metadata": null, "name": "o", "numpy_type": "object", "pandas_type": "unicode"}, {"field_name": "s", "metadata": null, "name": "s", "numpy_type": "string", "pandas_type": "unicode"}], "creator": {"library": "pyarrow", "version": "1.0.1"}, "index_columns": [{"kind": "range", "name": null, "start": 0, "step": 1, "stop": 2}], "pandas_version": 
"1.1.3"}\x00\x18\x0cARROW:schema\x18\xe0\r/////yAFAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAANADAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAAJkDAAB7ImNvbHVtbl9pbmRleGVzIjogW3siZmllbGRfbmFtZSI6IG51bGwsICJtZXRhZGF0YSI6IHsiZW5jb2RpbmciOiAiVVRGLTgifSwgIm5hbWUiOiBudWxsLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSJ9XSwgImNvbHVtbnMiOiBbeyJmaWVsZF9uYW1lIjogIkIiLCAibWV0YWRhdGEiOiBudWxsLCAibmFtZSI6ICJCIiwgIm51bXB5X3R5cGUiOiAiYm9vbGVhbiIsICJwYW5kYXNfdHlwZSI6ICJib29sIn0sIHsiZmllbGRfbmFtZSI6ICJJIiwgIm1ldGFkYXRhIjogbnVsbCwgIm5hbWUiOiAiSSIsICJudW1weV90eXBlIjogImludDY0IiwgInBhbmRhc190eXBlIjogImludDY0In0sIHsiZmllbGRfbmFtZSI6ICJiIiwgIm1ldGFkYXRhIjogbnVsbCwgIm5hbWUiOiAiYiIsICJudW1weV90eXBlIjogImJvb2wiLCAicGFuZGFzX3R5cGUiOiAiYm9vbCJ9LCB7ImZpZWxkX25hbWUiOiAiaSIsICJtZXRhZGF0YSI6IG51bGwsICJuYW1lIjogImkiLCAibnVtcHlfdHlwZSI6ICJpbnQ2NCIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCJ9LCB7ImZpZWxkX25hbWUiOiAibyIsICJtZXRhZGF0YSI6IG51bGwsICJuYW1lIjogIm8iLCAibnVtcHlfdHlwZSI6ICJvYmplY3QiLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSJ9LCB7ImZpZWxkX25hbWUiOiAicyIsICJtZXRhZGF0YSI6IG51bGwsICJuYW1lIjogInMiLCAibnVtcHlfdHlwZSI6ICJzdHJpbmciLCAicGFuZGFzX3R5cGUiOiAidW5pY29kZSJ9XSwgImNyZWF0b3IiOiB7ImxpYnJhcnkiOiAicHlhcnJvdyIsICJ2ZXJzaW9uIjogIjEuMC4xIn0sICJpbmRleF9jb2x1bW5zIjogW3sia2luZCI6ICJyYW5nZSIsICJuYW1lIjogbnVsbCwgInN0YXJ0IjogMCwgInN0ZXAiOiAxLCAic3RvcCI6IDJ9XSwgInBhbmRhc192ZXJzaW9uIjogIjEuMS4zIn0AAAAGAAAA9AAAAKwAAACEAAAAVAAAACwAAAAEAAAANP///wAAAQUUAAAADAAAAAQAAAAAAAAAJP///wEAAABzAAAAWP///wAAAQUUAAAADAAAAAQAAAAAAAAASP///wEAAABvAAAAfP///wAAAQIcAAAADAAAAAQAAAAAAAAAsP///wAAAAFAAAAAAQAAAGkAAACo////AAABBhQAAAAMAAAABAAAAAAAAACY////AQAAAGIAAADM////AAABAiQAAAAUAAAABAAAAAAAAAAIAAwACAAHAAgAAAAAAAABQAAAAAEAAABJAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEGGAAAABAAAAAEAAAAAAAAAAQABAAEAAAAAQAAAEIAAAAAAAAA\x00\x18"parquet-cpp version 1.5.1-SNAPSHOT\x19l\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x1c\x00\x00\x00#\x0b\x00\x00PAR1'

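The readable part of that blob is the JSON stored under the `pandas` key. As a rough sketch, the column entries below are copied from that output and compared with the dtypes printed in `Out[1]`; `find_dtype_mismatches` is a hypothetical helper used only for illustration, not a kartothek or pyarrow function.

```python
import json

# Column entries copied from the "pandas" metadata in the output above.
columns_json = """
[
  {"field_name": "B", "metadata": null, "name": "B", "numpy_type": "boolean", "pandas_type": "bool"},
  {"field_name": "I", "metadata": null, "name": "I", "numpy_type": "int64", "pandas_type": "int64"},
  {"field_name": "b", "metadata": null, "name": "b", "numpy_type": "bool", "pandas_type": "bool"},
  {"field_name": "i", "metadata": null, "name": "i", "numpy_type": "int64", "pandas_type": "int64"},
  {"field_name": "o", "metadata": null, "name": "o", "numpy_type": "object", "pandas_type": "unicode"},
  {"field_name": "s", "metadata": null, "name": "s", "numpy_type": "string", "pandas_type": "unicode"}
]
"""

# dtypes of the original DataFrame, as printed by df.dtypes in Out[1].
original_dtypes = {
    "i": "int64",
    "I": "Int64",
    "o": "object",
    "s": "string",
    "b": "bool",
    "B": "boolean",
}


def find_dtype_mismatches(columns, expected):
    """Return {name: (expected dtype, stored numpy_type)} where the two differ."""
    return {
        col["name"]: (expected[col["name"]], col["numpy_type"])
        for col in columns
        if col["numpy_type"] != expected[col["name"]]
    }


columns = json.loads(columns_json)
mismatches = find_dtype_mismatches(columns, original_dtypes)
print(mismatches)  # {'I': ('Int64', 'int64')}
```

Only column I differs: the nullable string and boolean columns keep their extension dtype names in `numpy_type`, while the nullable integer column is recorded as plain `int64`.
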
Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
fjetter commented, Mar 11, 2021

"Is there a specific reason you are not testing for categoricals?"

We do have tests for categoricals, but not systematically as part of the “all types dataframes”.

0 reactions
mlondschien commented, Feb 25, 2021

We’re not using these in production but have tests (that need skipping). Is there a specific reason you are not testing for categoricals?

