bug(duckdb): Can't cast string column to Enum or 'category'


Hi, thank you for developing Ibis. I’m planning to use it to streamline some data transformations, and I ran into this bug. I successfully registered a Parquet dataset and tried to convert a string column with categorical values to Enum. I also found some references on how DuckDB and SQLAlchemy work with Enum types.

Here’s how to replicate the subsequent error using a memtable:

import ibis
import pandas as pd
con = ibis.duckdb.connect(':memory:')
bugdf = pd.DataFrame({'colA': ['A','A','A','B','B'],
                      'colB': [ 1 , 2 , 3 , 4 , 5 ]})
bugdf

t = ibis.memtable(bugdf)
t
PandasInMemoryTable
  data:
    DataFrameProxy:
        colA  colB
      0    A     1
      1    A     2
      2    A     3
      3    B     4
      4    B     5

And here is the cast with the traceback:

t['colA'].cast( 'category' ).execute()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In [45], line 1
----> 1 t['colA'].cast( 'category' ).execute()

File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/expr/types/core.py:291, in Expr.execute(self, limit, timecontext, params, **kwargs)
    266 def execute(
    267     self,
    268     limit: int | str | None = 'default',
   (...)
    271     **kwargs: Any,
    272 ):
    273     """Execute an expression against its backend if one exists.
    274 
    275     Parameters
   (...)
    289         Mapping of scalar parameter expressions to value
    290     """
--> 291     return self._find_backend(use_default=True).execute(
    292         self, limit=limit, timecontext=timecontext, params=params, **kwargs
    293     )

File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/__init__.py:182, in BaseSQLBackend.execute(self, expr, params, limit, **kwargs)
    178 kwargs.pop('timecontext', None)
    179 query_ast = self.compiler.to_ast_ensure_limit(
    180     expr, limit, params=params
    181 )
--> 182 sql = query_ast.compile()
    183 self._log(sql)
    185 schema = self.ast_schema(query_ast, **kwargs)

File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/compiler/base.py:40, in QueryAST.compile(self)
     38 def compile(self):
     39     compiled_setup_queries = [q.compile() for q in self.setup_queries]
---> 40     compiled_queries = [q.compile() for q in self.queries]
     41     compiled_teardown_queries = [
     42         q.compile() for q in self.teardown_queries
     43     ]
     44     return self.context.collapse(
     45         list(
     46             chain(
   (...)
     51         )
     52     )

File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/compiler/base.py:40, in <listcomp>(.0)
     38 def compile(self):
     39     compiled_setup_queries = [q.compile() for q in self.setup_queries]
---> 40     compiled_queries = [q.compile() for q in self.queries]
     41     compiled_teardown_queries = [
     42         q.compile() for q in self.teardown_queries
     43     ]
     44     return self.context.collapse(
     45         list(
     46             chain(
   (...)
     51         )
     52     )

File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/alchemy/query_builder.py:193, in AlchemySelect.compile(self)
    184 steps = [
    185     self._add_select,
    186     self._add_groupby,
   (...)
    189     self._add_limit,
    190 ]
    192 for step in steps:
--> 193     frag = step(frag)
    195 return frag

File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/alchemy/query_builder.py:223, in AlchemySelect._add_select(self, table_set)
    221 for expr in self.select_set:
    222     if isinstance(expr, ir.Value):
--> 223         arg = self._translate(expr, named=True)
    224     elif isinstance(expr, ir.Table):
    225         if expr.equals(self.table_set):

File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/compiler/query_builder.py:253, in Select._translate(self, expr, named, permit_subquery)
    246 def _translate(self, expr, named=False, permit_subquery=False):
    247     translator = self.translator_class(
    248         expr,
    249         context=self.context,
    250         named=named,
    251         permit_subquery=permit_subquery,
    252     )
--> 253     return translator.get_result()

File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/compiler/translator.py:221, in ExprTranslator.get_result(self)
    219 def get_result(self):
    220     """Compile SQL expression into a string."""
--> 221     translated = self.translate(self.expr)
    222     if self._needs_name(self.expr):
    223         # TODO: this could fail in various ways
    224         name = self.expr.get_name()

File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/compiler/translator.py:256, in ExprTranslator.translate(self, expr)
    254 elif type(op) in self._registry:
    255     formatter = self._registry[type(op)]
--> 256     return formatter(self, expr)
    257 else:
    258     raise com.OperationNotDefinedError(
    259         f'No translation rule for {type(op)}'
    260     )

File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/alchemy/registry.py:228, in _alias(t, expr)
    224 def _alias(t, expr):
    225     # just compile the underlying argument because the naming is handled
    226     # by the translator for the top level expression
    227     op = expr.op()
--> 228     return t.translate(op.arg)

File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/compiler/translator.py:256, in ExprTranslator.translate(self, expr)
    254 elif type(op) in self._registry:
    255     formatter = self._registry[type(op)]
--> 256     return formatter(self, expr)
    257 else:
    258     raise com.OperationNotDefinedError(
    259         f'No translation rule for {type(op)}'
    260     )

File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/alchemy/registry.py:170, in _cast(t, expr)
    167 arg, typ = expr.op().args
    169 sa_arg = t.translate(arg)
--> 170 sa_type = t.get_sqla_type(typ)
    172 if isinstance(arg, ir.CategoryValue) and typ == dt.int32:
    173     return sa_arg

File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/alchemy/translator.py:57, in AlchemyExprTranslator.get_sqla_type(self, data_type)
     56 def get_sqla_type(self, data_type):
---> 57     return to_sqla_type(data_type, type_map=self._type_map)

File ~/.pyenv/versions/3.10.7/lib/python3.10/functools.py:889, in singledispatch.<locals>.wrapper(*args, **kw)
    885 if not args:
    886     raise TypeError(f'{funcname} requires at least '
    887                     '1 positional argument')
--> 889 return dispatch(args[0].__class__)(*args, **kw)

File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/alchemy/datatypes.py:116, in to_sqla_type(itype, type_map)
    114 if type_map is None:
    115     type_map = ibis_type_to_sqla
--> 116 return type_map[type(itype)]

KeyError: <class 'ibis.expr.datatypes.core.Category'>

Looking over the files mentioned, the dictionary defined in ibis/backends/base/sql/alchemy/datatypes.py doesn’t have keys for dt.Enum or dt.Category yet, although I’m not sure how they should be added.

ibis_type_to_sqla = {
    dt.Null: sa.types.NullType,
    dt.Date: sa.Date,
    dt.Time: sa.Time,
    dt.Boolean: sa.Boolean,
    dt.Binary: sa.LargeBinary,
    dt.String: sa.Text,
    dt.Decimal: sa.NUMERIC,
    # Mantissa-based
    dt.Float16: sa.REAL,
    dt.Float32: sa.REAL,
    # precision is the number of bits in the mantissa
    # without specifying this, some backends interpret the type as FLOAT, which
    # means float32 (and precision == 24)
    dt.Float64: sa.Float(precision=53),
    dt.Int8: sa.SmallInteger,
    dt.Int16: sa.SmallInteger,
    dt.Int32: sa.Integer,
    dt.Int64: sa.BigInteger,
    dt.UInt8: UInt8,
    dt.UInt16: UInt16,
    dt.UInt32: UInt32,
    dt.UInt64: UInt64,
    dt.JSON: sa.JSON,
}
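To illustrate why the lookup at the bottom of the traceback fails, here is a minimal sketch of the same pattern: a dictionary keyed by type classes, indexed with `type(itype)`. The classes `String`, `Int64`, and `Category`, the string values, and the function `to_sql_type` are stand-ins invented for this example; they are not the real Ibis or SQLAlchemy objects.

```python
# Stand-in type classes (hypothetical, not ibis.expr.datatypes).
class String: ...
class Int64: ...
class Category: ...

# A type map with no entry for Category, mirroring the gap described above.
type_map = {String: "TEXT", Int64: "BIGINT"}

def to_sql_type(itype, type_map):
    # Same lookup shape as the last traceback frame: index by the instance's class.
    return type_map[type(itype)]

text_type = to_sql_type(String(), type_map)  # found in the map

try:
    to_sql_type(Category(), type_map)
    missing = None
except KeyError as exc:
    # The missing key is the class itself, as in the reported KeyError.
    missing = exc.args[0]
```

Any type without an entry in the map fails the same way, which matches the `KeyError` on the `Category` class shown in the traceback.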

Any help is greatly appreciated. Thank you!

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5

Top GitHub Comments

1 reaction
saulpw commented, Nov 3, 2022

Interesting, thanks @jzavala-gonzalez. As we were looking into this, it turns out that DuckDB doesn’t save enum columns as an actual enum type when exporting to Parquet; the column is just a string. But the Parquet format has dictionary encoding of strings, so it’s basically just as efficient to store the same string repeatedly in a Parquet file anyway.

We’ve actually been thinking about deprecating enum types in Ibis, as they can be difficult to work with: every value has to be specified in the schema. Category types would be nerfed and become a string type by another name. So the cast you’re trying wouldn’t do much anyway.

Thanks for reporting the issue!
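The dictionary encoding mentioned in this comment can be sketched in plain Python. This only illustrates the general idea (each distinct string stored once, plus a list of small integer codes); it is not Parquet's actual on-disk layout, and the function names `dictionary_encode` and `dictionary_decode` are made up for the example.

```python
def dictionary_encode(values):
    """Split values into (dictionary, codes); each distinct value is stored once."""
    dictionary = []   # distinct values, in order of first appearance
    index_of = {}     # value -> position in the dictionary
    codes = []        # one integer code per input value
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes

def dictionary_decode(dictionary, codes):
    """Rebuild the original values from the dictionary and the codes."""
    return [dictionary[code] for code in codes]

# The colA values from the issue's sample data.
col_a = ["A", "A", "A", "B", "B"]
dictionary, codes = dictionary_encode(col_a)
roundtrip = dictionary_decode(dictionary, codes)
```

For a column that repeats the same few values, the dictionary stays tiny and only the integer codes grow with the row count, which is why a separate enum type adds little for storage.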

0 reactions
jzavala-gonzalez commented, Nov 3, 2022

Hi, the step I was working on when I attempted cast('category') was ingesting an unprocessed Parquet file into DuckDB via Ibis and converting its columns to appropriate datatypes that I can export to another Parquet file. There are a few columns that basically repeat the same 3 or 4 values, so I figured I should try making them categorical.
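If the goal is a categorical column in memory rather than in the backend, one possible workaround is to convert on the pandas side. This is a sketch under the assumption that the data is already in a pandas DataFrame (for example, the frame returned by `execute()`); it uses only pandas' own `astype('category')` and is not a fix for the Ibis cast or something the maintainers suggested.

```python
import pandas as pd

# Same sample data as in the issue.
bugdf = pd.DataFrame({"colA": ["A", "A", "A", "B", "B"],
                      "colB": [1, 2, 3, 4, 5]})

# Convert the low-cardinality string column to pandas' categorical dtype.
as_category = bugdf["colA"].astype("category")

# The distinct values and the per-row integer codes pandas keeps internally.
categories = list(as_category.cat.categories)
codes = as_category.cat.codes.tolist()
```

Per the maintainer's comment above, this dtype would not be kept as an enum when the data is written to Parquet through DuckDB, so it mainly helps with memory use inside pandas.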

