bug(duckdb): Can't cast string column to Enum or 'category'
Hi, thank you for developing Ibis. I’m planning to use it to streamline some data transformations, and I ran into this bug. I successfully registered a Parquet dataset and tried to convert a string column with categorical values to Enum. I also found some references on how DuckDB and SQLAlchemy work with Enum types.
Here’s how to reproduce the error using a memtable:
import ibis
import pandas as pd
con = ibis.duckdb.connect(':memory:')
bugdf = pd.DataFrame({'colA': ['A', 'A', 'A', 'B', 'B'],
                      'colB': [1, 2, 3, 4, 5]})
bugdf
t = ibis.memtable(bugdf)
t
PandasInMemoryTable
  data:
    DataFrameProxy:
        colA  colB
      0    A     1
      1    A     2
      2    A     3
      3    B     4
      4    B     5
And here is the cast with the traceback:
t['colA'].cast( 'category' ).execute()
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In [45], line 1
----> 1 t['colA'].cast( 'category' ).execute()
File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/expr/types/core.py:291, in Expr.execute(self, limit, timecontext, params, **kwargs)
266 def execute(
267 self,
268 limit: int | str | None = 'default',
(...)
271 **kwargs: Any,
272 ):
273 """Execute an expression against its backend if one exists.
274
275 Parameters
(...)
289 Mapping of scalar parameter expressions to value
290 """
--> 291 return self._find_backend(use_default=True).execute(
292 self, limit=limit, timecontext=timecontext, params=params, **kwargs
293 )
File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/__init__.py:182, in BaseSQLBackend.execute(self, expr, params, limit, **kwargs)
178 kwargs.pop('timecontext', None)
179 query_ast = self.compiler.to_ast_ensure_limit(
180 expr, limit, params=params
181 )
--> 182 sql = query_ast.compile()
183 self._log(sql)
185 schema = self.ast_schema(query_ast, **kwargs)
File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/compiler/base.py:40, in QueryAST.compile(self)
38 def compile(self):
39 compiled_setup_queries = [q.compile() for q in self.setup_queries]
---> 40 compiled_queries = [q.compile() for q in self.queries]
41 compiled_teardown_queries = [
42 q.compile() for q in self.teardown_queries
43 ]
44 return self.context.collapse(
45 list(
46 chain(
(...)
51 )
52 )
File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/compiler/base.py:40, in <listcomp>(.0)
38 def compile(self):
39 compiled_setup_queries = [q.compile() for q in self.setup_queries]
---> 40 compiled_queries = [q.compile() for q in self.queries]
41 compiled_teardown_queries = [
42 q.compile() for q in self.teardown_queries
43 ]
44 return self.context.collapse(
45 list(
46 chain(
(...)
51 )
52 )
File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/alchemy/query_builder.py:193, in AlchemySelect.compile(self)
184 steps = [
185 self._add_select,
186 self._add_groupby,
(...)
189 self._add_limit,
190 ]
192 for step in steps:
--> 193 frag = step(frag)
195 return frag
File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/alchemy/query_builder.py:223, in AlchemySelect._add_select(self, table_set)
221 for expr in self.select_set:
222 if isinstance(expr, ir.Value):
--> 223 arg = self._translate(expr, named=True)
224 elif isinstance(expr, ir.Table):
225 if expr.equals(self.table_set):
File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/compiler/query_builder.py:253, in Select._translate(self, expr, named, permit_subquery)
246 def _translate(self, expr, named=False, permit_subquery=False):
247 translator = self.translator_class(
248 expr,
249 context=self.context,
250 named=named,
251 permit_subquery=permit_subquery,
252 )
--> 253 return translator.get_result()
File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/compiler/translator.py:221, in ExprTranslator.get_result(self)
219 def get_result(self):
220 """Compile SQL expression into a string."""
--> 221 translated = self.translate(self.expr)
222 if self._needs_name(self.expr):
223 # TODO: this could fail in various ways
224 name = self.expr.get_name()
File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/compiler/translator.py:256, in ExprTranslator.translate(self, expr)
254 elif type(op) in self._registry:
255 formatter = self._registry[type(op)]
--> 256 return formatter(self, expr)
257 else:
258 raise com.OperationNotDefinedError(
259 f'No translation rule for {type(op)}'
260 )
File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/alchemy/registry.py:228, in _alias(t, expr)
224 def _alias(t, expr):
225 # just compile the underlying argument because the naming is handled
226 # by the translator for the top level expression
227 op = expr.op()
--> 228 return t.translate(op.arg)
File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/compiler/translator.py:256, in ExprTranslator.translate(self, expr)
254 elif type(op) in self._registry:
255 formatter = self._registry[type(op)]
--> 256 return formatter(self, expr)
257 else:
258 raise com.OperationNotDefinedError(
259 f'No translation rule for {type(op)}'
260 )
File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/alchemy/registry.py:170, in _cast(t, expr)
167 arg, typ = expr.op().args
169 sa_arg = t.translate(arg)
--> 170 sa_type = t.get_sqla_type(typ)
172 if isinstance(arg, ir.CategoryValue) and typ == dt.int32:
173 return sa_arg
File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/alchemy/translator.py:57, in AlchemyExprTranslator.get_sqla_type(self, data_type)
56 def get_sqla_type(self, data_type):
---> 57 return to_sqla_type(data_type, type_map=self._type_map)
File ~/.pyenv/versions/3.10.7/lib/python3.10/functools.py:889, in singledispatch.<locals>.wrapper(*args, **kw)
885 if not args:
886 raise TypeError(f'{funcname} requires at least '
887 '1 positional argument')
--> 889 return dispatch(args[0].__class__)(*args, **kw)
File ~/Library/Caches/pypoetry/virtualenvs/prtax-mz5tN2Pu-py3.10/lib/python3.10/site-packages/ibis/backends/base/sql/alchemy/datatypes.py:116, in to_sqla_type(itype, type_map)
114 if type_map is None:
115 type_map = ibis_type_to_sqla
--> 116 return type_map[type(itype)]
KeyError: <class 'ibis.expr.datatypes.core.Category'>
Looking over the files mentioned, the dictionary defined in ibis/backends/base/sql/alchemy/datatypes.py doesn’t have keys for dt.Enum or dt.Category yet, although I’m not sure how they should be added.
ibis_type_to_sqla = {
    dt.Null: sa.types.NullType,
    dt.Date: sa.Date,
    dt.Time: sa.Time,
    dt.Boolean: sa.Boolean,
    dt.Binary: sa.LargeBinary,
    dt.String: sa.Text,
    dt.Decimal: sa.NUMERIC,
    # Mantissa-based
    dt.Float16: sa.REAL,
    dt.Float32: sa.REAL,
    # precision is the number of bits in the mantissa
    # without specifying this, some backends interpret the type as FLOAT, which
    # means float32 (and precision == 24)
    dt.Float64: sa.Float(precision=53),
    dt.Int8: sa.SmallInteger,
    dt.Int16: sa.SmallInteger,
    dt.Int32: sa.Integer,
    dt.Int64: sa.BigInteger,
    dt.UInt8: UInt8,
    dt.UInt16: UInt16,
    dt.UInt32: UInt32,
    dt.UInt64: UInt64,
    dt.JSON: sa.JSON,
}
Any help is greatly appreciated. Thank you!
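The failure in the traceback comes down to a plain dictionary lookup keyed by the datatype's class. The following is a minimal sketch of that mechanism only, not Ibis code: the `String` and `Category` classes, the `to_sql_type` helper, and the `"TEXT"` values are hypothetical stand-ins. The last two lines assume that a category could be mapped to the same SQL type as a string, which may not be how Ibis chooses to handle it.

```python
# Hypothetical stand-ins for ibis datatypes (not the real ibis classes).
class String:
    pass


class Category:
    pass


# Stand-in for ibis_type_to_sqla: String is registered, Category is not.
type_map = {String: "TEXT"}


def to_sql_type(itype, type_map=type_map):
    """Look up the SQL type by the exact class of the datatype instance."""
    return type_map[type(itype)]


try:
    to_sql_type(Category())
    lookup_failed = False
except KeyError:
    # Same failure mode as the traceback: the class is simply not a key.
    lookup_failed = True

# Assumption: Category is treated as a string type by another name.
extended_map = {**type_map, Category: "TEXT"}
category_sql_type = to_sql_type(Category(), extended_map)
```

Because the lookup uses the exact class as the key, a datatype missing from the mapping raises `KeyError` rather than a more descriptive error.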
Top GitHub Comments
Interesting, thanks @jzavala-gonzalez. As we were looking into this, it turns out that DuckDB doesn’t save enum columns as an actual enum type when exporting to Parquet; the column is written as a plain string. But the Parquet format has dictionary encoding of strings, so it’s basically just as efficient to store the same string repeatedly in a Parquet file anyway.
We’ve actually been thinking about deprecating enum types in Ibis, as they can be difficult to work with: every value has to be specified in the schema. Category types would be nerfed and become a string type by another name. So the cast you’re trying wouldn’t do much anyway.
Thanks for reporting the issue!
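To illustrate the dictionary encoding mentioned above, here is a rough pure-Python sketch of the idea: repeated strings are stored once, and each row keeps only a small integer code. The function names are hypothetical, and this is not how Parquet or DuckDB implement it internally.

```python
def dictionary_encode(values):
    """Split a column into its distinct values and per-row integer codes."""
    categories = []   # distinct values, in order of first appearance
    index = {}        # value -> position in `categories`
    codes = []        # one integer code per input row
    for value in values:
        if value not in index:
            index[value] = len(categories)
            categories.append(value)
        codes.append(index[value])
    return categories, codes


def dictionary_decode(categories, codes):
    """Rebuild the original column from the dictionary and the codes."""
    return [categories[code] for code in codes]


col_a = ['A', 'A', 'A', 'B', 'B']
categories, codes = dictionary_encode(col_a)
roundtrip = dictionary_decode(categories, codes)
```

With the `colA` values from the report, the dictionary holds two entries and the rows reduce to five small integers, which is why a repeated string column can stay compact on disk without an explicit enum type.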
Hi, the step I was working on when I attempted cast('category') was related to ingesting an unprocessed Parquet file into DuckDB via Ibis and converting its columns to appropriate datatypes that I can export to another Parquet file. There are a few columns that basically repeat the same 3 or 4 values, so I figured I should try making them categorical.
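One possible workaround, if the categorical type is only needed on the pandas side, is to leave the column as a string in Ibis and convert it after execution. The sketch below assumes a pandas DataFrame like the one an expression's `.execute()` would return; the DataFrame here is built directly as a stand-in, so no Ibis call is shown.

```python
import pandas as pd

# Stand-in for the DataFrame an Ibis expression's .execute() would return.
df = pd.DataFrame({'colA': ['A', 'A', 'A', 'B', 'B'],
                   'colB': [1, 2, 3, 4, 5]})

# Convert the low-cardinality string column to pandas' categorical dtype.
df['colA'] = df['colA'].astype('category')

# The distinct values and the per-row integer codes pandas stores.
categories = list(df['colA'].cat.categories)
codes = df['colA'].cat.codes.tolist()
```

This only changes the in-memory pandas representation; whether the categorical type survives a later export depends on the writer used.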