Unable to read S3 parquet files using DuckDB as Database Connector
See original GitHub issueI was following the DuckDB setup as per this PR and was able to load the DB file and create charts.
As a next step I wanted to load S3 parquet files in to Superset using DuckDB in memory option duckdb:///:memory:
Before trying with Superset, I used below Python code to check the DuckDB S3 parquet loading and found it to be working:
- Without duckdb-engine
import duckdb
cursor = duckdb.connect()
cursor.execute("INSTALL httpfs;")
cursor.execute("LOAD httpfs;")
cursor.execute("SET s3_region='******'")
cursor.execute("SET s3_access_key_id=''**************")
cursor.execute("SET s3_secret_access_key='*****************************'")
cursor.execute("PRAGMA enable_profiling;")
cursor.execute("SELECT count(*) FROM read_parquet('s3://<bucket>/prefix/*.parquet'")
- With duckdb-engine
# pip install duckdb-engine
import sqlalchemy as sa
engine = sa.create_engine("duckdb:///:memory:")
with engine.connect() as conn:
conn.execute(
f"""
INSTALL httpfs;
LOAD httpfs;
SET s3_region='ap-south-1';
SET s3_access_key_id='****************';
SET s3_secret_access_key='*****************************';
"""
)
r = conn.execute(
"""
SELECT count(*) FROM read_parquet('s3://<bucket>/prefix/*.parquet');
"""
).fetchall()
for row in r:
print(dict(row))
When I tried to SET the S3 environment values in SQL Editor I was getting the below error on the UI:
Error:
DuckDB Error
duckdb error: Catalog Error: unrecognized configuration parameter "s3_region"
Did you mean: "TimeZone"
This may be triggered by:
Issue 1002 - The database returned an unexpected error.
See less
Full Trace:
superset_cache | 1:M 14 Jul 2022 14:18:00.250 * Background saving terminated with success
superset_app | 127.0.0.1 - - [14/Jul/2022:14:18:24 +0000] "GET /health HTTP/1.1" 200 2 "-" "curl/7.74.0"
superset_app | 172.20.0.1 - - [14/Jul/2022:14:18:34 +0000] "GET /api/v1/me/ HTTP/1.1" 200 150 "http://localhost:8088/superset/sqllab/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
superset_app | 172.20.0.1 - - [14/Jul/2022:14:18:34 +0000] "GET /api/v1/me/ HTTP/1.1" 200 150 "http://localhost:8088/superset/sqllab/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
superset_app | Triggering query_id: 60
superset_app | 2022-07-14 14:18:35,809:INFO:superset.sqllab.command:Triggering query_id: 60
superset_app | Query 60: Executing 1 statement(s)
superset_app | 2022-07-14 14:18:35,942:INFO:superset.sql_lab:Query 60: Executing 1 statement(s)
superset_app | Query 60: Set query to 'running'
superset_app | 2022-07-14 14:18:35,943:INFO:superset.sql_lab:Query 60: Set query to 'running'
superset_app | Query 60: Running statement 1 out of 1
superset_app | 2022-07-14 14:18:36,083:INFO:superset.sql_lab:Query 60: Running statement 1 out of 1
superset_app | Query 60: <class 'RuntimeError'>
superset_app | Traceback (most recent call last):
superset_app | File "/app/superset/sql_lab.py", line 272, in execute_sql_statement
superset_app | db_engine_spec.execute(cursor, sql, async_=True)
superset_app | File "/app/superset/db_engine_specs/base.py", line 1261, in execute
superset_app | raise cls.get_dbapi_mapped_exception(ex)
superset_app | File "/app/superset/db_engine_specs/base.py", line 1259, in execute
superset_app | cursor.execute(query)
superset_app | File "/usr/local/lib/python3.8/site-packages/duckdb_engine/__init__.py", line 97, in execute
superset_app | raise e
superset_app | File "/usr/local/lib/python3.8/site-packages/duckdb_engine/__init__.py", line 85, in execute
superset_app | self.c.execute(statement)
superset_app | RuntimeError: Catalog Error: unrecognized configuration parameter "s3_region"
superset_app |
superset_app | Did you mean: "TimeZone"
superset_app | 2022-07-14 14:18:36,157:ERROR:superset.sql_lab:Query 60: <class 'RuntimeError'>
superset_app | Traceback (most recent call last):
superset_app | File "/app/superset/sql_lab.py", line 272, in execute_sql_statement
superset_app | db_engine_spec.execute(cursor, sql, async_=True)
superset_app | File "/app/superset/db_engine_specs/base.py", line 1261, in execute
superset_app | raise cls.get_dbapi_mapped_exception(ex)
superset_app | File "/app/superset/db_engine_specs/base.py", line 1259, in execute
superset_app | cursor.execute(query)
superset_app | File "/usr/local/lib/python3.8/site-packages/duckdb_engine/__init__.py", line 97, in execute
superset_app | raise e
superset_app | File "/usr/local/lib/python3.8/site-packages/duckdb_engine/__init__.py", line 85, in execute
superset_app | self.c.execute(statement)
superset_app | RuntimeError: Catalog Error: unrecognized configuration parameter "s3_region"
superset_app |
superset_app | Did you mean: "TimeZone"
superset_app | [SupersetError(message='duckdb error: Catalog Error: unrecognized configuration parameter "s3_region"\n\nDid you mean: "TimeZone"', error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'DuckDB', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]
superset_app | Traceback (most recent call last):
superset_app | File "/app/superset/views/base.py", line 207, in wraps
superset_app | return f(self, *args, **kwargs)
superset_app | File "/app/superset/utils/log.py", line 244, in wrapper
superset_app | value = f(*args, **kwargs)
superset_app | File "/app/superset/views/core.py", line 2412, in sql_json
superset_app | command_result: CommandResult = command.run()
superset_app | File "/app/superset/sqllab/command.py", line 114, in run
superset_app | raise ex
superset_app | File "/app/superset/sqllab/command.py", line 98, in run
superset_app | status = self._run_sql_json_exec_from_scratch()
superset_app | File "/app/superset/sqllab/command.py", line 151, in _run_sql_json_exec_from_scratch
superset_app | raise ex
superset_app | File "/app/superset/sqllab/command.py", line 146, in _run_sql_json_exec_from_scratch
superset_app | return self._sql_json_executor.execute(
superset_app | File "/app/superset/sqllab/sql_json_executer.py", line 111, in execute
superset_app | raise SupersetErrorsException(
superset_app | superset.exceptions.SupersetErrorsException: [SupersetError(message='duckdb error: Catalog Error: unrecognized configuration parameter "s3_region"\n\nDid you mean: "TimeZone"', error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'DuckDB', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]
superset_app | 2022-07-14 14:18:36,229:WARNING:superset.views.base:[SupersetError(message='duckdb error: Catalog Error: unrecognized configuration parameter "s3_region"\n\nDid you mean: "TimeZone"', error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'DuckDB', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]
superset_app | Traceback (most recent call last):
superset_app | File "/app/superset/views/base.py", line 207, in wraps
superset_app | return f(self, *args, **kwargs)
superset_app | File "/app/superset/utils/log.py", line 244, in wrapper
superset_app | value = f(*args, **kwargs)
superset_app | File "/app/superset/views/core.py", line 2412, in sql_json
superset_app | command_result: CommandResult = command.run()
superset_app | File "/app/superset/sqllab/command.py", line 114, in run
superset_app | raise ex
superset_app | File "/app/superset/sqllab/command.py", line 98, in run
superset_app | status = self._run_sql_json_exec_from_scratch()
superset_app | File "/app/superset/sqllab/command.py", line 151, in _run_sql_json_exec_from_scratch
superset_app | raise ex
superset_app | File "/app/superset/sqllab/command.py", line 146, in _run_sql_json_exec_from_scratch
superset_app | return self._sql_json_executor.execute(
superset_app | File "/app/superset/sqllab/sql_json_executer.py", line 111, in execute
superset_app | raise SupersetErrorsException(
superset_app | superset.exceptions.SupersetErrorsException: [SupersetError(message='duckdb error: Catalog Error: unrecognized configuration parameter "s3_region"\n\nDid you mean: "TimeZone"', error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'DuckDB', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]
Issue Analytics
- State:
- Created a year ago
- Comments:19 (7 by maintainers)
Top Results From Across the Web
S3 or GCS Parquet Import - DuckDB
After the HTTPFS extension is set up and the S3 configuration is set correctly, Parquet files can be read from S3 using the...
Read more >S3 Parquet Import - DuckDB
To load a Parquet file from S3, the HTTPFS extension is required. This can be installed use the INSTALL SQL command. This only...
Read more >Querying Parquet with Precision using DuckDB
DuckDB's zero-dependency Parquet reader is able to directly execute SQL queries on Parquet files without any import or analysis step. Because of ...
Read more >DuckDB quacks Arrow: A zero-copy data integration between ...
The datasets may span multiple files in Parquet, CSV, or other formats, and files may even be on remote or cloud storage like...
Read more >Parquet - DuckDB
DuckDB provides support for both reading and writing Parquet files in an ... entire database as parquet EXPORT DATABASE 'target_directory' (FORMAT PARQUET); ...
Read more >
Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free
Top Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found

@Mageswaran1989 , @Mause
I tested with minio,
and it works,
you need check
Allow DMLIf it’s JSON, it probably just needs to be
falseinstead ofFalse