question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Unable to read S3 parquet files using DuckDB as Database Connector

See original GitHub issue

I was following the DuckDB setup as per this PR and was able to load the DB file and create charts.

As a next step I wanted to load S3 parquet files in to Superset using DuckDB in memory option duckdb:///:memory:

Before trying with Superset, I used below Python code to check the DuckDB S3 parquet loading and found it to be working:

import duckdb
cursor = duckdb.connect()
cursor.execute("INSTALL httpfs;")
cursor.execute("LOAD httpfs;")
cursor.execute("SET s3_region='******'")
cursor.execute("SET s3_access_key_id=''**************")
cursor.execute("SET s3_secret_access_key='*****************************'")
cursor.execute("PRAGMA enable_profiling;")
cursor.execute("SELECT count(*) FROM read_parquet('s3://<bucket>/prefix/*.parquet'")
# pip install duckdb-engine
import sqlalchemy as sa


engine = sa.create_engine("duckdb:///:memory:")

with engine.connect() as conn:
    conn.execute(
        f"""
            INSTALL httpfs;

            LOAD httpfs;

            SET s3_region='ap-south-1';
            SET s3_access_key_id='****************';
            SET s3_secret_access_key='*****************************';
        """
    )
    r = conn.execute(
        """
           SELECT count(*) FROM read_parquet('s3://<bucket>/prefix/*.parquet');
        """
    ).fetchall()
    for row in r:
      print(dict(row))

When I tried to SET the S3 environment values in SQL Editor I was getting the below error on the UI:

Error:

DuckDB Error
duckdb error: Catalog Error: unrecognized configuration parameter "s3_region"

Did you mean: "TimeZone"


This may be triggered by:
Issue 1002 - The database returned an unexpected error. 

See less

Full Trace:


superset_cache          | 1:M 14 Jul 2022 14:18:00.250 * Background saving terminated with success
superset_app            | 127.0.0.1 - - [14/Jul/2022:14:18:24 +0000] "GET /health HTTP/1.1" 200 2 "-" "curl/7.74.0"
superset_app            | 172.20.0.1 - - [14/Jul/2022:14:18:34 +0000] "GET /api/v1/me/ HTTP/1.1" 200 150 "http://localhost:8088/superset/sqllab/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
superset_app            | 172.20.0.1 - - [14/Jul/2022:14:18:34 +0000] "GET /api/v1/me/ HTTP/1.1" 200 150 "http://localhost:8088/superset/sqllab/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
superset_app            | Triggering query_id: 60
superset_app            | 2022-07-14 14:18:35,809:INFO:superset.sqllab.command:Triggering query_id: 60
superset_app            | Query 60: Executing 1 statement(s)
superset_app            | 2022-07-14 14:18:35,942:INFO:superset.sql_lab:Query 60: Executing 1 statement(s)
superset_app            | Query 60: Set query to 'running'
superset_app            | 2022-07-14 14:18:35,943:INFO:superset.sql_lab:Query 60: Set query to 'running'
superset_app            | Query 60: Running statement 1 out of 1
superset_app            | 2022-07-14 14:18:36,083:INFO:superset.sql_lab:Query 60: Running statement 1 out of 1
superset_app            | Query 60: <class 'RuntimeError'>
superset_app            | Traceback (most recent call last):
superset_app            |   File "/app/superset/sql_lab.py", line 272, in execute_sql_statement
superset_app            |     db_engine_spec.execute(cursor, sql, async_=True)
superset_app            |   File "/app/superset/db_engine_specs/base.py", line 1261, in execute
superset_app            |     raise cls.get_dbapi_mapped_exception(ex)
superset_app            |   File "/app/superset/db_engine_specs/base.py", line 1259, in execute
superset_app            |     cursor.execute(query)
superset_app            |   File "/usr/local/lib/python3.8/site-packages/duckdb_engine/__init__.py", line 97, in execute
superset_app            |     raise e
superset_app            |   File "/usr/local/lib/python3.8/site-packages/duckdb_engine/__init__.py", line 85, in execute
superset_app            |     self.c.execute(statement)
superset_app            | RuntimeError: Catalog Error: unrecognized configuration parameter "s3_region"
superset_app            | 
superset_app            | Did you mean: "TimeZone"
superset_app            | 2022-07-14 14:18:36,157:ERROR:superset.sql_lab:Query 60: <class 'RuntimeError'>
superset_app            | Traceback (most recent call last):
superset_app            |   File "/app/superset/sql_lab.py", line 272, in execute_sql_statement
superset_app            |     db_engine_spec.execute(cursor, sql, async_=True)
superset_app            |   File "/app/superset/db_engine_specs/base.py", line 1261, in execute
superset_app            |     raise cls.get_dbapi_mapped_exception(ex)
superset_app            |   File "/app/superset/db_engine_specs/base.py", line 1259, in execute
superset_app            |     cursor.execute(query)
superset_app            |   File "/usr/local/lib/python3.8/site-packages/duckdb_engine/__init__.py", line 97, in execute
superset_app            |     raise e
superset_app            |   File "/usr/local/lib/python3.8/site-packages/duckdb_engine/__init__.py", line 85, in execute
superset_app            |     self.c.execute(statement)
superset_app            | RuntimeError: Catalog Error: unrecognized configuration parameter "s3_region"
superset_app            | 
superset_app            | Did you mean: "TimeZone"
superset_app            | [SupersetError(message='duckdb error: Catalog Error: unrecognized configuration parameter "s3_region"\n\nDid you mean: "TimeZone"', error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'DuckDB', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]
superset_app            | Traceback (most recent call last):
superset_app            |   File "/app/superset/views/base.py", line 207, in wraps
superset_app            |     return f(self, *args, **kwargs)
superset_app            |   File "/app/superset/utils/log.py", line 244, in wrapper
superset_app            |     value = f(*args, **kwargs)
superset_app            |   File "/app/superset/views/core.py", line 2412, in sql_json
superset_app            |     command_result: CommandResult = command.run()
superset_app            |   File "/app/superset/sqllab/command.py", line 114, in run
superset_app            |     raise ex
superset_app            |   File "/app/superset/sqllab/command.py", line 98, in run
superset_app            |     status = self._run_sql_json_exec_from_scratch()
superset_app            |   File "/app/superset/sqllab/command.py", line 151, in _run_sql_json_exec_from_scratch
superset_app            |     raise ex
superset_app            |   File "/app/superset/sqllab/command.py", line 146, in _run_sql_json_exec_from_scratch
superset_app            |     return self._sql_json_executor.execute(
superset_app            |   File "/app/superset/sqllab/sql_json_executer.py", line 111, in execute
superset_app            |     raise SupersetErrorsException(
superset_app            | superset.exceptions.SupersetErrorsException: [SupersetError(message='duckdb error: Catalog Error: unrecognized configuration parameter "s3_region"\n\nDid you mean: "TimeZone"', error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'DuckDB', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]
superset_app            | 2022-07-14 14:18:36,229:WARNING:superset.views.base:[SupersetError(message='duckdb error: Catalog Error: unrecognized configuration parameter "s3_region"\n\nDid you mean: "TimeZone"', error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'DuckDB', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]
superset_app            | Traceback (most recent call last):
superset_app            |   File "/app/superset/views/base.py", line 207, in wraps
superset_app            |     return f(self, *args, **kwargs)
superset_app            |   File "/app/superset/utils/log.py", line 244, in wrapper
superset_app            |     value = f(*args, **kwargs)
superset_app            |   File "/app/superset/views/core.py", line 2412, in sql_json
superset_app            |     command_result: CommandResult = command.run()
superset_app            |   File "/app/superset/sqllab/command.py", line 114, in run
superset_app            |     raise ex
superset_app            |   File "/app/superset/sqllab/command.py", line 98, in run
superset_app            |     status = self._run_sql_json_exec_from_scratch()
superset_app            |   File "/app/superset/sqllab/command.py", line 151, in _run_sql_json_exec_from_scratch
superset_app            |     raise ex
superset_app            |   File "/app/superset/sqllab/command.py", line 146, in _run_sql_json_exec_from_scratch
superset_app            |     return self._sql_json_executor.execute(
superset_app            |   File "/app/superset/sqllab/sql_json_executer.py", line 111, in execute
superset_app            |     raise SupersetErrorsException(
superset_app            | superset.exceptions.SupersetErrorsException: [SupersetError(message='duckdb error: Catalog Error: unrecognized configuration parameter "s3_region"\n\nDid you mean: "TimeZone"', error_type=<SupersetErrorType.GENERIC_DB_ENGINE_ERROR: 'GENERIC_DB_ENGINE_ERROR'>, level=<ErrorLevel.ERROR: 'error'>, extra={'engine_name': 'DuckDB', 'issue_codes': [{'code': 1002, 'message': 'Issue 1002 - The database returned an unexpected error.'}]})]

Issue Analytics

  • State:open
  • Created a year ago
  • Comments:19 (7 by maintainers)

github_iconTop GitHub Comments

2reactions
alitrackcommented, Aug 11, 2022

@Mageswaran1989 , @Mause

I tested with minio,

/usr/local/opt/minio/bin/minio server --config-dir=/usr/local/etc/minio --address=:9900 /usr/local/var/minio

and it works,

install 'httpfs';
load 'httpfs';
SET s3_endpoint='127.0.0.1:9900';
SET s3_access_key_id='minioadmin';
SET s3_secret_access_key='minioadmin';
SET s3_url_style = 'path';
SET s3_use_ssl=false;
select count(*) from 's3://ontime/*.parquet';
image

you need check Allow DML

image
1reaction
Mausecommented, Aug 23, 2022

attention, superset config does not support boolean, must be string

If it’s JSON, it probably just needs to be false instead of False

Read more comments on GitHub >

github_iconTop Results From Across the Web

S3 or GCS Parquet Import - DuckDB
After the HTTPFS extension is set up and the S3 configuration is set correctly, Parquet files can be read from S3 using the...
Read more >
S3 Parquet Import - DuckDB
To load a Parquet file from S3, the HTTPFS extension is required. This can be installed use the INSTALL SQL command. This only...
Read more >
Querying Parquet with Precision using DuckDB
DuckDB's zero-dependency Parquet reader is able to directly execute SQL queries on Parquet files without any import or analysis step. Because of ...
Read more >
DuckDB quacks Arrow: A zero-copy data integration between ...
The datasets may span multiple files in Parquet, CSV, or other formats, and files may even be on remote or cloud storage like...
Read more >
Parquet - DuckDB
DuckDB provides support for both reading and writing Parquet files in an ... entire database as parquet EXPORT DATABASE 'target_directory' (FORMAT PARQUET); ...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found