question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

closing connection chunks in DbApiHook.get_pandas_df

See original GitHub issue

Apache Airflow version

2.2.5 (latest released)

What happened

Hi all, Please be patient with me, it’s my first Bugreport in git at all 😃

Affected function: DbApiHook.get_pandas_df

Short description: If I use DbApiHook.get_pandas_df with parameter “chunksize” the connection is lost

Error description I tried using the DbApiHook.get_pandas_df function instead of pandas.read_sql. Without the parameter “chunksize” both functions work the same. But as soon as I add the parameter chunksize to get_pandas_df, I lose the connection in the first iteration. This happens both when querying Oracle and Mysql (Mariadb) databases.

During my research I found a comment on a closed issue that describes the same -> #8468

My Airflow version: 2.2.5

I think it’s something to do with the “with closing” argument, because when I remove that argument, the chunksize argument was working.

def get_pandas_df(self, sql, parameters=None, **kwargs):
        """
        Executes the sql and returns a pandas dataframe
        :param sql: the sql statement to be executed (str) or a list of
            sql statements to execute
        :param parameters: The parameters to render the SQL query with.
        :param kwargs: (optional) passed into pandas.io.sql.read_sql method
        """
        try:
            from pandas.io import sql as psql
        except ImportError:
            raise Exception("pandas library not installed, run: pip install 'apache-airflow[pandas]'.")
       # Not working
        with closing(self.get_conn()) as conn:
                return psql.read_sql(sql, con=conn, params=parameters, **kwargs)
       # would working
       # return psql.read_sql(sql, con=conn, params=parameters, **kwargs)_

What you think should happen instead

It should give me a chunk of DataFrame

How to reproduce

not working

src_hook = OracleHook(oracle_conn_id='oracle_source_conn_id')
query = "select * from example_table" 
for chunk in src_hook.get_pandas_df(query,chunksize=2):
    print(chunk.head())

works

for chunk in src_hook.get_pandas_df(query):
    print(chunk.head())

works

for chunk in pandas.read_sql(query,src_hook.get_conn(),chunksize=2):
    print(chunk.head())

Operating System

MacOS Monetäre

Versions of Apache Airflow Providers

apache-airflow 2.2.5 apache-airflow-providers-ftp 2.1.2 apache-airflow-providers-http 2.1.2 apache-airflow-providers-imap 2.2.3 apache-airflow-providers-microsoft-mssql 2.1.3 apache-airflow-providers-mongo 2.3.3 apache-airflow-providers-mysql 2.2.3 apache-airflow-providers-oracle 2.2.3 apache-airflow-providers-salesforce 3.4.3 apache-airflow-providers-sftp 2.5.2 apache-airflow-providers-sqlite 2.1.3 apache-airflow-providers-ssh 2.4.3

Deployment

Virtualenv installation

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Issue Analytics

  • State:closed
  • Created a year ago
  • Comments:5 (4 by maintainers)

github_iconTop GitHub Comments

1reaction
potiukcommented, Apr 20, 2022

assigned you

0reactions
hubert-pietroncommented, Apr 20, 2022

If @bauerfranz is not interested in fixing this bug i would want to work on this

Read more comments on GitHub >

github_iconTop Results From Across the Web

DbApiHook: add chunksize to get_pandas_df parameters ...
There are a problem with the chunksize, because the get_pandas_df close the connection in the first call of the iterator[DataFrame] returned ...
Read more >
Reading a SQL table by chunks with Pandas
In the following export_csv function, we create a connection, read the data by chunks with read_sql() and append the rows to a CSV...
Read more >
Airflow Documentation - Read the Docs
Airflow is a platform to programmatically author, schedule and monitor workflows. Use airflow to author workflows as directed acyclic graphs ...
Read more >
Get pymysql.connect cursor.execute() chunks in Python
I want to get the chunks from the SQL query as I did in the code above using pymysql: Conn = pymysql.connect(user=Config[0], ...
Read more >
sql-in-rmarkdown - Sciencificity's Blog - Netlify
Write data to a SQLite DB; Connect to the Database; Code Chunks ... I closed the connection here since the tasks of writing...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found