BigQuery: to_arrow() method similar to to_dataframe()


Currently there’s a to_dataframe() method that returns a pandas DataFrame from a query. DataFrames don’t efficiently support array and struct values, but pyarrow provides efficient support for them:

In [11]: import pyarrow as pa

In [13]: a = pa.array(
    ...:     [
    ...:         {'a': [1, 2, 3]}
    ...:     ],
    ...:     pa.struct([pa.field('a', pa.list_(pa.int64()))])
    ...: )

In [14]: a
Out[14]: 
<pyarrow.lib.StructArray object at 0x7f55f8764b88>
[
  {'a': [1, 2, 3]}
]

A to_arrow() method will make it easier to efficiently support more complex types going forward in downstream libraries like ibis.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 3
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

tswast commented, Jul 12, 2019 (2 reactions)

FYI: results.to_arrow(bqstorage_client=bqstorage_client).to_pandas() is currently the fastest way to get a pandas DataFrame from your query results. About 4 seconds from results to DataFrame for a 125 MB table.
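The one-liner above can be wrapped in a small helper. This is a minimal sketch, not the library's own code: `query_to_dataframe` and `demo` are hypothetical names, the `SELECT 1 AS x` query is a placeholder, and `demo` assumes a recent google-cloud-bigquery-storage release that exposes `BigQueryReadClient` (releases from the time of this comment used a different client class).

```python
def query_to_dataframe(client, sql, bqstorage_client=None):
    """Run `sql` and return the results as a pandas DataFrame via Arrow.

    `client` is expected to behave like google.cloud.bigquery.Client.
    `bqstorage_client` is optional; when given, it is passed through to
    to_arrow() so the download can use the BigQuery Storage API.
    """
    results = client.query(sql).result()  # iterator over the query's rows
    arrow_table = results.to_arrow(bqstorage_client=bqstorage_client)
    return arrow_table.to_pandas()


def demo():
    """Hypothetical usage; needs credentials and the Google Cloud packages."""
    # Imported lazily so the helper above can be loaded without these packages.
    from google.cloud import bigquery, bigquery_storage

    df = query_to_dataframe(
        bigquery.Client(),
        "SELECT 1 AS x",  # placeholder query
        bqstorage_client=bigquery_storage.BigQueryReadClient(),
    )
    print(df.dtypes)
```

Calling `demo()` requires google-cloud-bigquery, google-cloud-bigquery-storage, pyarrow, and pandas to be installed and default credentials to be configured.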

tswast commented, Jul 16, 2019 (0 reactions)

@plamut Remaining task to close out this FR is to add the bqstorage_client argument to QueryJob.to_arrow()

https://github.com/googleapis/google-cloud-python/blob/e4cf3f4458d13ce5b0203060dc4249d3e34e80a7/bigquery/google/cloud/bigquery/job.py#L2899

See RowIterator.to_arrow():

https://github.com/googleapis/google-cloud-python/blob/e4cf3f4458d13ce5b0203060dc4249d3e34e80a7/bigquery/google/cloud/bigquery/table.py#L1452

Also, we should add comments before to_arrow and to_dataframe in both QueryJob and RowIterator reminding us to add any additional arguments in the other class.
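Besides reminder comments, the parity between the two classes could also be checked mechanically. The sketch below is only an illustration of that idea using the standard library's `inspect.signature`; `RowIteratorSketch` and `QueryJobSketch` are hypothetical stand-ins with made-up signatures, not the real BigQuery classes.

```python
import inspect


def keyword_params(func):
    """Return the parameter names a callable accepts, excluding `self`."""
    return {name for name in inspect.signature(func).parameters if name != "self"}


def missing_params(source, target):
    """Return the names accepted by `source` but not by `target`."""
    return keyword_params(source) - keyword_params(target)


# Hypothetical stand-ins; the signatures are illustrative only.
class RowIteratorSketch:
    def to_arrow(self, progress_bar_type=None, bqstorage_client=None):
        ...


class QueryJobSketch:
    def to_arrow(self, progress_bar_type=None):
        ...


# Reports the argument that still has to be added to the second class.
print(missing_params(RowIteratorSketch.to_arrow, QueryJobSketch.to_arrow))
# → {'bqstorage_client'}
```

A unit test asserting that `missing_params` returns an empty set in both directions would fail as soon as one method gained an argument the other lacks.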
