BigQuery: to_arrow() method similar to to_dataframe()
See original GitHub issue

Currently there's a `to_dataframe()` method that returns a pandas DataFrame from a query. DataFrames don't efficiently support array and struct values, but pyarrow provides efficient support for them:
```python
In [11]: import pyarrow as pa

In [13]: a = pa.array(
    ...:     [
    ...:         {'a': [1, 2, 3]}
    ...:     ],
    ...:     pa.struct([pa.field('a', pa.list_(pa.int64()))])
    ...: )

In [14]: a
Out[14]:
<pyarrow.lib.StructArray object at 0x7f55f8764b88>
[
  {'a': [1, 2, 3]}
]
```
A `to_arrow()` method will make it easier to efficiently support more complex types going forward in downstream libraries like ibis.
Issue Analytics
- State:
- Created: 5 years ago
- Reactions: 3
- Comments: 7 (6 by maintainers)
FYI:

```python
results.to_arrow(bqstorage_client=bqstorage_client).to_pandas()
```

is currently the fastest way to get a pandas DataFrame from your query results: about 4 seconds from results to DataFrame for a 125 MB table.

@plamut The remaining task to close out this FR is to add the `bqstorage_client` argument to `QueryJob.to_arrow()`:
https://github.com/googleapis/google-cloud-python/blob/e4cf3f4458d13ce5b0203060dc4249d3e34e80a7/bigquery/google/cloud/bigquery/job.py#L2899
See `RowIterator.to_arrow()`:
https://github.com/googleapis/google-cloud-python/blob/e4cf3f4458d13ce5b0203060dc4249d3e34e80a7/bigquery/google/cloud/bigquery/table.py#L1452
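The delegation this task describes can be sketched with stand-in classes. These are simplified placeholders for illustration only, not the real library code; the method bodies just make the forwarding visible:

```python
# Hypothetical sketch: QueryJob.to_arrow() should forward bqstorage_client
# through to RowIterator.to_arrow(). Class names mirror the real BigQuery
# client classes, but the implementations here are stubs.

class RowIterator:
    def to_arrow(self, bqstorage_client=None):
        # The real method downloads result pages (optionally via the
        # BigQuery Storage API client) and assembles a pyarrow.Table.
        source = "storage-api" if bqstorage_client is not None else "tabledata.list"
        return f"arrow-table-from-{source}"


class QueryJob:
    def __init__(self, row_iterator):
        self._row_iterator = row_iterator

    def to_arrow(self, bqstorage_client=None):
        # Forward the argument unchanged so both entry points behave the same.
        return self._row_iterator.to_arrow(bqstorage_client=bqstorage_client)
```

Keeping `QueryJob.to_arrow()` a thin pass-through to `RowIterator.to_arrow()` means the two entry points cannot drift apart in behavior, which is also why the comment reminder below matters.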
Also, we should add comments before `to_arrow` and `to_dataframe` in both `QueryJob` and `RowIterator` reminding us to add any additional arguments to the other class as well.