BigQuery: to_arrow() method similar to to_dataframe()
See original GitHub issue

Currently there's a `to_dataframe()` method that returns a pandas DataFrame from a query. DataFrames don't efficiently support array and struct values, but pyarrow provides efficient support for them:
```python
In [11]: import pyarrow as pa

In [13]: a = pa.array(
    ...:     [
    ...:         {'a': [1, 2, 3]}
    ...:     ],
    ...:     pa.struct([pa.field('a', pa.list_(pa.int64()))])
    ...: )

In [14]: a
Out[14]:
<pyarrow.lib.StructArray object at 0x7f55f8764b88>
[
  {'a': [1, 2, 3]}
]
```
A `to_arrow()` method will make it easier to efficiently support more complex types going forward in downstream libraries like ibis.
Issue Analytics
- State:
- Created: 5 years ago
- Reactions: 3
- Comments: 7 (6 by maintainers)
FYI:

```python
results.to_arrow(bqstorage_client=bqstorage_client).to_pandas()
```

is currently the fastest way to get a pandas DataFrame from your query results: about 4 seconds from results to DataFrame for a 125 MB table.

@plamut The remaining task to close out this FR is to add the `bqstorage_client` argument to `QueryJob.to_arrow()`:
https://github.com/googleapis/google-cloud-python/blob/e4cf3f4458d13ce5b0203060dc4249d3e34e80a7/bigquery/google/cloud/bigquery/job.py#L2899
See `RowIterator.to_arrow()`:
https://github.com/googleapis/google-cloud-python/blob/e4cf3f4458d13ce5b0203060dc4249d3e34e80a7/bigquery/google/cloud/bigquery/table.py#L1452
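The delegation this task describes can be sketched with stand-in classes. These are simplified placeholders for illustration only, not the real library code; the method bodies just make the forwarding visible:

```python
# Hypothetical sketch: QueryJob.to_arrow() should forward bqstorage_client
# through to RowIterator.to_arrow(). Class names mirror the real BigQuery
# client classes, but the implementations here are stubs.

class RowIterator:
    def to_arrow(self, bqstorage_client=None):
        # The real method downloads result pages (optionally via the
        # BigQuery Storage API client) and assembles a pyarrow.Table.
        source = "storage-api" if bqstorage_client is not None else "tabledata.list"
        return f"arrow-table-from-{source}"


class QueryJob:
    def __init__(self, row_iterator):
        self._row_iterator = row_iterator

    def to_arrow(self, bqstorage_client=None):
        # Forward the argument unchanged so both entry points behave the same.
        return self._row_iterator.to_arrow(bqstorage_client=bqstorage_client)
```

Keeping `QueryJob.to_arrow()` a thin pass-through to `RowIterator.to_arrow()` means the two entry points cannot drift apart in behavior, which is also why the comment reminder below matters.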
Also, we should add comments before `to_arrow` and `to_dataframe` in both `QueryJob` and `RowIterator` reminding us to add any additional arguments to the other class as well.