We’re starting to use BigQuery heavily but are becoming increasingly bottlenecked by the performance of moving moderate amounts of data from BigQuery to Python.

Here are a few stats:

  • 29.1s: Pulling 500k rows with 3 columns of data (with cached data) using pandas-gbq
  • 36.5s: Pulling the same query with google-cloud-bigquery - i.e. client.query(query).to_dataframe()
  • 2.4s: Pulling very similar data - same types, same size, from our existing MSSQL box hosted in AWS (using pd.read_sql). That’s on standard drivers, nothing like turbodbc involved

…so using BigQuery with Python is at least an order of magnitude slower than traditional DBs.
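
The "order of magnitude" claim can be checked with a little arithmetic on the timings quoted above. The sketch below assumes the MSSQL pull also returned 500k rows (the post only says "same size"), and the dictionary keys are just labels chosen for this example.

```python
# Timings in seconds, copied from the stats above.
timings = {
    "pandas-gbq": 29.1,
    "google-cloud-bigquery": 36.5,
    "MSSQL via pd.read_sql": 2.4,
}
ROWS = 500_000  # assumption: every pull returned the same 500k rows

baseline = timings["MSSQL via pd.read_sql"]

# Slowdown relative to the MSSQL pull, and raw throughput per method.
slowdown = {name: seconds / baseline for name, seconds in timings.items()}
rows_per_second = {name: ROWS / seconds for name, seconds in timings.items()}

for name in timings:
    print(f"{name}: {slowdown[name]:.1f}x baseline, "
          f"{rows_per_second[name]:,.0f} rows/s")
```

Under that assumption both BigQuery paths come out between 12x and 15x slower than the MSSQL pull, which matches the "at least an order of magnitude" wording.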

We’ve tried exporting tables to CSV on GCS and reading those in, which works fairly well for data processes, though not for exploration.

A few questions - feel free to jump in with partial replies:

  • Are these results expected, or are we doing something very wrong?
  • My prior is that a lot of this slowdown is caused by pulling in HTTP pages, converting them to Python objects, and then writing those into arrays. Is this approach really scalable? Should pandas-gbq invest resources in getting a format that’s queryable in exploratory workflows and can deal with more reasonable datasets? (Or at least encourage Google to?)
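
To make the second question concrete, here is a minimal sketch of the page-to-rows-to-columns path it describes. It is not the pandas-gbq or google-cloud-bigquery implementation. The page layout (`rows` / `f` / `v`) follows the shape of the BigQuery REST row format, while the column names, sample values, and helper names are made up for illustration.

```python
import json

# Hypothetical schema and page; values arrive as strings in the JSON payload.
schema = [("name", str), ("year", int), ("number", int)]
page_text = json.dumps({
    "rows": [
        {"f": [{"v": "Mary"}, {"v": "1910"}, {"v": "100"}]},
        {"f": [{"v": "John"}, {"v": "1910"}, {"v": None}]},
    ]
})


def rows_from_page(text, schema):
    """Decode one JSON page and convert every cell, one Python object at a time."""
    page = json.loads(text)
    rows = []
    for row in page.get("rows", []):
        converted = tuple(
            None if cell["v"] is None else convert(cell["v"])
            for (_, convert), cell in zip(schema, row["f"])
        )
        rows.append(converted)
    return rows


def columns_from_rows(rows, schema):
    """Pivot the row tuples into per-column lists, as a DataFrame build would."""
    return {name: [row[i] for row in rows] for i, (name, _) in enumerate(schema)}


rows = rows_from_page(page_text, schema)
columns = columns_from_rows(rows, schema)
print(rows)
print(columns)
```

Every cell passes through at least one Python-level call before it reaches an array, so the per-row cost grows linearly with rows times columns regardless of how fast the network is.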

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 12
  • Comments: 45 (20 by maintainers)

Top GitHub Comments

5 reactions
tswast commented, Feb 22, 2019

The BigQuery Storage API was released today (https://cloud.google.com/bigquery/docs/release-notes#february_22_2019). It should make getting data into pandas a lot faster: https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas

I’m thinking of adding a parameter, use_bqstorage=False, to read_gbq() to optionally use this API.
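
As a rough sketch of what opting in could look like: to my knowledge, later pandas-gbq releases expose this switch as `use_bqstorage_api` rather than the `use_bqstorage` name floated above, so treat the parameter name as an assumption and check it against your installed version. The wrapper name `read_gbq_via_storage_api` is hypothetical. Running it needs pandas-gbq, google-cloud-bigquery-storage, network access, and Google Cloud credentials.

```python
def read_gbq_via_storage_api(query, project_id):
    """Run `query` and download the result through the BigQuery Storage API.

    Assumes pandas-gbq and google-cloud-bigquery-storage are installed and
    that default Google Cloud credentials are available.
    """
    # Imported lazily so this sketch loads even without the optional packages.
    import pandas_gbq

    return pandas_gbq.read_gbq(
        query,
        project_id=project_id,
        dialect="standard",
        # Assumed parameter name; the comment above proposed `use_bqstorage`.
        use_bqstorage_api=True,
    )
```

A call such as `read_gbq_via_storage_api("SELECT ...", "my-project")` would return a DataFrame, where `my-project` is a placeholder for a billing project you control.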

4 reactions
tswast commented, Oct 30, 2018

I decided to do some profiling today to see where all the time is spent, following this Python profiling guide.

pandas_gbq_bench.py:

import pandas_gbq
pandas_gbq.read_gbq(
    "SELECT * FROM `bigquery-public-data.usa_names.usa_1910_2013`",
    dialect='standard')

$ python -m cProfile -o pandas_gbq.cprof pandas_gbq_bench.py
$ pyprof2calltree -k -i pandas_gbq.cprof


$ head -n 100 pandas_gbq_bench.txt
         253357679 function calls (253138077 primitive calls) in 328.945 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    766/1    0.015    0.000  328.947  328.947 {built-in method builtins.exec}
        1    1.400    1.400  328.947  328.947 pandas_gbq_bench.py:2(<module>)
        1    0.000    0.000  327.030  327.030 gbq.py:622(read_gbq)
        1    0.791    0.791  291.680  291.680 gbq.py:327(run_query)
  5552453    2.724    0.000  290.075    0.000 page_iterator.py:197(_items_iter)
       62    0.001    0.000  222.242    3.585 client.py:332(_call_api)
    64/60    0.001    0.000  222.241    3.704 retry.py:249(retry_wrapped_func)
    64/60    0.077    0.001  222.240    3.704 retry.py:140(retry_target)
       62    0.001    0.000  222.160    3.583 _http.py:214(api_request)
       57    0.000    0.000  218.804    3.839 page_iterator.py:218(_page_iter)
       57    0.001    0.000  218.804    3.839 page_iterator.py:341(_next_page)
       56    0.001    0.000  218.803    3.907 table.py:1127(_get_next_page_response)
       62    0.000    0.000  206.398    3.329 _http.py:142(_make_request)
       62    0.001    0.000  206.398    3.329 _http.py:185(_do_request)
       62    0.001    0.000  206.396    3.329 requests.py:181(request)
       63    0.001    0.000  206.368    3.276 sessions.py:441(request)
       63    0.003    0.000  205.819    3.267 sessions.py:589(send)
    42157    0.130    0.000  198.129    0.005 socket.py:572(readinto)
    42157    0.110    0.000  197.840    0.005 ssl.py:998(recv_into)
    42157    0.081    0.000  197.700    0.005 ssl.py:863(read)
    42157    0.056    0.000  197.612    0.005 ssl.py:624(read)
    42157  197.556    0.005  197.556    0.005 {method 'read' of '_ssl._SSLSocket' objects}
   102716    0.210    0.000  194.229    0.002 {method 'readline' of '_io.BufferedReader' objects}
       63    0.001    0.000  189.560    3.009 adapters.py:388(send)
       63    0.002    0.000  189.512    3.008 connectionpool.py:447(urlopen)
       63    0.002    0.000  189.476    3.008 connectionpool.py:322(_make_request)
       63    0.001    0.000  188.468    2.992 client.py:1287(getresponse)
       63    0.002    0.000  188.465    2.992 client.py:290(begin)
       63    0.003    0.000  188.380    2.990 client.py:257(_read_status)
  5552508    4.559    0.000   68.547    0.000 page_iterator.py:122(next)
  5552452    6.358    0.000   59.783    0.000 table.py:1321(_item_to_row)
  5552452   32.403    0.000   50.283    0.000 _helpers.py:197(_row_tuple_from_json)
        1    0.291    0.291   32.305   32.305 gbq.py:603(_parse_data)
        1    0.853    0.853   30.564   30.564 frame.py:334(__init__)
        1    0.000    0.000   26.148   26.148 frame.py:7453(_to_arrays)
        1    8.938    8.938   23.952   23.952 __init__.py:130(lmap)
      250    0.001    0.000   16.248    0.065 models.py:810(content)
202383/638    0.702    0.000   16.247    0.025 {method 'join' of 'bytes' objects}
       62    0.071    0.001   15.755    0.254 models.py:868(json)
    57160    0.034    0.000   15.588    0.000 models.py:741(generate)
    57160    0.035    0.000   15.554    0.000 response.py:415(stream)
    57160    0.296    0.000   15.519    0.000 response.py:571(read_chunked)
       67    0.001    0.000   15.042    0.225 __init__.py:302(loads)
       67    0.001    0.000   15.041    0.224 decoder.py:334(decode)
       67   15.039    0.224   15.039    0.224 decoder.py:345(raw_decode)
 33314712    9.321    0.000   13.310    0.000 table.py:1073(__getitem__)
 11104904    6.516    0.000    7.641    0.000 _helpers.py:38(_int_from_json)
   101190    0.280    0.000    6.193    0.000 response.py:535(_update_chunk_length)
   101127    0.249    0.000    5.511    0.000 response.py:549(_handle_chunk)
   201745    0.484    0.000    5.263    0.000 client.py:596(_safe_read)
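
The comment does not show how the text listing was produced from the `.cprof` file. One way to get a listing in this format is the standard-library `pstats` module. The sketch below profiles a small stand-in workload instead of the BigQuery download so it runs anywhere; `workload` and `profile_to_text` are hypothetical names.

```python
import cProfile
import io
import os
import pstats
import tempfile


def workload():
    """Stand-in for the real download: cheap per-item conversions in a loop."""
    return sum(int(str(i)) for i in range(50_000))


def profile_to_text(func, limit=20):
    """Profile `func`, then render the top `limit` entries by cumulative time."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "bench.cprof")
        # Equivalent in spirit to `python -m cProfile -o bench.cprof script.py`.
        cProfile.runctx("func()", {"func": func}, {}, filename=path)
        buffer = io.StringIO()
        stats = pstats.Stats(path, stream=buffer)
        stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


report = profile_to_text(workload)
print(report)
```

Sorting by cumulative time puts the outermost callers first, which is why the listing above starts with `exec` and `read_gbq` and only reaches the socket reads further down.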

We see very similar results from google-cloud-bigquery to_dataframe:

from google.cloud import bigquery
client = bigquery.Client()

table_ref = bigquery.TableReference.from_string(
    'bigquery-public-data.usa_names.usa_1910_2013')
table = client.get_table(table_ref)
rows = client.list_rows(table)
rows.to_dataframe()


Most of the time is spent actually waiting on the BigQuery API. JSON parsing is a much smaller fraction.
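
That reading can be checked against the pandas-gbq listing above. The numbers below are copied from it. The four buckets were picked so that they do not overlap with each other, but they are not a complete partition of the run, and the bucket labels are my own.

```python
TOTAL_SECONDS = 328.945  # total run time reported by cProfile

# Seconds per bucket, taken from the profile listing above.
buckets = {
    "waiting on the API (_ssl._SSLSocket read, tottime)": 197.556,
    "row conversion (_item_to_row, cumtime)": 59.783,
    "DataFrame construction (_parse_data, cumtime)": 32.305,
    "JSON decoding (raw_decode, tottime)": 15.039,
}

# Share of the whole run spent in each bucket, in percent.
shares = {name: seconds / TOTAL_SECONDS * 100 for name, seconds in buckets.items()}
unaccounted = 100 - sum(shares.values())

for name, share in shares.items():
    print(f"{share:5.1f}%  {name}")
print(f"{unaccounted:5.1f}%  everything else")
```

Roughly 60% of the run is blocked on socket reads, while raw JSON decoding is under 5%. Converting rows to Python tuples and building the DataFrame together cost more than the decoding itself.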
