Response bodies tables too big to load into BigQuery in one go (15 TB)

The September 2020 crawl completed successfully except for the response_bodies table for mobile. Inspecting the Dataflow logs shows this as the error:

Error while reading table: 2020_09_01_mobile_5b8f67b4_219e_41c5_9877_94ee0290cc64_source, error message: Total JSON data size exceeds max allowed size. Total size is at least: 16556745870240. Max allowed size is: 16492674416640.

16492674416640 bytes is 15 TB, the maximum size per BigQuery load job. The corresponding September 2020 desktop table weighs in at 10.82 TB and the August 2020 mobile table is 14.47 TB, so it’s plausible that the mobile table finally exceeded 15 TB this month. The underlying CrUX dataset is continuing to grow, so this is another one of the stresses on the data pipeline capacity.
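
As a sanity check, converting the two byte counts from the error into binary terabytes shows the load came in roughly 60 GB over the cap. A quick sketch, with the constants copied straight from the error message above:

```python
# Byte counts from the Dataflow error, converted to binary terabytes.
TIB = 1024 ** 4  # 1,099,511,627,776 bytes

max_allowed = 16492674416640  # "Max allowed size" -> exactly 15 TiB
actual_size = 16556745870240  # "Total size is at least"

print(max_allowed / TIB)                        # 15.0
print(actual_size / TIB)                        # ~15.06, matching the table below
print((actual_size - max_allowed) / 1024 ** 3)  # ~59.7 GB over the limit
```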

| Table | Rows | Size (TB) |
| --- | ---: | ---: |
| 2020_08_01_desktop | 215,621,667 | 10.99 |
| 2020_08_01_mobile | 270,249,686 | 14.47 |
| 2020_09_01_desktop | 216,083,365 | 10.82 |
| 2020_09_01_mobile | 291,589,220 | 15.06 ❌ |
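
For reference, here's a minimal sketch of how a comparison like the table above can be pulled from table metadata, assuming the public httparchive.response_bodies.* naming and an authenticated google-cloud-bigquery client. The 15.06 TB figure for the failed 2020_09_01_mobile load presumably comes from the error message rather than an existing table, so it's left out of the loop:

```python
# Print row counts and sizes for the response_bodies tables (metadata only, no query cost).
from google.cloud import bigquery

TIB = 1024 ** 4
LOAD_JOB_LIMIT_TB = 15  # maximum size per BigQuery load job

client = bigquery.Client()
for table_id in [
    "httparchive.response_bodies.2020_08_01_desktop",
    "httparchive.response_bodies.2020_08_01_mobile",
    "httparchive.response_bodies.2020_09_01_desktop",
]:
    t = client.get_table(table_id)
    size_tb = t.num_bytes / TIB
    flag = "❌" if size_tb > LOAD_JOB_LIMIT_TB else ""
    print(f"{table_id}  rows={t.num_rows:,}  size={size_tb:.2f} TB {flag}")
```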

Here’s a look at how the response body sizes were distributed in 2020_08_01_mobile:

| MB | Requests | Cumulative weight (GB) | Cumulative requests |
| ---: | ---: | ---: | ---: |
| 90 | 1 | 0.09 | 1 |
| 87 | 1 | 0.17 | 2 |
| 86 | 1 | 0.26 | 3 |
| 84 | 1 | 0.34 | 4 |
| 83 | 1 | 0.42 | 5 |
| 78 | 1 | 0.50 | 6 |
| 75 | 1 | 0.57 | 7 |
| 74 | 1 | 0.64 | 8 |
| 68 | 2 | 0.77 | 10 |
| 62 | 1 | 0.83 | 11 |
| 61 | 2 | 0.95 | 13 |
| 59 | 1 | 1.01 | 14 |
| 56 | 1 | 1.07 | 15 |
| 55 | 2 | 1.17 | 17 |
| 54 | 1 | 1.23 | 18 |
| 50 | 2 | 1.32 | 20 |
| 48 | 3 | 1.46 | 23 |
| 47 | 3 | 1.60 | 26 |
| 45 | 1 | 1.65 | 27 |
| 44 | 5 | 1.86 | 32 |
| 43 | 1 | 1.90 | 33 |
| 41 | 2 | 1.98 | 35 |
| 40 | 2 | 2.06 | 37 |
| 38 | 2 | 2.14 | 39 |
| 37 | 4 | 2.28 | 43 |
| 36 | 5 | 2.46 | 48 |
| 35 | 6 | 2.66 | 54 |
| 34 | 1 | 2.69 | 55 |
| 32 | 2 | 2.76 | 57 |
| 31 | 2 | 2.82 | 59 |
| 30 | 3 | 2.91 | 62 |
| 29 | 3 | 2.99 | 65 |
| 28 | 6 | 3.15 | 71 |
| 27 | 2 | 3.21 | 73 |
| 26 | 7 | 3.38 | 80 |
| 25 | 4 | 3.48 | 84 |
| 24 | 6 | 3.62 | 90 |
| 23 | 25 | 4.18 | 115 |
| 22 | 15 | 4.51 | 130 |
| 21 | 32 | 5.16 | 162 |
| 20 | 136 | 7.82 | 298 |
| 19 | 35 | 8.47 | 333 |
| 18 | 44 | 9.24 | 377 |
| 17 | 50 | 10.07 | 427 |
| 16 | 62 | 11.04 | 489 |
| 15 | 79 | 12.20 | 568 |
| 14 | 140 | 14.11 | 708 |
| 13 | 166 | 16.22 | 874 |
| 12 | 287 | 19.58 | 1,161 |
| 11 | 276 | 22.55 | 1,437 |
| 10 | 536 | 27.78 | 1,973 |
| 9 | 687 | 33.82 | 2,660 |
| 8 | 1,387 | 44.66 | 4,047 |
| 7 | 1,876 | 57.48 | 5,923 |
| 6 | 2,537 | 72.35 | 8,460 |
| 5 | 6,896 | 106.02 | 15,356 |
| 4 | 9,898 | 144.68 | 25,254 |
| 3 | 30,386 | 233.70 | 55,640 |
| 2 | 103,328 | 435.52 | 158,968 |
| 1 | 1,438,555 | 1840.35 | 1,597,523 |
| 0 | 268,652,163 | 1840.35 | 270,249,686 |
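
I'm not claiming this is how the breakdown above was produced, but a query along these lines should give an equivalent distribution, assuming the table exposes the raw body column. The cumulative weight column appears to count each request at its whole-MB bucket size (the 0 MB bucket adds nothing), so the sketch does the same. Note that it scans the full ~14.5 TB of body data, so it isn't cheap to run:

```python
# Bucket response bodies by size in whole MB and accumulate weight/requests
# from the largest bucket down, mirroring the table above.
from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT
  CAST(FLOOR(BYTE_LENGTH(body) / (1024 * 1024)) AS INT64) AS mb,
  COUNT(0) AS requests
FROM `httparchive.response_bodies.2020_08_01_mobile`
GROUP BY mb
ORDER BY mb DESC
"""

cumulative_mb = 0
cumulative_requests = 0
for row in client.query(query).result():
    cumulative_mb += row.mb * row.requests  # weight each request by its whole-MB bucket
    cumulative_requests += row.requests
    print(f"{row.mb}\t{row.requests:,}\t"
          f"{cumulative_mb / 1024:.2f}\t{cumulative_requests:,}")
```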

After some hand-wavy math, I think what this is telling me is that if we reduce the per-row limit from 100 MB back down to 2 MB, we can save up to 606 GB, assuming each row also carries an average of 100 bytes of bookkeeping (page, url, truncated, requestid). That should be enough headroom to offset the dataset growth and get us under the limit.
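
For illustration, here's a rough sketch of what the 2 MB cap could look like in the step that builds response_bodies rows. It is not the actual bigquery_import.py code, just an outline using the bookkeeping fields named above:

```python
# Hypothetical row builder: truncate oversized bodies and flag them.
MAX_CONTENT_SIZE = 2 * 1024 * 1024  # proposed 2 MB cap, down from 100 MB


def build_response_body_row(page, url, requestid, body):
    truncated = len(body) > MAX_CONTENT_SIZE
    return {
        'page': page,
        'url': url,
        'requestid': requestid,
        'truncated': truncated,
        'body': body[:MAX_CONTENT_SIZE] if truncated else body,
    }
```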

I’ll rerun the Dataflow job with a 2 MB limit and report back on whether it works.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 13 (10 by maintainers)

Top GitHub Comments

paulcalvano commented on Mar 17, 2021 (1 reaction)

Last year the response bodies processing was commented out because the dataset size exceeded 15 TB and was causing issues with the Dataflow pipeline. You can see where it was disabled here: https://github.com/HTTPArchive/bigquery/blob/master/dataflow/python/bigquery_import.py#L331

The HAR files still exist in GCS, so once this is resolved we could backfill the data.
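
As a rough sketch of the backfill prep, something like the following could inventory the archived HAR files in GCS; the bucket name, prefix, and file suffix below are placeholders rather than the real locations:

```python
# Count and size the HAR files available for a backfill (placeholder paths).
from google.cloud import storage

client = storage.Client()
count = 0
total_bytes = 0
for blob in client.list_blobs("example-har-bucket", prefix="crawls/2020_09_01_mobile/"):
    if blob.name.endswith(".har.gz"):  # suffix is an assumption
        count += 1
        total_bytes += blob.size
print(f"{count:,} HAR files, {total_bytes / 1024 ** 4:.2f} TB to backfill")
```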
