Response bodies tables too big to load into BigQuery in one go (15 TB)
The September 2020 crawl completed successfully except for the response_bodies
table for mobile. Inspecting the Dataflow logs shows this as the error:
Error while reading table: 2020_09_01_mobile_5b8f67b4_219e_41c5_9877_94ee0290cc64_source, error message: Total JSON data size exceeds max allowed size. Total size is at least: 16556745870240. Max allowed size is: 16492674416640.
16,492,674,416,640 bytes is 15 TB (15 × 2⁴⁰ bytes), the maximum size per BigQuery load job. The corresponding September 2020 desktop table weighs in at 10.82 TB and the August 2020 mobile table at 14.47 TB, so it’s plausible that the mobile table finally crossed the 15 TB mark this month. The underlying CrUX dataset also continues to grow, so this is another strain on the data pipeline’s capacity.
Table | Rows | Bytes (TB) |
---|---|---|
2020_08_01_desktop | 215,621,667 | 10.99 |
2020_08_01_mobile | 270,249,686 | 14.47 |
2020_09_01_desktop | 216,083,365 | 10.82 |
2020_09_01_mobile | 291,589,220 | 15.06 ❌ |
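For monitoring this going forward, the per-table sizes can be pulled from BigQuery's `__TABLES__` metadata and compared against the load-job limit. A minimal sketch, assuming the `httparchive.response_bodies` dataset and the google-cloud-bigquery client (the check itself is mine, not part of the pipeline):

```python
# Rough sketch: flag response_bodies tables approaching the 15 TB load-job limit.
# Assumes google-cloud-bigquery is installed and default credentials are configured.
from google.cloud import bigquery

BQ_LOAD_LIMIT_BYTES = 15 * 1024 ** 4  # 16,492,674,416,640 bytes, per the error above

client = bigquery.Client()
sql = """
    SELECT table_id, row_count, size_bytes
    FROM `httparchive.response_bodies.__TABLES__`
    ORDER BY table_id DESC
"""
for row in client.query(sql).result():
    tb = row.size_bytes / 1024 ** 4
    status = "over limit" if row.size_bytes > BQ_LOAD_LIMIT_BYTES else "ok"
    print(f"{row.table_id}: {row.row_count:,} rows, {tb:.2f} TB ({status})")
```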
Here’s a look at how the response body sizes were distributed in 2020_08_01_mobile (a query sketch follows the table):
Body size (MB) | Requests | Cumulative weight (GB) | Cumulative requests |
---|---|---|---|
90 | 1 | 0.09 | 1 |
87 | 1 | 0.17 | 2 |
86 | 1 | 0.26 | 3 |
84 | 1 | 0.34 | 4 |
83 | 1 | 0.42 | 5 |
78 | 1 | 0.50 | 6 |
75 | 1 | 0.57 | 7 |
74 | 1 | 0.64 | 8 |
68 | 2 | 0.77 | 10 |
62 | 1 | 0.83 | 11 |
61 | 2 | 0.95 | 13 |
59 | 1 | 1.01 | 14 |
56 | 1 | 1.07 | 15 |
55 | 2 | 1.17 | 17 |
54 | 1 | 1.23 | 18 |
50 | 2 | 1.32 | 20 |
48 | 3 | 1.46 | 23 |
47 | 3 | 1.60 | 26 |
45 | 1 | 1.65 | 27 |
44 | 5 | 1.86 | 32 |
43 | 1 | 1.90 | 33 |
41 | 2 | 1.98 | 35 |
40 | 2 | 2.06 | 37 |
38 | 2 | 2.14 | 39 |
37 | 4 | 2.28 | 43 |
36 | 5 | 2.46 | 48 |
35 | 6 | 2.66 | 54 |
34 | 1 | 2.69 | 55 |
32 | 2 | 2.76 | 57 |
31 | 2 | 2.82 | 59 |
30 | 3 | 2.91 | 62 |
29 | 3 | 2.99 | 65 |
28 | 6 | 3.15 | 71 |
27 | 2 | 3.21 | 73 |
26 | 7 | 3.38 | 80 |
25 | 4 | 3.48 | 84 |
24 | 6 | 3.62 | 90 |
23 | 25 | 4.18 | 115 |
22 | 15 | 4.51 | 130 |
21 | 32 | 5.16 | 162 |
20 | 136 | 7.82 | 298 |
19 | 35 | 8.47 | 333 |
18 | 44 | 9.24 | 377 |
17 | 50 | 10.07 | 427 |
16 | 62 | 11.04 | 489 |
15 | 79 | 12.20 | 568 |
14 | 140 | 14.11 | 708 |
13 | 166 | 16.22 | 874 |
12 | 287 | 19.58 | 1,161 |
11 | 276 | 22.55 | 1,437 |
10 | 536 | 27.78 | 1,973 |
9 | 687 | 33.82 | 2,660 |
8 | 1,387 | 44.66 | 4,047 |
7 | 1,876 | 57.48 | 5,923 |
6 | 2,537 | 72.35 | 8,460 |
5 | 6,896 | 106.02 | 15,356 |
4 | 9,898 | 144.68 | 25,254 |
3 | 30,386 | 233.70 | 55,640 |
2 | 103,328 | 435.52 | 158,968 |
1 | 1,438,555 | 1840.35 | 1,597,523 |
0 | 268,652,163 | 1840.35 | 270,249,686 |
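The distribution above can be reproduced with a query roughly along these lines (a sketch that assumes the raw body lives in a `body` column; not necessarily the exact query I ran):

```python
# Sketch: bucket response bodies by floor(MB) and accumulate weight and request counts.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT
      CAST(FLOOR(BYTE_LENGTH(body) / (1024 * 1024)) AS INT64) AS mb,
      COUNT(0) AS requests,
      SUM(BYTE_LENGTH(body)) AS total_bytes
    FROM `httparchive.response_bodies.2020_08_01_mobile`
    WHERE body IS NOT NULL
    GROUP BY mb
    ORDER BY mb DESC
"""
cum_bytes = 0
cum_requests = 0
for row in client.query(sql).result():
    cum_bytes += row.total_bytes
    cum_requests += row.requests
    print(f"{row.mb} MB: {row.requests:,} requests, "
          f"{cum_bytes / 1024 ** 3:.2f} GB cumulative, {cum_requests:,} requests cumulative")
```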
Some hand-wavey math later, I think what this distribution is telling me is that if we reduce the row limit back down to 2 MB from 100 MB, we can save up to 606 GB, assuming each row also carries an average of 100 bytes of bookkeeping (page, url, truncated, requestid). This should be enough headroom to offset the dataset growth and get us back under the limit.
I’ll rerun the Dataflow job with a limit of 2 MB and report back if it works.
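For context, the kind of per-row cap involved looks roughly like this; a hypothetical sketch, not the actual bigquery_import.py logic, with field names taken from the bookkeeping list above:

```python
# Hypothetical sketch of capping response bodies at 2 MB before the BigQuery load.
MAX_CONTENT_SIZE = 2 * 1024 * 1024  # 2 MB per-row limit on the stored body

def build_row(page, url, request_id, body):
    """Truncate oversized bodies and mark the row so downstream queries can tell."""
    truncated = len(body) > MAX_CONTENT_SIZE
    return {
        'page': page,
        'url': url,
        'requestid': request_id,
        'truncated': truncated,
        'body': body[:MAX_CONTENT_SIZE] if truncated else body,
    }
```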
Last year, the response bodies processing was commented out because the dataset size exceeded 15 TB and was causing issues with the Dataflow pipeline. You can see where it's commented out here: https://github.com/HTTPArchive/bigquery/blob/master/dataflow/python/bigquery_import.py#L331
The HAR files still exist in GCS, so once this is resolved we could backfill the data.
Fixed by https://github.com/HTTPArchive/bigquery/pull/123