
GCS slow overwrite parquet partition

See original GitHub issue
19/09/19 11:49:05 INFO FileUtils: deleting  gs://rbuck/folder/ggo/hive_table_any/month=6/year=2018/xyz=abc/part-00073-9dbb91f8-6041-402f-a093-1061bb9ffaa8.c000
19/09/19 11:51:17 INFO Hive: Replacing src:gs://rbuck/folder/ggo/hive_table_any/.hive-staging_hive_2019-09-19_10-31-36_055_8543786963981631342-1/-ext-10000/month=6/year=2018/xyz=abc/part-00073-4a075b3d-61d3-46e9-8c28-ac2822ff0350.c000, dest: gs://rbuck/folder/ggo/hive_table_any/month=6/year=2018/xyz=abc/part-00073-4a075b3d-61d3-46e9-8c28-ac2822ff0350.c000, Status:true
19/09/19 11:51:17 INFO Hive: New loading path = gs://rbuck/folder/ggo/hive_table_any/.hive-staging_hive_2019-09-19_10-31-36_055_8543786963981631342-1/-ext-10000/month=6/year=2018/xyz=abc with partSpec {month=6, year=2018, xyz=abc}
19/09/19 11:51:17 INFO FileUtils: deleting  gs://rbuck/folder/ggo/hive_table_any/month=2/year=2018/xyz=abc/part-00029-9dbb91f8-6041-402f-a093-1061bb9ffaa8.c000
19/09/19 11:53:21 INFO Hive: Replacing src:gs://rbuck/folder/ggo/hive_table_any/.hive-staging_hive_2019-09-19_10-31-36_055_8543786963981631342-1/-ext-10000/month=2/year=2018/xyz=abc/part-00029-4a075b3d-61d3-46e9-8c28-ac2822ff0350.c000, dest: gs://rbuck/folder/ggo/hive_table_any/month=2/year=2018/xyz=abc/part-00029-4a075b3d-61d3-46e9-8c28-ac2822ff0350.c000, Status:true
19/09/19 11:53:21 INFO Hive: New loading path = gs://rbuck/folder/ggo/hive_table_any/.hive-staging_hive_2019-09-19_10-31-36_055_8543786963981631342-1/-ext-10000/month=2/year=2018/xyz=abc with partSpec {month=2, year=2018, xyz=abc}
19/09/19 11:53:21 INFO FileUtils: deleting  gs://rbuck/folder/ggo/hive_table_any/month=8/year=2019/xyz=abc/part-00065-9dbb91f8-6041-402f-a093-1061bb9ffaa8.c000
19/09/19 11:55:22 INFO Hive: Replacing src:gs://rbuck/folder/ggo/hive_table_any/.hive-staging_hive_2019-09-19_10-31-36_055_8543786963981631342-1/-ext-10000/month=8/year=2019/xyz=abc/part-00065-4a075b3d-61d3-46e9-8c28-ac2822ff0350.c000, dest: gs://rbuck/folder/ggo/hive_table_any/month=8/year=2019/xyz=abc/part-00065-4a075b3d-61d3-46e9-8c28-ac2822ff0350.c000, Status:true
19/09/19 11:55:22 INFO Hive: New loading path = gs://rbuck/folder/ggo/hive_table_any/.hive-staging_hive_2019-09-19_10-31-36_055_8543786963981631342-1/-ext-10000/month=8/year=2019/xyz=abc with partSpec {month=8, year=2019, xyz=abc}
19/09/19 11:55:23 INFO FileUtils: deleting  gs://rbuck/folder/ggo/hive_table_any/month=9/year=2015/xyz=abc/part-00179-2a442cbd-d1b0-4484-9f44-7b33c5c1b57d.c000

I have a Spark job that runs an INSERT OVERWRITE on a Hive table partitioned by month, year, and xyz. The Spark job itself takes 4 minutes, but the INSERT OVERWRITE that follows takes 2 hours across roughly 77 partitions. The logs show files being deleted from Google Cloud Storage at about 2 minutes per file, even though each file is only a few KB in size. How do I speed up this deletion? Can the rsync tool be used here? Are there any configuration settings that would help? We set up GCS with the default parameters.
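For context, here is a minimal sketch of the job shape being described: a dynamic-partition INSERT OVERWRITE into a Hive table partitioned by month, year, and xyz. The input path, column names, and staging view are hypothetical; the reporter's actual code is not shown in the issue.

from pyspark.sql import SparkSession

# Hypothetical sketch of the described job: Spark output that overwrites
# matching partitions of a Hive table stored on GCS.
spark = (
    SparkSession.builder
    .appName("gcs-insert-overwrite-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Hive dynamic partitioning must be enabled for a multi-partition overwrite.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Hypothetical input path and column names.
spark.read.parquet("gs://rbuck/folder/ggo/input/").createOrReplaceTempView("staging")

# Overwrites every partition present in the staging data, which triggers
# the per-partition delete/replace cycle visible in the logs above.
spark.sql("""
    INSERT OVERWRITE TABLE hive_table_any
    PARTITION (month, year, xyz)
    SELECT col_a, col_b, month, year, xyz
    FROM staging
""")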

Side note: with the settings below, the per-file deletion time dropped from 2 minutes to 1 minute.

--conf spark.hadoop.fs.gs.batch.threads=16 
--conf spark.hadoop.parquet.enable.summary-metadata=false 
--conf spark.sql.parquet.mergeSchema=false 
--conf spark.sql.parquet.filterPushdown=true 
--conf spark.sql.hive.metastorePartitionPruning=true 
--conf spark.sql.parquet.cacheMetadata=true 
--conf spark.hadoop.fs.gs.performance.cache.enable=true 
--conf spark.hadoop.fs.gs.status.parallel.enable=true
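For reference, the same flags can also be set programmatically when building the SparkSession instead of via spark-submit; a sketch in PySpark, mirroring the values listed above:

from pyspark.sql import SparkSession

# Programmatic equivalent of the --conf flags above (same keys and values).
spark = (
    SparkSession.builder
    .appName("gcs-tuned-session")
    .enableHiveSupport()
    .config("spark.hadoop.fs.gs.batch.threads", "16")
    .config("spark.hadoop.parquet.enable.summary-metadata", "false")
    .config("spark.sql.parquet.mergeSchema", "false")
    .config("spark.sql.parquet.filterPushdown", "true")
    .config("spark.sql.hive.metastorePartitionPruning", "true")
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.hadoop.fs.gs.performance.cache.enable", "true")
    .config("spark.hadoop.fs.gs.status.parallel.enable", "true")
    .getOrCreate()
)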

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
pandareen commented, Oct 31, 2019

Thanks for the clarification. In the meantime, I've changed the table type from Parquet to ORC; instead of 170 minutes it now takes 7 minutes. I'll close this, since it's not something on the GCS end.
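For anyone taking the same route, one way to recreate a Parquet-backed Hive table as ORC is to create an ORC table with the same layout and copy the data over. A hedged sketch, not the commenter's actual migration; the new table name and column definitions are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

# Create an ORC-backed table with the same (hypothetical) schema and
# partitioning, then copy the existing Parquet data into it.
spark.sql("""
    CREATE TABLE IF NOT EXISTS hive_table_any_orc (
        col_a STRING,
        col_b DOUBLE
    )
    PARTITIONED BY (month INT, year INT, xyz STRING)
    STORED AS ORC
""")

spark.sql("""
    INSERT OVERWRITE TABLE hive_table_any_orc
    PARTITION (month, year, xyz)
    SELECT col_a, col_b, month, year, xyz
    FROM hive_table_any
""")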

1 reaction
medb commented, Oct 31, 2019

One thing you can try: if you are using dynamic partitioning, refactor the single dynamic-partitioning query into multiple static-partitioning queries (one query per partition); that can improve overall performance. You can find details on how to do this in this post.
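A sketch of that refactor, assuming PySpark and the hypothetical table, column, and staging-view names from the sketches above: instead of one dynamic-partition overwrite, loop over the affected partitions and run a static-partition overwrite for each.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hypothetical list of partitions touched by this run; in practice it
# could be derived from the staging data itself.
partitions = [
    {"month": 6, "year": 2018, "xyz": "abc"},
    {"month": 2, "year": 2018, "xyz": "abc"},
    {"month": 8, "year": 2019, "xyz": "abc"},
]

# 'staging' is the hypothetical temp view registered in the first sketch.
for p in partitions:
    # Static partition spec: the partition columns are fixed in the
    # PARTITION clause and therefore omitted from the SELECT list.
    spark.sql(f"""
        INSERT OVERWRITE TABLE hive_table_any
        PARTITION (month={p['month']}, year={p['year']}, xyz='{p['xyz']}')
        SELECT col_a, col_b
        FROM staging
        WHERE month = {p['month']} AND year = {p['year']} AND xyz = '{p['xyz']}'
    """)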

Read more comments on GitHub >

Top Results From Across the Web

GCS slow overwrite parquet partition · Issue #270 - GitHub
The spark job itself takes 4 mins but after that the insert overwrite operation takes 2 hours with around 77 partitions.
Read more >
parquet write to gs:// slow - Google Groups
My setup: 200-core cluster performing a large parquet write (13K 100+/-MB partitions) to Google Storage. After all the partitions complete, it takes another ......
Read more >
Warning on dataproc while using partitionBy on a dataframe
I tried your code and it was indeed slow -- for me it took over 8 minutes. I got a significant speedup (down...
Read more >
Python and Parquet Performance - Data Syndrome
This post outlines how to use all common Python libraries to read and write Parquet format while taking advantage of columnar storage, ...
Read more >
Apache Spark Performance Boosting | by Halil Ertan
Csv and Json data file formats give high write performance but are slower for reading, on the other hand, Parquet file format is...
Read more >