GCS slow overwrite parquet partition
19/09/19 11:49:05 INFO FileUtils: deleting gs://rbuck/folder/ggo/hive_table_any/month=6/year=2018/xyz=abc/part-00073-9dbb91f8-6041-402f-a093-1061bb9ffaa8.c000
19/09/19 11:51:17 INFO Hive: Replacing src:gs://rbuck/folder/ggo/hive_table_any/.hive-staging_hive_2019-09-19_10-31-36_055_8543786963981631342-1/-ext-10000/month=6/year=2018/xyz=abc/part-00073-4a075b3d-61d3-46e9-8c28-ac2822ff0350.c000, dest: gs://rbuck/folder/ggo/hive_table_any/month=6/year=2018/xyz=abc/part-00073-4a075b3d-61d3-46e9-8c28-ac2822ff0350.c000, Status:true
19/09/19 11:51:17 INFO Hive: New loading path = gs://rbuck/folder/ggo/hive_table_any/.hive-staging_hive_2019-09-19_10-31-36_055_8543786963981631342-1/-ext-10000/month=6/year=2018/xyz=abc with partSpec {month=6, year=2018, xyz=abc}
19/09/19 11:51:17 INFO FileUtils: deleting gs://rbuck/folder/ggo/hive_table_any/month=2/year=2018/xyz=abc/part-00029-9dbb91f8-6041-402f-a093-1061bb9ffaa8.c000
19/09/19 11:53:21 INFO Hive: Replacing src:gs://rbuck/folder/ggo/hive_table_any/.hive-staging_hive_2019-09-19_10-31-36_055_8543786963981631342-1/-ext-10000/month=2/year=2018/xyz=abc/part-00029-4a075b3d-61d3-46e9-8c28-ac2822ff0350.c000, dest: gs://rbuck/folder/ggo/hive_table_any/month=2/year=2018/xyz=abc/part-00029-4a075b3d-61d3-46e9-8c28-ac2822ff0350.c000, Status:true
19/09/19 11:53:21 INFO Hive: New loading path = gs://rbuck/folder/ggo/hive_table_any/.hive-staging_hive_2019-09-19_10-31-36_055_8543786963981631342-1/-ext-10000/month=2/year=2018/xyz=abc with partSpec {month=2, year=2018, xyz=abc}
19/09/19 11:53:21 INFO FileUtils: deleting gs://rbuck/folder/ggo/hive_table_any/month=8/year=2019/xyz=abc/part-00065-9dbb91f8-6041-402f-a093-1061bb9ffaa8.c000
19/09/19 11:55:22 INFO Hive: Replacing src:gs://rbuck/folder/ggo/hive_table_any/.hive-staging_hive_2019-09-19_10-31-36_055_8543786963981631342-1/-ext-10000/month=8/year=2019/xyz=abc/part-00065-4a075b3d-61d3-46e9-8c28-ac2822ff0350.c000, dest: gs://rbuck/folder/ggo/hive_table_any/month=8/year=2019/xyz=abc/part-00065-4a075b3d-61d3-46e9-8c28-ac2822ff0350.c000, Status:true
19/09/19 11:55:22 INFO Hive: New loading path = gs://rbuck/folder/ggo/hive_table_any/.hive-staging_hive_2019-09-19_10-31-36_055_8543786963981631342-1/-ext-10000/month=8/year=2019/xyz=abc with partSpec {month=8, year=2019, xyz=abc}
19/09/19 11:55:23 INFO FileUtils: deleting gs://rbuck/folder/ggo/hive_table_any/month=9/year=2015/xyz=abc/part-00179-2a442cbd-d1b0-4484-9f44-7b33c5c1b57d.c000
I have a Spark job that does an INSERT OVERWRITE on a Hive table partitioned by month, year, and xyz. The Spark job itself takes 4 minutes, but the insert-overwrite step that follows takes 2 hours for around 77 partitions. I can see it deleting files from Google Cloud Storage, and each delete takes about 2 minutes per file even though the files are only a few KB in size. How do I speed up this deletion? Can we use the rsync tool? Are there any configuration settings that would help? We set up GCS with default parameters.
Side note: the settings below brought the per-file deletion time down from 2 minutes to 1 minute.
--conf spark.hadoop.fs.gs.batch.threads=16
--conf spark.hadoop.parquet.enable.summary-metadata=false
--conf spark.sql.parquet.mergeSchema=false
--conf spark.sql.parquet.filterPushdown=true
--conf spark.sql.hive.metastorePartitionPruning=true
--conf spark.sql.parquet.cacheMetadata=true
--conf spark.hadoop.fs.gs.performance.cache.enable=true
--conf spark.hadoop.fs.gs.status.parallel.enable=true
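For reference, the same options can also be set programmatically when building the SparkSession rather than via spark-submit flags. A minimal PySpark sketch (the app name is hypothetical):

```python
from pyspark.sql import SparkSession

# Minimal sketch: the same options as the --conf flags above,
# applied when constructing the session.
spark = (
    SparkSession.builder
    .appName("gcs-overwrite")  # hypothetical app name
    .config("spark.hadoop.fs.gs.batch.threads", "16")
    .config("spark.hadoop.parquet.enable.summary-metadata", "false")
    .config("spark.sql.parquet.mergeSchema", "false")
    .config("spark.sql.parquet.filterPushdown", "true")
    .config("spark.sql.hive.metastorePartitionPruning", "true")
    .config("spark.sql.parquet.cacheMetadata", "true")
    .config("spark.hadoop.fs.gs.performance.cache.enable", "true")
    .config("spark.hadoop.fs.gs.status.parallel.enable", "true")
    .enableHiveSupport()  # needed for INSERT OVERWRITE on Hive tables
    .getOrCreate()
)
```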
Thanks for the clarification. On the other hand, I've changed the table type from Parquet to ORC: instead of 170 minutes it now takes 7 minutes. I'll close this since it's not something on the GCS end.
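For anyone trying the same fix, a minimal PySpark sketch of rewriting the partitioned table as ORC instead of Parquet; the DataFrame `df` and the table name are hypothetical stand-ins:

```python
# Minimal sketch: write the same partitioned data in ORC format.
# `df` must contain the partition columns month, year, and xyz.
(df.write
   .format("orc")
   .partitionBy("month", "year", "xyz")
   .mode("overwrite")
   .saveAsTable("hive_table_any_orc"))  # hypothetical table name
```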
One thing you can try: if you are using dynamic partitioning, refactor your single dynamic-partitioning query into multiple queries with static partitioning (one query per partition); this can improve overall performance, as sketched below. You can find details on how to do this in this post.
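A minimal PySpark sketch of that refactor, assuming a registered staging view `staging` whose data columns are col1 and col2 (all names and partition values here are hypothetical):

```python
# One static INSERT OVERWRITE per partition instead of a single
# dynamic-partition overwrite. With a static partition spec, the
# partition columns are given literally and left out of the SELECT.
partitions = [(6, 2018, "abc"), (2, 2018, "abc"), (8, 2019, "abc")]
for month, year, xyz in partitions:
    spark.sql(f"""
        INSERT OVERWRITE TABLE hive_table_any
        PARTITION (month={month}, year={year}, xyz='{xyz}')
        SELECT col1, col2
        FROM staging
        WHERE month = {month} AND year = {year} AND xyz = '{xyz}'
    """)
```

Each statement then rewrites exactly one partition directory, so the queries can be issued independently instead of one statement performing the replace-and-delete work for every touched partition in sequence.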