Throttling Hive split discovery & assignment
Hello,
We have been experiencing issues recently where some extremely large queries (>1M splits) against Hive tables in a cluster of over 100 nodes would flood our HDFS namenode with a very high number of simultaneous read operations. This slowed the namenode down significantly, as it struggled to keep up, and started impacting other clients.
We are currently investigating a way to throttle split discovery/assignment for Hive splits without throttling the rest of the query stages, the aim being to prevent tasks from collectively opening so many concurrent splits that they effectively DDoS the namenode.
One solution we have is to reduce `hive.split-loader-concurrency` from 4 to 1. However, as far as I understand from looking at the code, this only reduces the parallelism used to list the contents of the partitions, and thus indirectly the rate at which splits are discovered; it does not provide guarantees and would likely be only a partial solution, as we have seen queries multiply the average load on the namenode by ~10x over the span of a few minutes.
Another solution could be to raise the split size, but that would potentially penalize medium-sized queries. We could also create a new tier of split size for very large queries (e.g. > 200K splits), however that won't prevent the first splits from creating a quick burst of requests to the namenode.
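For reference, the knobs mentioned above live in the Hive catalog properties file. A minimal sketch of what tuning them might look like, with purely illustrative values (these are not recommendations, and defaults may differ between versions):

```properties
# etc/catalog/hive.properties -- illustrative values only
# fewer concurrent threads listing partition contents during split discovery
hive.split-loader-concurrency=1
# the first splits of a scan stay small so short queries still start quickly
hive.max-initial-split-size=64MB
# a larger target split size means fewer splits (and fewer namenode requests) per query
hive.max-split-size=256MB
```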
Has anybody already had a look into a potential solution to this? From looking at the code, there does not seem to be a way to limit concurrency only for the source stages that read Hive splits, independently of the rest of the query.
One thing we are experimenting with is throttling the rate and size of the batches returned by the `AsyncQueue`, to avoid releasing too many concurrent splits to the tasks of the source stage, but we would definitely welcome a better solution if there is one 😃
Thanks!
Top GitHub Comments
Hello @nezihyigitbasi, thanks for your reply!
We have tried reducing the split loader concurrency, but even with a concurrency of 1 the splits can be assigned to tasks so fast that they easily trigger over 150K read ops/min. Ideally we would like strong guarantees that bad queries cannot trigger more than 60-80K ops/min.
There have indeed been issues with some datasets where small files were not properly consolidated, leading to an unreasonably high number of splits. However, as we don't own the pipelines generating these datasets, we would like to make sure that such an issue won't lead to Presto making read requests to the namenode at an unusually high rate.
We have been experimenting with throttling the `HiveSplitSource`, and I pushed a first PR: https://github.com/prestosql/presto/pull/534. The idea is to throttle the `borrowBatchAsync` method of the `AsyncQueue` so that it only releases up to `hive.max-splits-per-sec` splits per second to the source stage. We have tested it and it proved very effective, as you can see in the graph below (0 being no throttling). If you had a few minutes, I would gladly welcome feedback on whether you think this is the right place to implement such a throttling, and also on whether this is a feature that could reach upstream.
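For readers who do not want to dig through the PR, here is a minimal, self-contained sketch of the throttling idea using Guava's `RateLimiter`. It is not the actual PR code (the real change is asynchronous and lives inside Presto's `AsyncQueue`/`HiveSplitSource`), and the class and method names below are made up for illustration; it only shows how a splits-per-second cap bounds the rate at which buffered splits are handed to the source stage.

```java
import com.google.common.util.concurrent.RateLimiter;

import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative only: class and method names are hypothetical, not Presto's API.
// Splits buffered by the split loader are released to the source stage at a
// bounded rate, so a query with millions of splits cannot burst the namenode.
public class ThrottledSplitQueue<T>
{
    private final Queue<T> buffered = new ConcurrentLinkedQueue<>();
    private final RateLimiter releaseRate;   // permits == splits per second
    private final int maxBatchSize;

    public ThrottledSplitQueue(double maxSplitsPerSecond, int maxBatchSize)
    {
        this.releaseRate = RateLimiter.create(maxSplitsPerSecond);
        this.maxBatchSize = maxBatchSize;
    }

    // Called by the split loader as it discovers splits.
    public void offer(T split)
    {
        buffered.add(split);
    }

    // Called when handing work to tasks: returns at most maxBatchSize splits,
    // blocking as needed so the long-run rate never exceeds maxSplitsPerSecond.
    public List<T> borrowBatch()
    {
        List<T> batch = new ArrayList<>();
        while (batch.size() < maxBatchSize) {
            T split = buffered.poll();
            if (split == null) {
                break;
            }
            releaseRate.acquire();   // one permit per split released
            batch.add(split);
        }
        return batch;
    }
}
```

In the actual PR the throttling would presumably be applied asynchronously, with the future returned by `borrowBatchAsync` completing only once permits are available, so that no driver thread has to block.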
Thanks!
AFAIK you can tune two knobs for that. One is `hive.split-loader-concurrency`, as you already figured out; the other is the number of such queries running at the same time, which is a resource group config problem. I don't see any throttling or rate limiting in the split loading code path, so currently I don't think we have a good way to prevent a potential spike that would happen if a query arrives at the system that enumerates a large number of splits, even with a concurrency of 1.
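To illustrate the resource group angle with a rough, hypothetical sketch (group names and values are made up; see the Presto resource groups documentation for the full schema): capping how many such queries run concurrently at least limits how many split enumerations can hit the namenode at the same time. The file-based manager is enabled in `etc/resource-groups.properties` via `resource-groups.configuration-manager=file` and `resource-groups.config-file`, pointing at a JSON file along these lines:

```json
{
  "rootGroups": [
    {
      "name": "adhoc",
      "softMemoryLimit": "50%",
      "hardConcurrencyLimit": 5,
      "maxQueued": 100
    }
  ],
  "selectors": [
    {
      "group": "adhoc"
    }
  ]
}
```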
I think ideally, and if possible, the table that's being scanned should be fixed to have a smaller number of slightly larger splits, as having a large number of splits is a bad physical layout for a table for many reasons. Many deployments use a split size equivalent to an HDFS block size of 64MB or 128MB.