
Throttling Hive splits discovery & assignment

See original GitHub issue

Hello, we have been experiencing issues recently where some extremely large queries (>1M splits) against Hive tables in a cluster of over 100 nodes would flood our HDFS namenode with a very high number of simultaneous read operations. This slowed the namenode down significantly as it struggled to keep up, and it started impacting other clients.

We are currently investigating a way to throttle split discovery/assignment for Hive splits, without throttling the rest of the query stages; the aim is to prevent tasks from all opening too many concurrent splits and basically DDoSing the namenode. One solution we have is to reduce hive.split-loader-concurrency from 4 to 1. However, as far as I understand from looking at the code, this will reduce the parallelism used to list the contents of the partitions, and thus indirectly the rate at which splits are discovered, but it does not provide guarantees and would likely only be a partial solution, as we have seen queries multiply the average load on the namenode by ~10x over the span of a few minutes. Another solution could be to raise the split size, but that would potentially penalize medium-sized queries. We could also create a new tier of split size for very large queries (e.g. > 200K splits), however that won’t prevent the first splits from creating a quick burst of requests to the namenode.
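
For reference, the knobs discussed above live in the Hive catalog properties file (typically etc/catalog/hive.properties). The values below are purely illustrative, and the split-size property names are listed from memory of the Hive connector docs rather than from this issue, so please verify them against your Presto version:

```properties
# Parallelism of the background split loader (partition listing / split generation).
# Lowering it slows discovery, but as noted above it gives no hard guarantee on namenode load.
hive.split-loader-concurrency=1

# Larger splits mean fewer splits overall (hence fewer namenode read operations),
# at the cost of coarser parallelism for medium-sized queries.
hive.max-initial-split-size=64MB
hive.max-split-size=256MB
```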

Has anybody already looked into a potential solution to this? From looking at the code, there does not seem to be a way to limit concurrency only for the source stages that read Hive splits, independently of the rest of the query. One thing we are experimenting with is throttling the rate and size of the batches returned by the AsyncQueue, to avoid releasing too many concurrent splits to tasks of the source stage, but we would definitely welcome a better solution if there is one 😃

Thanks!

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

1 reaction
BenoitHanotte commented, Mar 25, 2019

Hello @nezihyigitbasi, thanks for your reply!

We have tried to reduce the split loader concurrency, but even with a concurrency of 1, the splits can be assigned to tasks so fast that they can easily trigger over 150K read ops/min (see the loader-concurrency graph in the original issue). Ideally we would like to have strong guarantees that bad queries cannot trigger over 60-80K ops/min.

There have indeed been issues in some datasets where small files were not properly consolidated, leading to an unreasonably high number of splits. However, as we don’t own the pipelines generating these datasets, we would like to make sure that such an issue won’t lead to Presto making read requests to the namenode at an unusually high rate.

We have been experimenting with throttling the HiveSplitSource, and I pushed a first PR: https://github.com/prestosql/presto/pull/534. The idea is to throttle the borrowBatchAsync method of the AsyncQueue to only release up to hive.max-splits-per-sec splits per second to the source stage. We have tested it and it proved very effective, as you can see in the splits-throttling graph in the original issue (0 being no throttling).
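
To make the idea concrete, here is a minimal, self-contained sketch of the throttling approach. It is not the actual PR code: the class and method names below are hypothetical, and the real change wraps the AsyncQueue’s borrowBatchAsync inside HiveSplitSource. The sketch just shows the core mechanism, using Guava’s RateLimiter to cap how many splits per second are handed to the source stage:

```java
import com.google.common.util.concurrent.RateLimiter;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration of the throttling idea: splits can be enqueued as
// fast as the split loader discovers them, but they are only released to
// consumers at a bounded rate (e.g. the value of a hive.max-splits-per-sec
// style configuration property).
public class ThrottledSplitQueue<S>
{
    private final BlockingQueue<S> pendingSplits = new LinkedBlockingQueue<>();
    private final RateLimiter releaseRate;

    public ThrottledSplitQueue(double maxSplitsPerSecond)
    {
        this.releaseRate = RateLimiter.create(maxSplitsPerSecond);
    }

    // Called by the split loader: discovery itself is not throttled here.
    public void offer(S split)
    {
        pendingSplits.add(split);
    }

    // Called by the scheduler: hand out up to maxSize splits, acquiring one
    // rate-limiter permit per split so a burst of ready splits is smoothed
    // out over time instead of hitting the namenode all at once.
    public List<S> borrowBatch(int maxSize)
    {
        List<S> batch = new ArrayList<>();
        while (batch.size() < maxSize) {
            S split = pendingSplits.poll();
            if (split == null) {
                break;
            }
            releaseRate.acquire();
            batch.add(split);
        }
        return batch;
    }
}
```

With this shape, even a query with a million ready splits only reaches tasks at the configured rate, so the namenode sees a steady stream of opens rather than a spike.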

If you have a few minutes, I would gladly welcome feedback on whether you think this is the right place to implement such throttling, and also whether you think this is a feature that could make it upstream.

Thanks!

1 reaction
nezihyigitbasi commented, Mar 19, 2019

AFAIK you can tune two knobs for that: one is hive.split-loader-concurrency, as you already figured out; the other one is the number of such queries running at the same time, which is a resource group configuration problem.
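
For the second knob, here is a rough sketch of a file-based resource group configuration (etc/resource-groups.json) that caps how many such queries run at once. The group name, limits, and selector regex are made up for illustration, not taken from this issue:

```json
{
  "rootGroups": [
    {
      "name": "hive-heavy",
      "softMemoryLimit": "50%",
      "hardConcurrencyLimit": 2,
      "maxQueued": 20
    }
  ],
  "selectors": [
    {
      "source": ".*large-scan.*",
      "group": "hive-heavy"
    }
  ]
}
```

This assumes the file-based resource group manager is enabled (resource-groups.configuration-manager=file in etc/resource-groups.properties); it limits how many offending queries run concurrently, but not how fast any single query enumerates splits.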

I don’t see any throttling or rate limiting in the split loading code path, so currently I don’t think we have a good way to prevent the potential spike that occurs when a query that enumerates a large number of splits arrives at the system, even with a concurrency of 1.

I think ideally, and if possible, the table that’s being scanned should be fixed to have a smaller number of slightly larger splits, as having a large number of splits is a bad physical layout for a table for many reasons. Many deployments use a split size equivalent to an HDFS block size of 64 MB or 128 MB.
