Two scale requests triggered in parallel for the same segment
Problem description
In one of the platform tests we see the following logs, which indicate that scale-up is triggered in parallel for the same segment:
2017-08-02 18:11:41,670 1091297 [segment-store-46] INFO i.p.s.s.h.stat.AutoScaleProcessor - received traffic for hulk/smallScale/0 with twoMinute rate = 36.92450249113799 and targetRate = 3
2017-08-02 18:11:41,671 1091298 [segment-store-25] INFO i.p.s.s.h.stat.AutoScaleProcessor - received traffic for hulk/smallScale/0 with twoMinute rate = 36.92450249113799 and targetRate = 3
2017-08-02 18:11:41,671 1091298 [segment-store-46] INFO i.p.s.s.h.stat.AutoScaleProcessor - sending request for scale up for hulk/smallScale/0
2017-08-02 18:11:41,671 1091298 [segment-store-25] INFO i.p.s.s.h.stat.AutoScaleProcessor - sending request for scale up for hulk/smallScale/0
Due to this, we see a "segment already exists" exception from HDFS in the segment store logs:
2017-08-02 17:56:05,669 155296 [segment-store-43] ERROR i.p.s.s.h.h.PravegaRequestProcessor - Error (Segment = '_system/_commitStream/1', Operation = 'Create segment')
io.pravega.segmentstore.contracts.StreamSegmentExistsException: [Segment '_system/_commitStream/1'] The StreamSegment exists already
at io.pravega.segmentstore.storage.impl.hdfs.HDFSExceptionHelpers.translateFromException(HDFSExceptionHelpers.java:46)
at io.pravega.segmentstore.storage.impl.hdfs.HDFSStorage.handleException(HDFSStorage.java:238)
at io.pravega.segmentstore.storage.impl.hdfs.HDFSStorage.lambda$supplyAsync$1(HDFSStorage.java:227)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: _system/_commitStream/1
at io.pravega.segmentstore.storage.impl.hdfs.HDFSExceptionHelpers.segmentExistsException(HDFSExceptionHelpers.java:63)
at io.pravega.segmentstore.storage.impl.hdfs.CreateOperation.call(CreateOperation.java:48)
at io.pravega.segmentstore.storage.impl.hdfs.CreateOperation.call(CreateOperation.java:29)
at io.pravega.segmentstore.storage.impl.hdfs.HDFSStorage.lambda$supplyAsync$1(HDFSStorage.java:225)
... 7 common frames omitted
Problem location
AutoScaleProcessor
Suggestions for an improvement
Ensure that scale-up is attempted only once for a given segment, or make it idempotent so that no exceptions appear in the logs.
Issue Analytics
- State:
- Created 6 years ago
- Comments:5 (5 by maintainers)
Top GitHub Comments
I think there is a race in here:
https://github.com/pravega/pravega/blob/master/segmentstore/server/host/src/main/java/io/pravega/segmentstore/server/host/stat/AutoScaleProcessor.java#L153
because we write the event and then update the cache.
triggerScaleUp is triggered from the append processor, so every append result could end up calling it. If this happens for the same segment concurrently (different connections, distinct AppendProcessor instances), then I don't see how we are preventing the duplication.
@shiveshr this duplication is occurring because of a lack of synchronization in triggerScaleUp (I think scale down has the same issue). Consequently, even if we synchronize the whole method, we would only be doing it once we decide to scale, not upon every append.
Also, the issue arises only when writing the request, so I wonder if we can use a compare-and-set-like approach to avoid synchronizing the whole method.
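For illustration only, here is a minimal sketch of what such a compare-and-set guard could look like. It is not the actual AutoScaleProcessor code: the class ScaleRequestDeduper, its methods tryClaim and release, and the muteMillis window are all hypothetical names. The idea is that a caller must atomically claim the segment before writing the scale request, so the check and the cache update happen in one step instead of "write the event, then update the cache".

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/**
 * Hypothetical helper (not part of Pravega): lets at most one caller per
 * mute window claim the right to send a scale request for a segment.
 */
final class ScaleRequestDeduper {
    // Segment name -> timestamp (ms) of the last claimed scale request.
    private final ConcurrentMap<String, Long> lastRequest = new ConcurrentHashMap<>();
    private final long muteMillis;

    ScaleRequestDeduper(long muteMillis) {
        this.muteMillis = muteMillis;
    }

    /** Returns true for exactly one of several callers racing on the same segment. */
    boolean tryClaim(String segment, long nowMillis) {
        while (true) {
            Long previous = lastRequest.get(segment);
            if (previous == null) {
                // No earlier request: only one concurrent putIfAbsent can win.
                if (lastRequest.putIfAbsent(segment, nowMillis) == null) {
                    return true;
                }
            } else if (nowMillis - previous < muteMillis) {
                // Another caller claimed this segment recently; skip the request.
                return false;
            } else if (lastRequest.replace(segment, previous, nowMillis)) {
                // Compare-and-set: succeeds only if nobody updated the entry meanwhile.
                return true;
            }
            // Lost a race on this entry; re-read it and decide again.
        }
    }

    /** Gives the claim back, e.g. if writing the request failed. */
    void release(String segment, long claimedAtMillis) {
        lastRequest.remove(segment, claimedAtMillis);
    }
}
```

With a guard like this, triggerScaleUp would write the request only when tryClaim returns true, and would call release if the write fails so that a later append can retry. No lock is held while the request is being written, which avoids synchronizing the whole method.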
In any case, I think the description of this issue is wrong. The first few log messages refer to hulk/smallScale/0, while the exception refers to _system/_commitStream/1. I think the events are unrelated.