
Two scale requests triggered in parallel for the same segment

See original GitHub issue

Problem description: In one of the platform tests we see the following logs, which indicate that scale-up is triggered in parallel for the same segment:

2017-08-02 18:11:41,670 1091297 [segment-store-46] INFO  i.p.s.s.h.stat.AutoScaleProcessor - received traffic for hulk/smallScale/0 with twoMinute rate = 36.92450249113799 and targetRate = 3
2017-08-02 18:11:41,671 1091298 [segment-store-25] INFO  i.p.s.s.h.stat.AutoScaleProcessor - received traffic for hulk/smallScale/0 with twoMinute rate = 36.92450249113799 and targetRate = 3
2017-08-02 18:11:41,671 1091298 [segment-store-46] INFO  i.p.s.s.h.stat.AutoScaleProcessor - sending request for scale up for hulk/smallScale/0
2017-08-02 18:11:41,671 1091298 [segment-store-25] INFO  i.p.s.s.h.stat.AutoScaleProcessor - sending request for scale up for hulk/smallScale/0

As a result, we see a "segment already exists" exception from HDFS in the segment store logs:

2017-08-02 17:56:05,669 155296 [segment-store-43] ERROR i.p.s.s.h.h.PravegaRequestProcessor - Error (Segment = '_system/_commitStream/1', Operation = 'Create segment')
io.pravega.segmentstore.contracts.StreamSegmentExistsException: [Segment '_system/_commitStream/1'] The StreamSegment exists already
	at io.pravega.segmentstore.storage.impl.hdfs.HDFSExceptionHelpers.translateFromException(HDFSExceptionHelpers.java:46)
	at io.pravega.segmentstore.storage.impl.hdfs.HDFSStorage.handleException(HDFSStorage.java:238)
	at io.pravega.segmentstore.storage.impl.hdfs.HDFSStorage.lambda$supplyAsync$1(HDFSStorage.java:227)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: _system/_commitStream/1
	at io.pravega.segmentstore.storage.impl.hdfs.HDFSExceptionHelpers.segmentExistsException(HDFSExceptionHelpers.java:63)
	at io.pravega.segmentstore.storage.impl.hdfs.CreateOperation.call(CreateOperation.java:48)
	at io.pravega.segmentstore.storage.impl.hdfs.CreateOperation.call(CreateOperation.java:29)
	at io.pravega.segmentstore.storage.impl.hdfs.HDFSStorage.lambda$supplyAsync$1(HDFSStorage.java:225)
	... 7 common frames omitted

Problem location: AutoScaleProcessor

Suggestions for an improvement: Ensure that scale-up is attempted only once for a given segment, or make the operation idempotent so that no exceptions appear in the logs.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
fpj commented, Aug 3, 2017

I think there is a race in here:

https://github.com/pravega/pravega/blob/master/segmentstore/server/host/src/main/java/io/pravega/segmentstore/server/host/stat/AutoScaleProcessor.java#L153

because we write the event and then update the cache.

triggerScaleUp is triggered from the append processor, so every append result could end up calling it. If this happens for the same segment concurrently (different connections, distinct AppendProcessor instances), then I don’t see how we are preventing the duplication.
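The race fpj describes above can be reduced to a check-then-act pattern: the cache is consulted first, the event is written, and only then is the cache updated, so two concurrent callers can both pass the check. A minimal hypothetical sketch (class and method names are illustrative, not the actual Pravega code):

```java
import java.util.concurrent.ConcurrentHashMap;

public class RacyTrigger {
    // Tracks the time of the last scale request per segment.
    private final ConcurrentHashMap<String, Long> cache = new ConcurrentHashMap<>();

    // Check, write the event, THEN update the cache. Two threads calling this
    // concurrently for the same segment can both see a stale (or missing)
    // cache entry and both write a scale event.
    public boolean triggerScaleUp(String segment, long nowMs) {
        Long last = cache.get(segment);          // threads A and B may both read null
        if (last != null && nowMs - last < 600_000) {
            return false;                        // a request was sent recently
        }
        writeScaleEvent(segment);                // both threads reach this line
        cache.put(segment, nowMs);               // cache updated too late to help
        return true;
    }

    private void writeScaleEvent(String segment) {
        // Stand-in for posting the scale request to the request stream.
    }
}
```

Single-threaded the cooldown works; the duplication only appears under concurrency, which is why it shows up intermittently in platform tests.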

0 reactions
fpj commented, Aug 5, 2017

@shiveshr this duplication is occurring because of a lack of synchronization in triggerScaleUp (I think scale-down has the same issue). Note that even if we synchronize the whole method, we would only be paying that cost once we decide to scale, not upon every append.

Also, since the issue only arises upon writing the request, I wonder if we can use a compare-and-set-like approach to avoid synchronizing the whole method.

In any case, I think the description of this issue is wrong. The first few log messages refer to:

hulk/smallScale/0

while the exception refers to:

[Segment '_system/_commitStream/1']

I think the events are unrelated.
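fpj’s compare-and-set suggestion could be sketched roughly as follows. This is a hypothetical illustration, not the Pravega implementation: the names `ScaleRequestGuard`, `tryAcquire`, and `COOLDOWN_MS` are invented, and the key idea is simply to claim the cache entry atomically *before* writing the event, so exactly one caller per segment wins.

```java
import java.util.concurrent.ConcurrentHashMap;

public class ScaleRequestGuard {
    private static final long COOLDOWN_MS = 10 * 60 * 1000; // illustrative cooldown
    private final ConcurrentHashMap<String, Long> lastRequest = new ConcurrentHashMap<>();

    /**
     * Returns true for exactly one caller per segment per cooldown window.
     * The cache entry is claimed atomically BEFORE the scale event is written,
     * so concurrent callers for the same segment cannot both proceed.
     */
    public boolean tryAcquire(String segment, long nowMs) {
        Long prev = lastRequest.get(segment);
        if (prev != null && nowMs - prev < COOLDOWN_MS) {
            return false;                        // a request was already sent recently
        }
        if (prev == null) {
            // Only one of several concurrent callers sees null win the race.
            return lastRequest.putIfAbsent(segment, nowMs) == null;
        }
        // Atomic compare-and-set: succeeds only if no other caller got here first.
        return lastRequest.replace(segment, prev, nowMs);
    }
}
```

A caller would write the scale event only when `tryAcquire` returns true, which avoids synchronizing the whole method while still preventing duplicate requests for the same segment.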
