
Two scale requests triggered in parallel for the same segment

See original GitHub issue

Problem description: In one of the platform tests we see the following logs, which indicate that scale-up is triggered in parallel for the same segment:

2017-08-02 18:11:41,670 1091297 [segment-store-46] INFO  i.p.s.s.h.stat.AutoScaleProcessor - received traffic for hulk/smallScale/0 with twoMinute rate = 36.92450249113799 and targetRate = 3
2017-08-02 18:11:41,671 1091298 [segment-store-25] INFO  i.p.s.s.h.stat.AutoScaleProcessor - received traffic for hulk/smallScale/0 with twoMinute rate = 36.92450249113799 and targetRate = 3
2017-08-02 18:11:41,671 1091298 [segment-store-46] INFO  i.p.s.s.h.stat.AutoScaleProcessor - sending request for scale up for hulk/smallScale/0
2017-08-02 18:11:41,671 1091298 [segment-store-25] INFO  i.p.s.s.h.stat.AutoScaleProcessor - sending request for scale up for hulk/smallScale/0

As a result, we see a "segment already exists" exception from HDFS in the segment store logs:

2017-08-02 17:56:05,669 155296 [segment-store-43] ERROR i.p.s.s.h.h.PravegaRequestProcessor - Error (Segment = '_system/_commitStream/1', Operation = 'Create segment')
io.pravega.segmentstore.contracts.StreamSegmentExistsException: [Segment '_system/_commitStream/1'] The StreamSegment exists already
	at io.pravega.segmentstore.storage.impl.hdfs.HDFSExceptionHelpers.translateFromException(HDFSExceptionHelpers.java:46)
	at io.pravega.segmentstore.storage.impl.hdfs.HDFSStorage.handleException(HDFSStorage.java:238)
	at io.pravega.segmentstore.storage.impl.hdfs.HDFSStorage.lambda$supplyAsync$1(HDFSStorage.java:227)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: _system/_commitStream/1
	at io.pravega.segmentstore.storage.impl.hdfs.HDFSExceptionHelpers.segmentExistsException(HDFSExceptionHelpers.java:63)
	at io.pravega.segmentstore.storage.impl.hdfs.CreateOperation.call(CreateOperation.java:48)
	at io.pravega.segmentstore.storage.impl.hdfs.CreateOperation.call(CreateOperation.java:29)
	at io.pravega.segmentstore.storage.impl.hdfs.HDFSStorage.lambda$supplyAsync$1(HDFSStorage.java:225)
	... 7 common frames omitted

Problem location: AutoScaleProcessor

Suggestions for an improvement: Ensure that scale-up is attempted only once for a given segment, or make the operation idempotent so that no exceptions appear in the logs.

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
fpj commented, Aug 3, 2017

I think there is a race in here:

https://github.com/pravega/pravega/blob/master/segmentstore/server/host/src/main/java/io/pravega/segmentstore/server/host/stat/AutoScaleProcessor.java#L153

because we write the event and then update the cache.

triggerScaleUp is triggered from the append processor, so every append result could end up calling it. If this happens for the same segment concurrently (different connections, distinct AppendProcessor instances), then I don’t see how we are preventing the duplication.
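The race fpj describes above can be reduced to a check-then-act pattern: the cache is consulted first, the event is written, and only then is the cache updated, so two concurrent callers can both pass the check. A minimal hypothetical sketch (class and method names are illustrative, not the actual Pravega code):

```java
import java.util.concurrent.ConcurrentHashMap;

public class RacyTrigger {
    // Tracks the time of the last scale request per segment.
    private final ConcurrentHashMap<String, Long> cache = new ConcurrentHashMap<>();

    // Check, write the event, THEN update the cache. Two threads calling this
    // concurrently for the same segment can both see a stale (or missing)
    // cache entry and both write a scale event.
    public boolean triggerScaleUp(String segment, long nowMs) {
        Long last = cache.get(segment);          // threads A and B may both read null
        if (last != null && nowMs - last < 600_000) {
            return false;                        // a request was sent recently
        }
        writeScaleEvent(segment);                // both threads reach this line
        cache.put(segment, nowMs);               // cache updated too late to help
        return true;
    }

    private void writeScaleEvent(String segment) {
        // Stand-in for posting the scale request to the request stream.
    }
}
```

Single-threaded the cooldown works; the duplication only appears under concurrency, which is why it shows up intermittently in platform tests.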

0 reactions
fpj commented, Aug 5, 2017

@shiveshr this duplication is occurring because of a lack of synchronization in triggerScaleUp (I think scale-down has the same issue). Note that even if we synchronize the whole method, we would only be paying that cost once we decide to scale, not upon every append.

Also, since the issue only arises upon writing the request, I wonder if we can use a compare-and-set-like approach to avoid synchronizing the whole method.

In any case, I think the description of this issue is wrong. The first few log messages refer to:

hulk/smallScale/0

while the exception refers to:

[Segment '_system/_commitStream/1']

I think the events are unrelated.
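fpj’s compare-and-set suggestion could be sketched roughly as follows. This is a hypothetical illustration, not the Pravega implementation: the names `ScaleRequestGuard`, `tryAcquire`, and `COOLDOWN_MS` are invented, and the key idea is simply to claim the cache entry atomically *before* writing the event, so exactly one caller per segment wins.

```java
import java.util.concurrent.ConcurrentHashMap;

public class ScaleRequestGuard {
    private static final long COOLDOWN_MS = 10 * 60 * 1000; // illustrative cooldown
    private final ConcurrentHashMap<String, Long> lastRequest = new ConcurrentHashMap<>();

    /**
     * Returns true for exactly one caller per segment per cooldown window.
     * The cache entry is claimed atomically BEFORE the scale event is written,
     * so concurrent callers for the same segment cannot both proceed.
     */
    public boolean tryAcquire(String segment, long nowMs) {
        Long prev = lastRequest.get(segment);
        if (prev != null && nowMs - prev < COOLDOWN_MS) {
            return false;                        // a request was already sent recently
        }
        if (prev == null) {
            // Only one of several concurrent callers sees null win the race.
            return lastRequest.putIfAbsent(segment, nowMs) == null;
        }
        // Atomic compare-and-set: succeeds only if no other caller got here first.
        return lastRequest.replace(segment, prev, nowMs);
    }
}
```

A caller would write the scale event only when `tryAcquire` returns true, which avoids synchronizing the whole method while still preventing duplicate requests for the same segment.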
