remove unnecessary synchronization overhead from complex Aggregators

Motivation

Many complex [Buffer]Aggregator implementations synchronize access to their internal data structures because, during realtime indexing, those structures are used in a single-writer-multiple-reader fashion: they are queried concurrently while being updated. That synchronization is unnecessary everywhere else, yet we pay its price anyway, for example on historical nodes while querying and in batch indexing tasks. Most recently this came up in https://github.com/apache/incubator-datasketches-java/issues/263.

Proposed changes

I haven’t really done a prototype yet but I “think” these changes should be doable.

Add the following methods (with default implementations) to AggregatorFactory:

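  // Proposed overload: the default ignores isConcurrent and delegates to the existing method.
  public Aggregator factorize(ColumnSelectorFactory metricFactory, boolean isConcurrent)
  {
    return factorize(metricFactory);
  }

  // Same default behaviour for the buffer-based variant.
  public BufferAggregator factorizeBuffered(ColumnSelectorFactory metricFactory, boolean isConcurrent)
  {
    return factorizeBuffered(metricFactory);
  }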

Then replace all calls inside Druid code to AggregatorFactory.factorize[Buffered](ColumnSelectorFactory) with AggregatorFactory.factorize[Buffered](ColumnSelectorFactory, boolean isConcurrent), passing the right value for isConcurrent. IncrementalIndex would be made aware of its concurrency context by changing the existing variable concurrentEventAdd to isConcurrent and setting it correctly everywhere an IncrementalIndex instance is created, so that it can pass the right value for isConcurrent when calling factorize[Buffered](..). Relevant complex aggregators such as thetaSketch can then override the newly added methods to add synchronization only where it is really needed.
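To make the override pattern concrete, here is a minimal sketch of how a complex aggregator factory might use the proposed overload. It is not Druid code: DemoAggregator, DemoAggregatorFactory, DemoSketchFactory and DemoIndex are hypothetical, stripped-down stand-ins for Aggregator, AggregatorFactory, a sketch-style factory and IncrementalIndex, and a LongSupplier stands in for ColumnSelectorFactory. Having the one-argument method fall back to the synchronized variant is also an assumption, chosen so that callers unaware of their context keep today's behaviour.

```java
import java.util.function.LongSupplier;

// Hypothetical, stripped-down stand-in for Druid's Aggregator interface.
interface DemoAggregator {
    void aggregate();
    long get();
}

// Unsynchronized variant: only safe when one thread both updates and reads.
class PlainSumAggregator implements DemoAggregator {
    private final LongSupplier selector;
    private long sum;

    PlainSumAggregator(LongSupplier selector) {
        this.selector = selector;
    }

    @Override
    public void aggregate() {
        sum += selector.getAsLong();
    }

    @Override
    public long get() {
        return sum;
    }
}

// Wrapper that takes the object lock on every call, for single-writer-multiple-reader use.
class SynchronizedAggregator implements DemoAggregator {
    private final DemoAggregator delegate;

    SynchronizedAggregator(DemoAggregator delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized void aggregate() {
        delegate.aggregate();
    }

    @Override
    public synchronized long get() {
        return delegate.get();
    }
}

// Hypothetical stand-in for AggregatorFactory with the proposed overload.
abstract class DemoAggregatorFactory {
    abstract DemoAggregator factorize(LongSupplier selector);

    // Default mirrors the proposal: ignore the flag and fall back to the old method.
    DemoAggregator factorize(LongSupplier selector, boolean isConcurrent) {
        return factorize(selector);
    }
}

// A "complex" factory that overrides the overload to lock only when asked to.
class DemoSketchFactory extends DemoAggregatorFactory {
    @Override
    DemoAggregator factorize(LongSupplier selector) {
        // Assumption: callers that do not state their context get the synchronized variant.
        return factorize(selector, true);
    }

    @Override
    DemoAggregator factorize(LongSupplier selector, boolean isConcurrent) {
        DemoAggregator plain = new PlainSumAggregator(selector);
        return isConcurrent ? new SynchronizedAggregator(plain) : plain;
    }
}

// Hypothetical stand-in for IncrementalIndex: it knows its context and forwards it.
class DemoIndex {
    private final boolean isConcurrent;

    DemoIndex(boolean isConcurrent) {
        this.isConcurrent = isConcurrent;
    }

    DemoAggregator newAggregator(DemoAggregatorFactory factory, LongSupplier selector) {
        return factory.factorize(selector, isConcurrent);
    }
}
```

In this sketch, new DemoIndex(false).newAggregator(new DemoSketchFactory(), selector) yields the lock-free aggregator, which is what a batch indexing task or a historical node would want, while new DemoIndex(true) yields the synchronized wrapper for the realtime case.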

Rationale

One other option would be to give aggregator implementors additional contextual information (e.g. the nodeType they are running on) and let them enable or disable synchronization based on that. However, the proposed approach is simpler for extension writers to use and takes away the guessing game. I also considered adding an enum like

enum ConcurrencyContext {
  NONE,
  MULTI_WRITE,
  SINGLE_WRITE_MULTI_READ,
  // ...
}

and using it instead of boolean isConcurrent in the arguments of the newly introduced methods, but couldn’t see any significant advantage in doing that for now.
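For comparison, a minimal sketch of the enum alternative might look like the following. DemoConcurrencyContext and requiresSynchronization() are hypothetical names, only the three values named above are included, and the rule that every value other than NONE needs locking is an assumption rather than something the proposal states.

```java
// Hypothetical sketch of the enum alternative to the boolean flag.
enum DemoConcurrencyContext {
    NONE,
    MULTI_WRITE,
    SINGLE_WRITE_MULTI_READ;

    // Assumed rule: anything other than NONE needs synchronization.
    boolean requiresSynchronization() {
        return this != NONE;
    }
}
```

Under that assumption, an aggregator that only decides whether or not to lock gets no more information from the enum than from boolean isConcurrent. The enum would only pay off for an implementation that treats MULTI_WRITE and SINGLE_WRITE_MULTI_READ differently.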

Operational impact

None

Test plan (optional)

Existing unit/integration tests would cover the changes introduced.

Future work (optional)

Adjust the relevant complex aggregator implementations to take advantage of the newly added methods.

Issue Analytics

State:
Created 4 years ago
Comments: 9 (8 by maintainers)

Top GitHub Comments

Eshcar commented, Jul 11, 2019

> that says that there is about 15 ms overhead for 1mn lock/release on object locks. I am pretty sure this is negligible compared to time spent doing sketch operations

This is not always correct. For example, an update of a theta sketch takes less than 10 ns, and when the sketch is very big it takes less than 5 ns. Specifically, adding 1M uniques to a sketch takes less than 10 ms. See https://datasketches.github.io/docs/Theta/ThetaUpdateSpeed.html. For these cases the overhead relative to the sketch operation is not negligible.

himanshug commented, Jul 22, 2019

Related to https://github.com/apache/incubator-druid/issues/8126, which removes usage of Aggregator from indexing code as well. I will add code changes for this proposal as a follow-up to https://github.com/apache/incubator-druid/issues/8126.
