[Proposal] Improve memory estimations in `OnHeapIncrementalIndex`
Motivation
The existing implementation in `OnHeapIncrementalIndex` tends to over-estimate memory usage, leading to more persist cycles than necessary during ingestion. A more accurate estimation of memory usage would also free up heap for other purposes.
The recent changes in #11950 improved this by adding a `guessAggregatorHeapFootprint()` method to the `Aggregator` interface. This proposal addresses the same problem but advocates replacing the guess mechanism with reporting the actual incremental memory used by an aggregator.
Proposed changes
- Update the method `Aggregator.aggregate()` to return a `long` instead of `void`. The returned `long` would represent the incremental memory in bytes used by the aggregator in that particular invocation of the method.
- Remove the method `DimensionIndexer.estimateEncodedKeyComponentSize()`.
- Update the method `DimensionIndexer.getUnsortedEncodedValueFromSorted()` to return objects of a new generic class `EncodedDimensionValue<EncodedType>` (sketched below) which contains:
  - `EncodedType value`: e.g. `int[]` for `StringDimensionIndexer`, `Long` for `LongDimensionIndexer`
  - `long incrementalSize`: the delta in size required for the `value`. For numerical values, e.g. in `LongDimensionIndexer`, it would just be the size of the datatype. But for `StringDimensionIndexer`, which returns an encoded `int[]`, it would represent the size of the array plus the size of any new dimension value that has been encoded into the dictionary in this invocation.

Simply put, `getUnsortedEncodedValueFromSorted()` now returns a payload and also the memory required for that payload.
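A minimal sketch of the proposed wrapper class (illustrative only; the accessor names are assumptions, not committed code):

```java
// Sketch of the proposed generic wrapper pairing an encoded dimension value
// with the incremental memory it requires. Accessor names are assumptions.
public class EncodedDimensionValue<EncodedType>
{
  private final EncodedType value;
  private final long incrementalSize;

  public EncodedDimensionValue(EncodedType value, long incrementalSize)
  {
    this.value = value;
    this.incrementalSize = incrementalSize;
  }

  // Encoded key component, e.g. int[] for strings, Long for longs.
  public EncodedType getValue()
  {
    return value;
  }

  // Extra bytes needed to store this value, including any new dictionary
  // entries created while encoding it.
  public long getIncrementalSize()
  {
    return incrementalSize;
  }
}
```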
Rationale
Estimation of a row size
row size = aggregator size + dims key size + overhead
Aggregator Size
Currently, we compute the max size for the aggregator and use it for every row, which overestimates the actual memory usage. With the proposed change, we would use the actual footprint of each row rather than the max, giving more accurate estimates.
Dims Key Size
The estimation of the dimension key size already takes into account the current row rather than the max size. Here, however, the overestimates come from repeatedly adding the footprint of the same String values (especially in the case of multi-valued dimensions), which are in fact stored only once in the dictionary; only the integer codes are present in every row. With the proposed change, we add the footprint of a String value only once, when it is newly added to the dictionary, as illustrated below.
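As an illustration, here is a toy dictionary encoder (a hypothetical class with made-up footprint constants, not Druid code) that charges a String's footprint only on first encounter, the way the proposal intends:

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration, not Druid code: a dictionary encoder that charges a
// String's footprint only the first time the value is seen.
class ToyStringDictionary
{
  private final Map<String, Integer> dictionary = new HashMap<>();

  // Returns the incremental bytes required to encode this value: the int
  // code present in every row, plus the String's footprint only if new.
  long encode(String value)
  {
    long size = Integer.BYTES;
    if (!dictionary.containsKey(value)) {
      dictionary.put(value, dictionary.size());
      size += 28 + 2L * value.length(); // rough per-String footprint guess
    }
    return size;
  }
}
```

Encoding the multi-valued row `["a", "b", "a"]` would charge the String footprint of `"a"` once; the repeated occurrence adds only the 4-byte integer code.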
Backwards Compatibility
Neither of the changes mentioned above would be backwards compatible.
Aggregator Size
Any class implementing the `Aggregator` interface would need to be fixed to return a `long` instead of `void`.
Workaround (Rejected):
To retain compatibility, we could add a new `default long aggregateAndEstimateMemory()` method and leave the existing `aggregate()` method as is. The default implementation would return a negative value (say -1), in which case the `OnHeapIncrementalIndex` or any other estimator would use the max estimated size for that invocation, as sketched below.
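For concreteness, a minimal sketch of this rejected approach (hypothetical; it was never adopted):

```java
// Rejected sketch: keep aggregate() unchanged and add a parallel default
// method that reports memory used, with a negative return meaning "unknown"
// so the caller falls back to the max estimated size.
interface Aggregator
{
  void aggregate();

  default long aggregateAndEstimateMemory()
  {
    aggregate();
    return -1L; // unknown; estimator should use the max estimated size
  }
}
```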
But this approach would be hacky and problematic in the long run, as callers of the `Aggregator` would be free to call either of the two aggregate methods, producing widely different and erroneous memory estimations.
Dims Key Size
Any class implementing `DimensionIndexer` would have to be fixed. (This change is less of a compatibility concern than the `Aggregator` change, as `DimensionIndexer` has fewer implementations.)
Workaround (Rejected):
To retain compatibility, we could retain `DimensionIndexer.estimateEncodedKeyComponentSize()` and have it account for a String value only when it is newly encountered. But this would require maintaining two dictionaries: the first used by `getUnsortedEncodedValueFromSorted()` to track encoded String values, and the second used by `estimateEncodedKeyComponentSize()` to track estimated String values.
This workaround would introduce unnecessary complexity and the overhead of maintaining two dictionaries.
Operational impact
- All aggregator implementations in extensions would have to be updated.
- Rolling Upgrade: Not affected
Future Work (optional)
The memory estimate values returned by the updated `aggregate()` method could be used by other callers (such as `TimeseriesQueryEngine`) to estimate memory if required.
This approach actually seems like a good idea to me.
Based on the discussion above, we will modify the aggregator interfaces as follows (see the sketch after this list):
- Interface `AggregatorFactory` will get a new `factorizeWithSize()` method which returns a structure containing both the `Aggregator` instance and its initial memory size.
- Interface `Aggregator` will get a new `aggregateWithSize()` method which returns a `long` representing the incremental memory used in that invocation of `aggregateWithSize()`. The default implementation of this method would call `aggregate()` and return 0. Aggregators such as `sum` can rely on the default implementation, thus always returning 0 from `aggregateWithSize()` and effectively making the aggregator size the same as the initial size returned from `AggregatorFactory.factorizeWithSize()`.
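A self-contained sketch of these additions (selector arguments and unrelated methods are omitted, and `AggregatorAndSize` is an assumed name for the holder structure; the committed signatures may differ):

```java
// Sketch of the agreed-upon interface additions, simplified for brevity.
interface Aggregator
{
  void aggregate();

  // Aggregates and returns the incremental heap usage of this call in bytes.
  // The default delegates to aggregate() and reports zero growth, so such an
  // aggregator is accounted at the initial size reported by its factory.
  default long aggregateWithSize()
  {
    aggregate();
    return 0L;
  }
}

// Assumed holder pairing an Aggregator with its initial size in bytes.
final class AggregatorAndSize
{
  private final Aggregator aggregator;
  private final long initialSizeBytes;

  AggregatorAndSize(Aggregator aggregator, long initialSizeBytes)
  {
    this.aggregator = aggregator;
    this.initialSizeBytes = initialSizeBytes;
  }

  Aggregator getAggregator()
  {
    return aggregator;
  }

  long getInitialSizeBytes()
  {
    return initialSizeBytes;
  }
}

abstract class AggregatorFactory
{
  abstract Aggregator factorize();

  abstract long getMaxIntermediateSize();

  // New method: returns the aggregator along with its initial heap footprint,
  // sketched here as defaulting to the existing max-size estimate.
  AggregatorAndSize factorizeWithSize()
  {
    return new AggregatorAndSize(factorize(), getMaxIntermediateSize());
  }
}
```

The default implementation keeps existing aggregators source-compatible: they report zero incremental growth and are simply accounted at the initial size from `factorizeWithSize()`.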
In the first iteration, we would put the new estimation logic behind a feature flag. After some more testing, the flag can be removed altogether.
The following metrics will also be added (if they don’t already exist):