PIP-191: Support batched message using entry filter

Discussion mail thread: https://lists.apache.org/thread/cdw5c2lpj5nwzl2zqyv8mphsqqv9vozj

Motivation

  • This PIP introduces a way to support batched messages in entry filters without having to deserialize the entry, by requiring that all messages in one batch share the same values for selected properties.

  • We already have a pluggable way to filter entries in the broker, namely PIP-105 https://github.com/apache/pulsar/issues/12269. But this approach has some drawbacks:

    • It doesn’t support batched messages naturally, because the entry filter only knows the main header of the entry and doesn’t dig into the payload to deserialize each single message’s metadata.
    • If the developer of the entry filter wants to filter batched messages, they have to deserialize the payload to get each message’s properties, which brings a higher memory and CPU workload.
  • Let’s explain the current entry filters in detail. Skip to the “Solutions” part directly if the drawbacks above are already clear to you.

    • Today, when an entry filter receives an entry, it gets an Entry that has:

       public interface Entry {
           byte[] getData();
           byte[] getDataAndRelease();
           int getLength();
           ByteBuf getDataBuffer();
           Position getPosition();
           long getLedgerId();
           long getEntryId();
           boolean release();
       }
      

      The Entry interface doesn’t tell you whether this is a batched entry. You also get a FilterContext:

      @Data
      public class FilterContext {
          private Subscription subscription;
          private MessageMetadata msgMetadata;
          private Consumer consumer;
      }
      

      and in MessageMetadata, you have

       // differentiate single and batch message metadata
        optional int32 num_messages_in_batch = 11 [default = 1];
      

      This field lets you know whether the entry is batched.

    • The filter developer can choose which class deserializes the entry byte array into a list of separate messages. So today, when an entry is batched, the filter developer can act on its individual messages only by paying the cost of deserializing it.

  • Solutions: How can we use an entry filter with batched messages without having to deserialize the entry?

    • One of the rejected alternatives: we could alter the producers to extract specific properties from each message and place those property values in the message metadata of the batched entry. The filter could then use the values to decide whether to reject or accept the entry.

      The problem is that if you have different values for a given property for each message in the batch, then the filter author can’t provide a reject or accept for this entry since some messages are rejected, and some are accepted.

    • Solution: So the only solution is to change the way messages are batched and collect the records into a batch only if they have the same values for the properties configured to be extracted. If a message is added to the producer and its properties are not the same as those of the batched records, it will trigger a send of the current batch and start a new batch with that message.

In summary, this proposal introduces another trigger condition to send the batch, on top of the current max count, max size, and max delay: once a message is requested to be added to a batch, if its property values (for the subset of properties defined in a new configuration) differ from those of the records in the batch (i.e. the first record’s property values), it will trigger a batch flush (i.e. send and clear).
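To illustrate the filter side of this idea, here is a minimal sketch, not the broker's actual plug-in API. It assumes the shared property values have been copied into the batch-level metadata, so a single accept/reject decision holds for every message in the batch. `BatchMetadata`, `Decision`, and `filterByProperty` are hypothetical stand-ins for illustration only.

```java
import java.util.Map;

// Hypothetical stand-ins: the real broker hands a filter an Entry plus a
// FilterContext carrying MessageMetadata; these types only mimic that shape.
final class BatchLevelFilterSketch {

    enum Decision { ACCEPT, REJECT }

    /** Simplified view of the entry-level metadata readable without touching the payload. */
    static final class BatchMetadata {
        final int numMessagesInBatch;
        final Map<String, String> properties;

        BatchMetadata(int numMessagesInBatch, Map<String, String> properties) {
            this.numMessagesInBatch = numMessagesInBatch;
            this.properties = properties;
        }
    }

    /** Decides for the whole batch using only batch-level properties. */
    static Decision filterByProperty(BatchMetadata metadata, String key, String wantedValue) {
        // Every message in the batch shares this value (the assumption of the
        // proposal), so one decision is valid for all of them.
        return wantedValue.equals(metadata.properties.get(key)) ? Decision.ACCEPT : Decision.REJECT;
    }

    public static void main(String[] args) {
        BatchMetadata batch = new BatchMetadata(2, Map.of("region", "eu"));
        System.out.println(filterByProperty(batch, "region", "eu")); // prints ACCEPT
        System.out.println(filterByProperty(batch, "region", "us")); // prints REJECT
    }
}
```

The payload is never deserialized in this sketch; the decision depends only on metadata that is already available to the filter.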

API Changes

  • Because we know which property keys/values our entry filter will use, we only need to pick those properties when applying this proposal. Add a producer config to specify the property keys/values. Only messages that have the same values for the properties in this config will be batched together under this proposal.

    org.apache.pulsar.client.impl.conf.ProducerConfigurationData#restrictSameValuesInBatchProperties
    
    • The type of restrictSameValuesInBatchProperties is Map<String, List<String>>: the map’s key is the property key, and the map’s value is the list of property values.
    • If restrictSameValuesInBatchProperties is empty (the default), grouping by properties does not take effect.
    • Messages whose properties have the same values for the keys/values listed in restrictSameValuesInBatchProperties will be placed into the same batch.
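As a rough sketch of how such a configuration could be applied, the following shows one possible way to pick out the restricted properties of a message. It is an assumption about the semantics described above, not the actual client code; the class name `RestrictedPropertiesSketch` and the method `extract` are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class RestrictedPropertiesSketch {

    /**
     * Keeps only the properties whose key is configured and whose value is
     * one of the values listed for that key. An empty config yields an empty
     * result, i.e. grouping by properties has no effect.
     */
    static Map<String, String> extract(Map<String, String> messageProperties,
                                       Map<String, List<String>> restrictSameValuesInBatchProperties) {
        Map<String, String> extracted = new TreeMap<>(); // sorted keys for stable output
        for (Map.Entry<String, List<String>> rule : restrictSameValuesInBatchProperties.entrySet()) {
            String value = messageProperties.get(rule.getKey());
            if (value != null && rule.getValue().contains(value)) {
                extracted.put(rule.getKey(), value);
            }
        }
        return extracted;
    }

    public static void main(String[] args) {
        Map<String, List<String>> config = Map.of(
                "region", List.of("us", "eu"),
                "version", List.of("1", "2"));
        // "tag" is not configured, so it is ignored.
        System.out.println(extract(Map.of("region", "eu", "version", "1", "tag", "a"), config));
        // prints {region=eu, version=1}
    }
}
```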

Implementation

  • When calling org.apache.pulsar.client.impl.BatchMessageContainerImpl#add, we extract the message properties and add them to the metadata:
 public boolean add(MessageImpl<?> msg, SendCallback callback) {

        if (log.isDebugEnabled()) {
            log.debug("[{}] [{}] add message to batch, num messages in batch so far {}", topicName, producerName,
                    numMessagesInBatch);
        }

        if (++numMessagesInBatch == 1) {
            try {
                // some properties are common amongst the different messages in the batch, hence we just pick it up from
                // the first message
                messageMetadata.setSequenceId(msg.getSequenceId());
                List<KeyValue> filterProperties = getProperties(msg);
                if (!filterProperties.isEmpty()) {
                    messageMetadata.addAllProperties(filterProperties);  // add the extracted message properties here
                }
  • We also need to add a method hasSameProperties, similar to hasSameSchema. Messages with the same properties can be added to the same batch. Once a message with different properties is added, the producer will trigger a flush and send the batch.
 private boolean canAddToCurrentBatch(MessageImpl<?> msg) {
     return batchMessageContainer.haveEnoughSpace(msg)  // messageContainer controls the memory
             && (!isMultiSchemaEnabled(false) || batchMessageContainer.hasSameSchema(msg))
             && batchMessageContainer.hasSameProperties(msg)  // invoke it here
             && batchMessageContainer.hasSameTxn(msg);
 }

  • In summary, the main modifications in this proposal are:

    • Extract the properties of the first message in the batch and fill them into BatchMessageContainerImpl#messageMetadata.
    • Additionally check, in the ProducerImpl#canAddToCurrentBatch method, whether the message being sent has the same properties as those in BatchMessageContainerImpl#messageMetadata.

Example

Here is an example that may help in understanding this:

  • Let’s set restrictSameValuesInBatchProperties=<region=us,eu; version=1,2>. This means that only the key region with value ‘us’ or ‘eu’, and the key version with value ‘1’ or ‘2’, will be extracted into the batch meta properties.

  • Then we have a producer that sends the messages below in order:

    • msg1 with properties: <region: eu>
    • msg2 with properties: <region: eu>
    • msg3 with properties: <region: eu, version:1, tag:a>
    • msg4 with properties: <region: eu, version:1>
    • msg5 with properties: <region: us, version:1>
    • msg6 with properties: <region: us, version:2>
    • msg7 with properties: <region: us, version:5>
    • msg8 with properties: <region: us, version:6>
  • The process of properties extraction will be:

    • msg1 and msg2 have the same properties: <region: eu>, so they will be put into the same batch.
    • msg3 and msg4 have the same properties: <region: eu, version:1>. tag:a in msg3 will be ignored because restrictSameValuesInBatchProperties doesn’t contain ‘tag’. So msg3 and msg4 will be put into the same batch.
    • msg5 and msg6 have different properties, because their version values differ. So msg5 and msg6 are published in different batches.
    • msg7 and msg8 have the same properties <region: us>; version is ignored because its values (5 and 6) are not listed in restrictSameValuesInBatchProperties. So msg7 and msg8 will be put into the same batch.
  • Just to summarize, the result will be:

| batch | batch meta properties | single meta properties | payload | single meta properties | payload |
|---|---|---|---|---|---|
| batch1 | <region: eu> | <region: eu> | msg1 | <region: eu> | msg2 |
| batch2 | <region: eu, version:1> | <region: eu, version:1, tag:a> | msg3 | <region: eu, version:1> | msg4 |
| batch3 | <region: us, version:1> | <region: us, version:1> | msg5 | | |
| batch4 | <region: us, version:2> | <region: us, version:2> | msg6 | | |
| batch5 | <region: us> | <region: us, version:5> | msg7 | <region: us, version:6> | msg8 |
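The grouping in this example can be reproduced with a small simulation. This is only a sketch under simplifying assumptions: it models just the new "properties differ" flush trigger and ignores the existing max count, max size, and max delay triggers. All names (`PropertyBatchingSketch`, `Message`, `extract`, `batch`) are hypothetical and do not correspond to Pulsar client classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PropertyBatchingSketch {

    static final class Message {
        final String name;
        final Map<String, String> properties;

        Message(String name, Map<String, String> properties) {
            this.name = name;
            this.properties = properties;
        }
    }

    /** Keeps only configured keys whose value is listed for that key. */
    static Map<String, String> extract(Map<String, String> props, Map<String, List<String>> config) {
        Map<String, String> extracted = new TreeMap<>();
        for (Map.Entry<String, List<String>> rule : config.entrySet()) {
            String value = props.get(rule.getKey());
            if (value != null && rule.getValue().contains(value)) {
                extracted.put(rule.getKey(), value);
            }
        }
        return extracted;
    }

    /** Groups messages in send order, flushing whenever the extracted properties change. */
    static List<List<String>> batch(List<Message> messages, Map<String, List<String>> config) {
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        Map<String, String> currentProps = null; // extracted properties of the first message in the batch
        for (Message m : messages) {
            Map<String, String> props = extract(m.properties, config);
            if (!current.isEmpty() && !props.equals(currentProps)) {
                batches.add(current);            // flush: send and clear
                current = new ArrayList<>();
            }
            if (current.isEmpty()) {
                currentProps = props;
            }
            current.add(m.name);
        }
        if (!current.isEmpty()) {
            batches.add(current);                // flush whatever is left at the end
        }
        return batches;
    }

    static Map<String, List<String>> exampleConfig() {
        return Map.of("region", List.of("us", "eu"), "version", List.of("1", "2"));
    }

    static List<Message> exampleMessages() {
        return List.of(
                new Message("msg1", Map.of("region", "eu")),
                new Message("msg2", Map.of("region", "eu")),
                new Message("msg3", Map.of("region", "eu", "version", "1", "tag", "a")),
                new Message("msg4", Map.of("region", "eu", "version", "1")),
                new Message("msg5", Map.of("region", "us", "version", "1")),
                new Message("msg6", Map.of("region", "us", "version", "2")),
                new Message("msg7", Map.of("region", "us", "version", "5")),
                new Message("msg8", Map.of("region", "us", "version", "6")));
    }

    public static void main(String[] args) {
        System.out.println(batch(exampleMessages(), exampleConfig()));
        // prints [[msg1, msg2], [msg3, msg4], [msg5], [msg6], [msg7, msg8]]
    }
}
```

Because batches are closed strictly in send order, message ordering is preserved; the cost is that alternating property values produce many small batches.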

Trade-off

The side effect of this behavior is that it can easily end up with tiny batches, perhaps even 1 record per batch. There is a good chance that once users turn this feature on, they will lose all the performance benefits of batching because the batches will be very small. It depends entirely on the distribution of property values.

In spite of this, we should clarify that entry filters don’t support batched messages currently. So this proposal makes it possible for batched messages to use entry filters as well. It brings great benefits, especially when you know the distribution of values.

Rejected Alternatives

  • Implement an AbstractBatchMessageContainer, say BatchMessagePropertiesBasedContainer, that keeps messages with the same properties in a single hashmap entry, like BatchMessageKeyBasedContainer.

Rejection reason: This will publish messages out of order.

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 22 (20 by maintainers)

Top GitHub Comments

AnonHxy commented, Oct 3, 2022

What ended up with this feature?

There was a VOTE[1] about this PIP, and it ended up with 1 binding -1 and 1 non-binding +1. So maybe this feature needs more discussion or should be canceled @asafm

github-actions[bot] commented, Nov 3, 2022

The issue had no activity for 30 days, mark with Stale label.
