to_kafka throughput
I’m testing the to_kafka sink, and its throughput is limited by the poll interval (0.2 s). It looks like self.producer.poll(0) serves only one delivery callback per call, so only one callback fires every 0.2 seconds.
This fails:
def test_to_kafka_throughput():
    ARGS = {'bootstrap.servers': 'localhost:9092'}
    with kafka_service() as kafka:
        _, TOPIC = kafka
        source = Stream.from_iterable(range(100)).map(lambda x: str(x).encode())
        kafka = source.to_kafka(TOPIC, ARGS)
        out = kafka.sink_to_list()
        source.start()
        wait_for(
            lambda: len(out) == 100,
            5,
            period=0.1,
            fail_func=lambda: print("len(out) ==", len(out))
        )
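The 5-second timeout above cannot be met under this mode. As a rough back-of-envelope sketch (assuming, per the issue, that exactly one delivery callback is served per 0.2 s poll tick, and that the source awaits each callback before emitting the next item):

```python
POLL_INTERVAL = 0.2   # to_kafka's poll period, per the issue
N_MESSAGES = 100      # items emitted by the test source
TIMEOUT = 5           # wait_for timeout in the test

# One message completes per poll tick, so total delivery time is roughly:
expected_seconds = N_MESSAGES * POLL_INTERVAL
print(expected_seconds)  # ~20 s, far beyond the 5 s timeout
assert expected_seconds > TIMEOUT
```

So the test must fail long before all 100 results land in `out`.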
The existing test_to_kafka test doesn’t catch this, because it starts waiting on the result only after all the items have been emitted.
I spent some time tinkering with the code, but I can’t figure out what’s wrong or how to fix it, so any ideas are appreciated.
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 1
- Comments: 13 (13 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I think I get it.
The source emits the next item only after the future it awaits is resolved. But that future is resolved only once poll is called and the previous callback has executed, so items are emitted and delivered one at a time, one per .poll() call, every 0.2 seconds.

Well, this is certainly an understandable mode: effectively we have a buffer of one. But I don’t think the docstring describes this, so it is lacking in at least that respect. I don’t see any reason we can’t let messages pile up in the producer’s buffer, since it will execute the callbacks for us. In the case where we don’t need to reference-count, this would be much faster, as @roveo points out, due to batching, but messages might then be sent out of order.
There is not much in the documentation about message ordering beyond this opaque statement, which is another gap that needs filling.
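The out-of-order risk mentioned above comes from batching plus retries: a failed batch can be retried after a later batch has already been written. As a hedged sketch (not from this issue), these confluent_kafka/librdkafka producer settings are the usual way to restore per-partition ordering; treat the exact names and trade-offs as assumptions to verify against your client version:

```python
# Hypothetical producer config sketch (confluent_kafka / librdkafka names):
PRODUCER_CONF = {
    'bootstrap.servers': 'localhost:9092',
    # With batching and retries, a failed batch may be retried after a
    # later batch has already succeeded, reordering messages. Either of
    # these restores per-partition ordering, at some cost to throughput:
    'enable.idempotence': True,  # ordered, deduplicated delivery per partition
    # 'max.in.flight.requests.per.connection': 1,  # alternative: one batch in flight
}
```

With a config like this, letting messages accumulate in the producer buffer (instead of awaiting each callback) would keep the batching speedup without giving up ordering.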