to_kafka throughput
I’m testing the to_kafka sink, and its throughput is limited by the poll interval (0.2 s). It looks like self.producer.poll(0) serves only one delivery callback per call, so only one callback fires every 0.2 seconds.
This fails:
def test_to_kafka_throughput():
    ARGS = {'bootstrap.servers': 'localhost:9092'}
    with kafka_service() as kafka:
        _, TOPIC = kafka
        source = Stream.from_iterable(range(100)).map(lambda x: str(x).encode())
        kafka = source.to_kafka(TOPIC, ARGS)
        out = kafka.sink_to_list()
        source.start()
        wait_for(
            lambda: len(out) == 100,
            5,
            period=0.1,
            fail_func=lambda: print("len(out) ==", len(out))
        )
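The 5-second timeout above cannot be met under this mode. As a rough back-of-envelope sketch (assuming, per the issue, that exactly one delivery callback is served per 0.2 s poll tick, and that the source awaits each callback before emitting the next item):

```python
POLL_INTERVAL = 0.2   # to_kafka's poll period, per the issue
N_MESSAGES = 100      # items emitted by the test source
TIMEOUT = 5           # wait_for timeout in the test

# One message completes per poll tick, so total delivery time is roughly:
expected_seconds = N_MESSAGES * POLL_INTERVAL
print(expected_seconds)  # ~20 s, far beyond the 5 s timeout
assert expected_seconds > TIMEOUT
```

So the test must fail long before all 100 results land in `out`.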
The existing test_to_kafka test doesn’t catch this, because it starts waiting on the result only after all the items have been emitted.
I spent some time tinkering with the code, but I can’t figure out what’s wrong or how to fix it, so any ideas are appreciated.
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 1
- Comments: 13 (13 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I think I get it.
The source emits the next item only after the future it awaits is resolved. But that future is resolved only once poll is called and the previous callback has executed, so items are emitted and delivered one at a time, one per .poll() call, every 0.2 seconds.

Well, this is certainly an understandable mode: effectively we have a buffer of one. But I don’t think the docstring describes this, so it is lacking in at least that respect. I don’t see any reason we can’t let messages pile up in the producer’s buffer, since it will execute the callbacks for us. In the case where we don’t need to reference-count, this would be much faster, as @roveo points out, due to batching, but messages might then be sent out of order.
There is not much in the documentation about message ordering beyond this opaque statement, which is another gap that needs filling.
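The out-of-order risk mentioned above comes from batching plus retries: a failed batch can be retried after a later batch has already been written. As a hedged sketch (not from this issue), these confluent_kafka/librdkafka producer settings are the usual way to restore per-partition ordering; treat the exact names and trade-offs as assumptions to verify against your client version:

```python
# Hypothetical producer config sketch (confluent_kafka / librdkafka names):
PRODUCER_CONF = {
    'bootstrap.servers': 'localhost:9092',
    # With batching and retries, a failed batch may be retried after a
    # later batch has already succeeded, reordering messages. Either of
    # these restores per-partition ordering, at some cost to throughput:
    'enable.idempotence': True,  # ordered, deduplicated delivery per partition
    # 'max.in.flight.requests.per.connection': 1,  # alternative: one batch in flight
}
```

With a config like this, letting messages accumulate in the producer buffer (instead of awaiting each callback) would keep the batching speedup without giving up ordering.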