
to_kafka throughput


I’m testing the to_kafka sink, and its throughput is limited by the polltime (0.2 s). It looks like self.producer.poll(0) only polls for one message at a time, so only one delivery callback fires every 0.2 seconds.

This fails:

from streamz import Stream
# wait_for and kafka_service are helpers from the streamz test suite;
# the import paths below are my best guess at their locations
from streamz.utils_test import wait_for
from streamz.tests.test_kafka import kafka_service

def test_to_kafka_throughput():
    ARGS = {'bootstrap.servers': 'localhost:9092'}
    with kafka_service() as kafka:
        _, TOPIC = kafka
        source = Stream.from_iterable(range(100)).map(lambda x: str(x).encode())
        sink = source.to_kafka(TOPIC, ARGS)  # renamed from `kafka` to avoid shadowing the fixture
        out = sink.sink_to_list()

        source.start()
        # expect all 100 delivery confirmations within 5 seconds
        wait_for(
            lambda: len(out) == 100,
            5,
            period=0.1,
            fail_func=lambda: print("len(out) ==", len(out))
        )

The existing test_to_kafka test doesn’t catch this, because it only starts waiting on the result after all the items have been emitted.

I spent some time tinkering with the code, but I can’t figure out what’s wrong or how to fix it, so any ideas are appreciated.
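The arithmetic behind the timeout (my reading of the numbers above): if each poll resolves one delivery per 0.2 s, 100 messages need about 20 s, far past the 5 s wait_for timeout.

```python
POLLTIME = 0.2    # the sink's polling interval, per the report above
N_MESSAGES = 100
TIMEOUT = 5       # the wait_for timeout in the failing test

# one delivery callback per poll tick -> one message per POLLTIME seconds
expected_runtime = N_MESSAGES * POLLTIME
print(expected_runtime)            # roughly 20 (seconds)
print(expected_runtime > TIMEOUT)  # the test cannot finish in time
```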

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

1 reaction
roveo commented, Dec 4, 2020

I think I get it.

The source emits the next item only after the future it awaits is resolved. But that future is resolved only once poll is called and the previous callback has executed, so items are emitted and delivered one at a time, one per .poll() call, every 0.2 seconds.
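A toy simulation of that mechanism (a sketch with a stand-in timing model, not the streamz or confluent_kafka code) shows how the in-flight limit of one turns the 0.2 s polltime into the throughput ceiling, and how letting messages pile up in the producer buffer removes it:

```python
POLLTIME = 0.2  # seconds between poll() calls, as in the report

def simulate(n_messages, max_in_flight):
    """Simulated wall-clock time to deliver n_messages when the source
    may have at most max_in_flight undelivered messages outstanding and
    each poll() tick resolves every pending delivery callback."""
    delivered = 0
    ticks = 0
    while delivered < n_messages:
        # the source emits only while its in-flight budget allows
        emitted = min(n_messages, delivered + max_in_flight)
        ticks += 1          # wait for the next poll() tick...
        delivered = emitted # ...which resolves all pending callbacks
    return ticks * POLLTIME

print(simulate(100, 1))    # buffer of one: ~20 s, matching the failing test
print(simulate(100, 100))  # batched: everything delivered on one tick
```

Even though this model optimistically assumes one poll() tick can resolve every pending callback, the in-flight budget of one already caps throughput at one message per polltime.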

0 reactions
martindurant commented, Dec 7, 2020

So would you say that it works as intended?

Well, this is certainly an understandable mode: effectively we have a buffer of one. But I don’t think the docstring describes this, so I would say it’s lacking in at least that respect. I don’t see any reason we can’t let messages pile up in the producer buffer, since it will execute the callbacks for us. In the case where we don’t want to reference count, this would be much faster, as @roveo points out, due to batching, but messages might be sent out of order.

There is not much in the documentation about message ordering, except this opaque statement, another gap that needs filling.
