Having to run 12x Elastic Rally instances on the `elastic/logs` track to bottleneck the CPU on the hot data tier
While I don’t have anything super useful to add here in terms of replacements, I would just like to throw my anecdotal hat into the ring with respect to the `elastic/logs` track, which I was trying to run against our new NVMe-backed hot data tier on on-prem hardware within an ECE cluster. Scaling from targeting 1 shard to 2 shards and beyond did not improve the overall indexing throughput. I specifically increased the corpus size to around 60 days of data to ensure I had plenty of events to index. My goal was to understand the behaviour of the new cluster with respect to hot spotting and shard and replica counts. Unfortunately, Elastic Rally initially gave me the wrong idea.
It wasn’t until I ran multiple copies of Elastic Rally with identical settings concurrently from the same host that I was able to start approaching any of the hardware limits in the cluster. In the end, I had to run 12x Elastic Rally instances on the `elastic/logs` track to bottleneck the CPU on the hot data tier. I executed all 12 instances from a single server (backed by NVMe, 128 GB of RAM, 32c/64t, 10 Gb network). This raised the actual indexing rate from 60-70,000 docs/s to 550-600,000 docs/s. The reality was that the server sending the logs wasn’t a limiting factor, nor were the hot data tier nodes; Elastic Rally itself simply could not provide documents fast enough to index.
My suspicion was that, similar to the Golang stdlib `encoding/json` package, the performance-critical paths are not heavily optimised in Python. This issue seems to validate that theory; I just wanted to provide a real-world example of where Elastic Rally’s performance produces results that could easily be misconstrued by naive users such as myself.
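As a rough illustration of the serialisation cost (this is not Rally’s actual code path, and the document shape is made up), you can time how many small log-like documents the stdlib `json` module can encode per second on one core:

```python
import json
import timeit

# A hypothetical log-like document, roughly the shape of an event
# from a logging track (fields are illustrative, not Rally's).
doc = {
    "@timestamp": "2022-08-24T00:00:00Z",
    "host": {"name": "web-01"},
    "log": {"level": "info"},
    "message": "GET /index.html 200 " + "x" * 180,
}

n = 10_000
secs = timeit.timeit(lambda: json.dumps(doc), number=n)
print(f"stdlib json: ~{n / secs:,.0f} docs/s serialised on a single core")
```

Comparing a number like this against the per-client indexing rate you observe gives a feel for how much headroom a single Python load-driver process actually has.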
_Originally posted by @berglh in https://github.com/elastic/rally/issues/1046#issuecomment-1225252763_
Issue Analytics
- State: Closed
- Created: a year ago
- Comments: 8 (4 by maintainers)
Top GitHub Comments
I opened https://github.com/elastic/rally-tracks/pull/309, please tell me what you think!
Not sure listing would help as much, I think the directory layout makes it clear that all subdirectories are tracks. And the list would quickly get stale.
I opened https://github.com/elastic/rally/pull/1568, please tell me what you think!
By default, indexing goes “as fast as possible”, but each client still waits for its request to be completed before sending another one. This is why you need more indexing clients to saturate the CPU: the work the load driver does is I/O bound. Does that make sense? I’m not sure I’ve understood your point.
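The behaviour described above, each client blocking on its in-flight request, can be sketched with a toy simulation. The latencies, durations, and client counts are made up; this is not Rally’s implementation, just an illustration of why an I/O-bound load driver scales with client count:

```python
import asyncio
import time

async def client(latency_s: float, stop_at: float) -> int:
    """Simulate one indexing client: each bulk request must complete
    (the await) before the next one is sent."""
    completed = 0
    while time.monotonic() < stop_at:
        await asyncio.sleep(latency_s)  # stand-in for a bulk request round trip
        completed += 1
    return completed

async def run(n_clients: int, latency_s: float = 0.01,
              duration_s: float = 0.2) -> int:
    """Run n concurrent clients for duration_s and count completed requests."""
    stop_at = time.monotonic() + duration_s
    results = await asyncio.gather(
        *(client(latency_s, stop_at) for _ in range(n_clients)))
    return sum(results)

if __name__ == "__main__":
    print("1 client: ", asyncio.run(run(1)), "requests")
    print("8 clients:", asyncio.run(run(8)), "requests")
```

With a fixed round-trip latency, total throughput grows roughly linearly with the number of clients until some real resource (CPU on either side, network, disk) becomes the bottleneck.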
Now, if you want to know the latency for each client, you can configure a metrics store and look at the metrics.
Makes sense!
Anyway, I’m going to close this issue now as there’s nothing actionable for Rally left here. Thanks!
Thanks for your help @pquentin