Reindex performance degrading logarithmically
Hi! We're re-indexing a 7GB index, and noticed that performance starts out fast then logarithmically degrades over time.
We're using elasticsearch v1.7.3 and elasticsearch-py v1.9.0.
We're following all the recommendations for increasing indexing performance, e.g.:
- index.refresh_interval: -1
- index.store.throttle.type: none
- index.translog.flush_threshold_size: 1g
- index.number_of_replicas: 0
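For reference, a minimal sketch of how settings like these might be applied with elasticsearch-py; the host and index names are illustrative, and the client call itself requires a live cluster, so it is shown in a comment:

```python
# Sketch: the temporary bulk-indexing settings from this issue, expressed
# as the body of an update-settings request. Elasticsearch accepts the
# flattened dot-notation keys used here.
BULK_SETTINGS = {
    "index": {
        "refresh_interval": "-1",
        "store.throttle.type": "none",
        "translog.flush_threshold_size": "1g",
        "number_of_replicas": 0,
    }
}

# With a live cluster (hypothetical host and index name):
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch(["localhost:9200"])
#   es.indices.put_settings(index="my-index", body=BULK_SETTINGS)
```

Remember to restore refresh_interval and number_of_replicas once the bulk load finishes.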
Our cluster is at AWS and is comprised of the following:
- 5× m4.xlarge data nodes
- 3× m3.medium master nodes
- 1× m4.large client node
This cluster should be plenty beefy for indexing a paltry 7GB of data. The original indexing only took a couple of hours to complete, but this re-indexing has been going for nearly 24 hours and is only 70% done. And it only seems to be getting slower as time goes on. At this rate, the re-index will never finish.
We've tried various chunk sizes in the reindex() call, but it doesn't seem to affect performance, so we're using the default of 500.
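A sketch of the reindex call being described, assuming elasticsearch-py 1.x's helpers.reindex; the index names are hypothetical, and the actual call (which needs a live cluster) is shown in a comment. Note that chunk_size controls the bulk-write batch size, not the scroll read size:

```python
# Hypothetical reindex invocation. Building the kwargs separately makes
# it easy to see which knobs the issue is discussing.
def reindex_kwargs(chunk_size=500):
    """Keyword arguments for elasticsearch.helpers.reindex (sketch)."""
    return {
        "source_index": "old-index",  # hypothetical name
        "target_index": "new-index",  # hypothetical name
        "chunk_size": chunk_size,     # bulk batch size; 500 is the default
    }

# With a live cluster:
#   from elasticsearch import Elasticsearch, helpers
#   es = Elasticsearch(["localhost:9200"])
#   helpers.reindex(es, **reindex_kwargs())
```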
The Python script's CPU usage is relatively low, while ES is blasting away at the CPU.
Any ideas on what would cause this behavior? And how to get past it? I’m suspecting that there’s an issue with scan/scroll. It’s almost like the client needs to seek through all the previous chunks to get to the next chunk, so everything is getting slower the further it gets. But that’s just a wild guess.
Fixing this is essential for completing our upgrade to ES 2.3, especially since we have indices that are 10x the size of this 7GB index that we will need to be reindexing as well. Thanks!
Created 7 years ago · 15 comments (2 by maintainers)
Top GitHub Comments
This was solved in https://github.com/elastic/elasticsearch/issues/18253. Basically I needed to explicitly set a much higher size value in the scan_kwargs argument to reindex(). The default of 10 is way too low for reindex operations. I will open another ticket suggesting a higher default value for reindex(), or at least updating the documentation to explain why.
Hi @jsnod, the best way to contact the Elasticsearch core team with a technical issue like this one (a reproducible problem) is to open an issue at the GitHub repository: https://github.com/elastic/elasticsearch