Elasticsearch bulk() chunk size is too large for AWS Elasticsearch Service.
If your site has a lot of search index data and you run the update_index management command, using an Elasticsearch5 backend through AWS’s Elasticsearch Service, the index operation can crash with the following error:
File "celery/app/trace.py", line 374, in trace_task
R = retval = fun(*args, **kwargs)
File "celery/app/trace.py", line 629, in __protected_call__
return self.run(*args, **kwargs)
File "<app>/celery.py", line 49, in wrapper
f(*args, **kwargs)
File "core/tasks.py", line 20, in rebuild_search_index
call_command('update_index')
File "django/core/management/__init__.py", line 130, in call_command
return command.execute(*args, **defaults)
File "django/core/management/base.py", line 330, in execute
output = self.handle(*args, **options)
File "wagtail/wagtailsearch/management/commands/update_index.py", line 120, in handle
self.update_backend(backend_name, schema_only=options.get('schema_only', False))
File "wagtail/wagtailsearch/management/commands/update_index.py", line 87, in update_backend
index.add_items(model, chunk)
File "wagtail/wagtailsearch/backends/elasticsearch.py", line 580, in add_items
bulk(self.es, actions)
File "elasticsearch/helpers/__init__.py", line 194, in bulk
for ok, item in streaming_bulk(client, actions, **kwargs):
File "elasticsearch/helpers/__init__.py", line 162, in streaming_bulk
for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
File "elasticsearch/helpers/__init__.py", line 91, in _process_bulk_chunk
raise e
File "elasticsearch/helpers/__init__.py", line 87, in _process_bulk_chunk
resp = client.bulk('\n'.join(bulk_actions) + '\n', **kwargs)
File "elasticsearch/client/utils.py", line 71, in _wrapped
return func(*args, params=params, **kwargs)
File "elasticsearch/client/__init__.py", line 1096, in bulk
doc_type, '_bulk'), params=params, body=self._bulk_body(body))
File "elasticsearch/transport.py", line 318, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "elasticsearch/connection/http_urllib3.py", line 127, in perform_request
self._raise_error(response.status, raw_data)
File "elasticsearch/connection/base.py", line 122, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
TransportError: TransportError(413, '{"Message":"Request size exceeded 10485760 bytes"}')
It gets thrown from the elasticsearch.helpers.bulk() call in wagtail.wagtailsearch.backends.elasticsearch.ElasticSearchIndex.add_items().
This happens because the default value of the max_chunk_bytes keyword arg for bulk() is 100 megabytes, but Amazon only allows 10 megabytes per bulk request (as the error message shows).
I’ve subclassed ElasticSearchIndex and overridden add_items() to explicitly set max_chunk_bytes to 10 MB, and I can confirm that this fixes the crash.
I’d make a PR for this, but I’m not really sure what the best way to implement it would be. Simply changing the code to set this keyword arg seems unlikely to be globally viable, since other Elasticsearch services presumably allow larger chunks. Perhaps an additional config setting in WAGTAILSEARCH_BACKENDS would work? I’m not really sure how those function, though, since I’ve never used them.
What do you guys think?
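For illustration, the kind of setting I’m imagining would look something like this. The MAX_CHUNK_BYTES key is purely hypothetical and doesn’t exist in any released backend; the other keys are the usual Elasticsearch backend options.

```python
# settings.py -- MAX_CHUNK_BYTES is a hypothetical option, shown only to
# sketch what such a setting could look like.
WAGTAILSEARCH_BACKENDS = {
    'default': {
        'BACKEND': 'wagtail.wagtailsearch.backends.elasticsearch5',
        'URLS': ['https://my-domain.us-east-1.es.amazonaws.com'],
        'INDEX': 'wagtail',
        'MAX_CHUNK_BYTES': 10 * 1024 * 1024,  # cap bulk requests at 10 MB
    }
}
```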
Top GitHub Comments
I expect there is something similar for other client languages too, but if you are using Python you can use the following to get around the problem:
https://elasticsearch-py.readthedocs.io/en/master/helpers.html#elasticsearch.helpers.parallel_bulk
You just need to set max_chunk_bytes=10485760.
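For example (the endpoint, index name, and documents below are placeholders; only the max_chunk_bytes value comes from this comment):

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

# Placeholder client and documents, just to show where the kwarg goes.
es = Elasticsearch(['https://my-domain.us-east-1.es.amazonaws.com:443'])
actions = (
    {'_index': 'wagtail', '_type': 'doc', '_id': i, '_source': {'title': 'Page %d' % i}}
    for i in range(100000)
)

# parallel_bulk() is lazy, so its results must be consumed; capping
# max_chunk_bytes keeps each request under AWS's 10 MB limit.
for ok, info in parallel_bulk(es, actions, max_chunk_bytes=10485760):
    if not ok:
        print(info)
```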
Not without paying for beefier servers. It’s limited to 10 MB on anything smaller than an xlarge.