
Elasticsearch bulk() chunk size is too large for AWS Elasticsearch Service.

See original GitHub issue

If your site has a lot of search index data and you run the update_index management command using the Elasticsearch 5 backend through AWS’s Elasticsearch Service, the indexing operation can crash with the following error:

  File "celery/app/trace.py", line 374, in trace_task
    R = retval = fun(*args, **kwargs)
  File "celery/app/trace.py", line 629, in __protected_call__
    return self.run(*args, **kwargs)
  File "<app>/celery.py", line 49, in wrapper
    f(*args, **kwargs)
  File "core/tasks.py", line 20, in rebuild_search_index
    call_command('update_index')
  File "django/core/management/__init__.py", line 130, in call_command
    return command.execute(*args, **defaults)
  File "django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "wagtail/wagtailsearch/management/commands/update_index.py", line 120, in handle
    self.update_backend(backend_name, schema_only=options.get('schema_only', False))
  File "wagtail/wagtailsearch/management/commands/update_index.py", line 87, in update_backend
    index.add_items(model, chunk)
  File "wagtail/wagtailsearch/backends/elasticsearch.py", line 580, in add_items
    bulk(self.es, actions)
  File "elasticsearch/helpers/__init__.py", line 194, in bulk
    for ok, item in streaming_bulk(client, actions, **kwargs):
  File "elasticsearch/helpers/__init__.py", line 162, in streaming_bulk
    for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
  File "elasticsearch/helpers/__init__.py", line 91, in _process_bulk_chunk
    raise e
  File "elasticsearch/helpers/__init__.py", line 87, in _process_bulk_chunk
    resp = client.bulk('\n'.join(bulk_actions) + '\n', **kwargs)
  File "elasticsearch/client/utils.py", line 71, in _wrapped
    return func(*args, params=params, **kwargs)
  File "elasticsearch/client/__init__.py", line 1096, in bulk
    doc_type, '_bulk'), params=params, body=self._bulk_body(body))
  File "elasticsearch/transport.py", line 318, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "elasticsearch/connection/http_urllib3.py", line 127, in perform_request
    self._raise_error(response.status, raw_data)
  File "elasticsearch/connection/base.py", line 122, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
TransportError: TransportError(413, '{"Message":"Request size exceeded 10485760 bytes"}')

It gets thrown from the elasticsearch.helpers.bulk() call in wagtail.wagtailsearch.backends.elasticsearch.ElasticSearchIndex.add_items().

This happens because the default value of the max_chunk_bytes keyword arg for bulk() is 100 megabytes, but Amazon only allows 10 megabytes per chunk (as evidenced by that error message).

I’ve subclassed ElasticSearchIndex and overridden add_items() to make it explicitly set max_chunk_bytes to 10 MB, and I can confirm that this fixes the crash.
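
For reference, here is a minimal sketch of that override, assuming the Wagtail 1.x wagtailsearch Elasticsearch backend shown in the traceback. The body mirrors what the stock add_items() does (build one bulk action per item, then call bulk()); the only functional change is the max_chunk_bytes argument. The internal names (mapping_class, get_document_type, get_document_id, get_document) follow that Wagtail version and may need adjusting for others, and how you point the backend at this subclass depends on the Wagtail version, so that part is left out:

    from elasticsearch.helpers import bulk
    from wagtail.wagtailsearch.backends.elasticsearch import ElasticSearchIndex

    # AWS's managed Elasticsearch rejects bulk requests over 10485760 bytes.
    AWS_MAX_CHUNK_BYTES = 10 * 1024 * 1024


    class AWSCompatibleSearchIndex(ElasticSearchIndex):
        def add_items(self, model, items):
            # Build the same bulk actions the parent implementation builds
            # (attribute and method names mirror the Wagtail version in the
            # traceback and may differ in other releases).
            mapping = self.mapping_class(model)
            doc_type = mapping.get_document_type()

            actions = []
            for item in items:
                action = {
                    '_index': self.name,
                    '_type': doc_type,
                    '_id': mapping.get_document_id(item),
                }
                action.update(mapping.get_document(item))
                actions.append(action)

            # The only change: cap each HTTP request at 10 MB so AWS accepts it.
            bulk(self.es, actions, max_chunk_bytes=AWS_MAX_CHUNK_BYTES)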

I’d make a PR for this, but I’m not really sure what the best way to implement it would be. Simply changing the code to set this keyword arg seems unlikely to be globally viable, since other Elasticsearch services presumably allow larger chunks. Perhaps an additional config setting in WAGTAILSEARCH_BACKENDS would work? I’m not really sure how those function, though, since I’ve never used them.
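
To make the suggestion concrete, such a setting could look something like the sketch below. The MAX_CHUNK_BYTES key is purely hypothetical and does not exist in Wagtail today; BACKEND, URLS and INDEX are existing keys, and the endpoint is a placeholder:

    # Hypothetical configuration sketch only: 'MAX_CHUNK_BYTES' is an invented
    # option that the backend would have to read and forward to bulk().
    WAGTAILSEARCH_BACKENDS = {
        'default': {
            'BACKEND': 'wagtail.wagtailsearch.backends.elasticsearch',
            'URLS': ['https://search-example.us-east-1.es.amazonaws.com'],  # example endpoint
            'INDEX': 'wagtail',
            'MAX_CHUNK_BYTES': 10 * 1024 * 1024,  # proposed option, not implemented
        },
    }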

What do you guys think?

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 2
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

3 reactions
stuart-clark-45 commented, May 21, 2019

I expect there is something similar to this for other languages as well, but if you are using Python you can use the following to get around the problem:

https://elasticsearch-py.readthedocs.io/en/master/helpers.html#elasticsearch.helpers.parallel_bulk

You just need to set max_chunk_bytes=10485760.
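
A minimal sketch of that workaround, assuming you are driving the elasticsearch-py helpers yourself rather than going through Wagtail; the endpoint, index name and sample documents are placeholders:

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import parallel_bulk

    es = Elasticsearch(['https://search-example.us-east-1.es.amazonaws.com'])

    # Placeholder documents; replace with whatever you are indexing.
    documents = [{'title': 'Example page %d' % i} for i in range(1000)]

    actions = (
        {'_index': 'my-index', '_type': 'doc', '_id': i, '_source': doc}
        for i, doc in enumerate(documents)
    )

    # parallel_bulk() is a lazy generator, so iterate over it to actually send
    # the requests; max_chunk_bytes keeps each one under AWS's 10 MB limit.
    for ok, result in parallel_bulk(es, actions, max_chunk_bytes=10485760):
        if not ok:
            print(result)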

3 reactions
coredumperror commented, May 18, 2018

Not without paying for beefier servers. It’s limited to 10 MB on anything smaller than an xlarge.


Top Results From Across the Web

  • Resolve search or write rejections in Amazon OpenSearch ...
    This bulk queue error occurs when the number of requests to the cluster exceeds the bulk queue size (threadpool.bulk.queue_size). A bulk queue ...

  • What is the ideal bulk size formula in ElasticSearch?
    Read ES bulk API doc carefully: ... When performance starts to drop off, your batch size is too big. A good place to...

  • Fix common cluster issues | Elasticsearch Guide [8.5] | Elastic
    The most common causes of high CPU usage and their solutions. High JVM memory pressure: High JVM memory usage can degrade cluster performance...

  • Elasticsearch Performance Tuning to Handle Traffic Spikes
    We use Amazon Elasticsearch Service because it's fully managed, ... too large a bulk size will put the cluster under memory pressure.

  • Helpers — Elasticsearch 7.16.0 documentation
    All bulk helpers accept an instance of Elasticsearch class and an ... docs in one chunk sent to es (default: 500); max_chunk_bytes –...
