
Elasticsearch bulk() chunk size is too large for AWS Elasticsearch Service.

See original GitHub issue

If your site has a lot of search index data and you run the update_index management command using the Elasticsearch 5 backend through AWS’s Elasticsearch Service, the indexing operation can crash with the following error:

  File "celery/app/trace.py", line 374, in trace_task
    R = retval = fun(*args, **kwargs)
  File "celery/app/trace.py", line 629, in __protected_call__
    return self.run(*args, **kwargs)
  File "<app>/celery.py", line 49, in wrapper
    f(*args, **kwargs)
  File "core/tasks.py", line 20, in rebuild_search_index
    call_command('update_index')
  File "django/core/management/__init__.py", line 130, in call_command
    return command.execute(*args, **defaults)
  File "django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "wagtail/wagtailsearch/management/commands/update_index.py", line 120, in handle
    self.update_backend(backend_name, schema_only=options.get('schema_only', False))
  File "wagtail/wagtailsearch/management/commands/update_index.py", line 87, in update_backend
    index.add_items(model, chunk)
  File "wagtail/wagtailsearch/backends/elasticsearch.py", line 580, in add_items
    bulk(self.es, actions)
  File "elasticsearch/helpers/__init__.py", line 194, in bulk
    for ok, item in streaming_bulk(client, actions, **kwargs):
  File "elasticsearch/helpers/__init__.py", line 162, in streaming_bulk
    for result in _process_bulk_chunk(client, bulk_actions, raise_on_exception, raise_on_error, **kwargs):
  File "elasticsearch/helpers/__init__.py", line 91, in _process_bulk_chunk
    raise e
  File "elasticsearch/helpers/__init__.py", line 87, in _process_bulk_chunk
    resp = client.bulk('\n'.join(bulk_actions) + '\n', **kwargs)
  File "elasticsearch/client/utils.py", line 71, in _wrapped
    return func(*args, params=params, **kwargs)
  File "elasticsearch/client/__init__.py", line 1096, in bulk
    doc_type, '_bulk'), params=params, body=self._bulk_body(body))
  File "elasticsearch/transport.py", line 318, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "elasticsearch/connection/http_urllib3.py", line 127, in perform_request
    self._raise_error(response.status, raw_data)
  File "elasticsearch/connection/base.py", line 122, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
TransportError: TransportError(413, '{"Message":"Request size exceeded 10485760 bytes"}')

It gets thrown from the elasticsearch.helpers.bulk() call in wagtail.wagtailsearch.backends.elasticsearch.ElasticSearchIndex.add_items().

This happens because the default value of the max_chunk_bytes keyword arg for bulk() is 100 megabytes, but Amazon only allows 10 megabytes per chunk (as evidenced by that error message).

I’ve subclassed ElasticSearchIndex and overridden add_items() to make it explicitly set max_chunk_bytes to 10 MB, and I can confirm that this fixes the crash.
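
For reference, here is a minimal sketch of that override, assuming the Wagtail 1.x wagtailsearch Elasticsearch backend shown in the traceback. The body mirrors what the stock add_items() does (build one bulk action per item, then call bulk()); the only functional change is the max_chunk_bytes argument. The internal names (mapping_class, get_document_type, get_document_id, get_document) follow that Wagtail version and may need adjusting for others, and how you point the backend at this subclass depends on the Wagtail version, so that part is left out:

    from elasticsearch.helpers import bulk
    from wagtail.wagtailsearch.backends.elasticsearch import ElasticSearchIndex

    # AWS's managed Elasticsearch rejects bulk requests over 10485760 bytes.
    AWS_MAX_CHUNK_BYTES = 10 * 1024 * 1024


    class AWSCompatibleSearchIndex(ElasticSearchIndex):
        def add_items(self, model, items):
            # Build the same bulk actions the parent implementation builds
            # (attribute and method names mirror the Wagtail version in the
            # traceback and may differ in other releases).
            mapping = self.mapping_class(model)
            doc_type = mapping.get_document_type()

            actions = []
            for item in items:
                action = {
                    '_index': self.name,
                    '_type': doc_type,
                    '_id': mapping.get_document_id(item),
                }
                action.update(mapping.get_document(item))
                actions.append(action)

            # The only change: cap each HTTP request at 10 MB so AWS accepts it.
            bulk(self.es, actions, max_chunk_bytes=AWS_MAX_CHUNK_BYTES)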

I’d make a PR for this, but I’m not really sure what the best way to implement it would be. Simply changing the code to set this keyword arg seems unlikely to be globally viable, since other Elasticsearch services presumably allow larger chunks. Perhaps an additional config setting in WAGTAILSEARCH_BACKENDS would work? I’m not really sure how those function, though, since I’ve never used them.
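
To make the suggestion concrete, such a setting could look something like the sketch below. The MAX_CHUNK_BYTES key is purely hypothetical and does not exist in Wagtail today; BACKEND, URLS and INDEX are existing keys, and the endpoint is a placeholder:

    # Hypothetical configuration sketch only: 'MAX_CHUNK_BYTES' is an invented
    # option that the backend would have to read and forward to bulk().
    WAGTAILSEARCH_BACKENDS = {
        'default': {
            'BACKEND': 'wagtail.wagtailsearch.backends.elasticsearch',
            'URLS': ['https://search-example.us-east-1.es.amazonaws.com'],  # example endpoint
            'INDEX': 'wagtail',
            'MAX_CHUNK_BYTES': 10 * 1024 * 1024,  # proposed option, not implemented
        },
    }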

What do you guys think?

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 2
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

3 reactions
stuart-clark-45 commented, May 21, 2019

I expect there is something similar to this for other languages as well, but if you are using Python you can use the following to get around the problem:

https://elasticsearch-py.readthedocs.io/en/master/helpers.html#elasticsearch.helpers.parallel_bulk

You just need to set max_chunk_bytes=10485760.
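
A minimal sketch of that workaround, assuming you are driving the elasticsearch-py helpers yourself rather than going through Wagtail; the endpoint, index name and sample documents are placeholders:

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import parallel_bulk

    es = Elasticsearch(['https://search-example.us-east-1.es.amazonaws.com'])

    # Placeholder documents; replace with whatever you are indexing.
    documents = [{'title': 'Example page %d' % i} for i in range(1000)]

    actions = (
        {'_index': 'my-index', '_type': 'doc', '_id': i, '_source': doc}
        for i, doc in enumerate(documents)
    )

    # parallel_bulk() is a lazy generator, so iterate over it to actually send
    # the requests; max_chunk_bytes keeps each one under AWS's 10 MB limit.
    for ok, result in parallel_bulk(es, actions, max_chunk_bytes=10485760):
        if not ok:
            print(result)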

3 reactions
coredumperror commented, May 18, 2018

Not without paying for beefier servers. It’s limited to 10 MB on anything smaller than an xlarge.


Top Results From Across the Web

  • Resolve search or write rejections in Amazon OpenSearch ...
    This bulk queue error occurs when the number of requests to the cluster exceeds the bulk queue size (threadpool.bulk.queue_size). A bulk queue ...

  • What is the ideal bulk size formula in ElasticSearch?
    Read ES bulk API doc carefully: ... When performance starts to drop off, your batch size is too big. A good place to...

  • Fix common cluster issues | Elasticsearch Guide [8.5] | Elastic
    The most common causes of high CPU usage and their solutions. High JVM memory pressure: High JVM memory usage can degrade cluster performance...

  • Elasticsearch Performance Tuning to Handle Traffic Spikes
    We use Amazon Elasticsearch Service because it's fully managed, ... too large a bulk size will put the cluster under memory pressure.

  • Helpers — Elasticsearch 7.16.0 documentation
    All bulk helpers accept an instance of Elasticsearch class and an ... docs in one chunk sent to es (default: 500); max_chunk_bytes –...
