UnicodeDecodeError if using AWS ElasticSearch cluster
Issue Summary
Doesn’t work with the Django settings provided in the documentation. With alternative settings (from Wagtail issue 2776), I get a UnicodeDecodeError if any special character is present in Page.title (or any other indexable Page field, I believe).
When using the AWS ElasticSearch service, the first issue I came across is that the default Django settings from the Wagtail documentation don’t work if the request is signed with AWS4Auth (recommended by AWS).
I found a solution that works in Wagtail issue 2776, in the very last comment by @justinoue.
But assuming I have a character like the one in u'C\xe9line' (which is u'Céline') in Page.title, update_index breaks with UnicodeDecodeError.
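For reference, a minimal sketch of how that title relates to the bytes that show up later in the traceback. It assumes the request body is UTF-8 encoded, which matches the 'C\xc3\xa9line' value shown in the Notes; the variable names are illustrative only.

```python
# The title as a unicode string: 'é' is the single code point U+00E9.
title = u'C\xe9line'

# Assumption: the body sent to Elasticsearch is UTF-8 encoded.
# In UTF-8, U+00E9 becomes the two bytes 0xC3 0xA9.
encoded = title.encode('utf-8')

print(repr(encoded))
print(len(title), len(encoded))  # 6 code points, 7 bytes
```

The byte 0xc3 named in the error message is the first of the two bytes that encode 'é'.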
Steps to Reproduce
- Create an IAM user and give it the AmazonESFullAccess (AWS managed policy) permission.
- Spin up an AWS ElasticSearch instance with an Access Policy allowing access to the IAM user above.
- Have any of the Page instances' title set to u'C\xc3\xa9line'.
- Use settings from Wagtail documentation:
    from elasticsearch import RequestsHttpConnection
    from requests_aws4auth import AWS4Auth

    AWS_ELASTICSEARCH_ACCESS_KEY_ID = '<YOUR_ACCESS_ID>'
    AWS_ELASTICSEARCH_SECRET_ACCESS_KEY = '<YOUR_SECRET_ACCESS_KEY>'

    WAGTAILSEARCH_BACKENDS = {
        'default': {
            'BACKEND': 'wagtail.wagtailsearch.backends.elasticsearch2',
            'URLS': ['https://<AWS_ES_ENDPOINT>'],
            'INDEX': 'wagtail',
            'TIMEOUT': 5,
            'OPTIONS': {
                'connection_class': RequestsHttpConnection,
            },
            'INDEX_SETTINGS': {},
            'port': 443,
            'use_ssl': True,
            'verify_certs': True,
            'http_auth': AWS4Auth(AWS_ELASTICSEARCH_ACCESS_KEY_ID, AWS_ELASTICSEARCH_SECRET_ACCESS_KEY, '<AWS_ES_REGION>', 'es'),
        }
    }
- The settings above are going to break: the requests are not signed, so AWS thinks you’re trying to do this as an anonymous user. The error looks like this:
elasticsearch.exceptions.AuthorizationException: TransportError(403, u'{"Message":"User: anonymous is not authorized to perform: es:ESHttpDelete on resource: am-dev"}')
- Change the settings to match the example from Wagtail issue 2776.
    from elasticsearch import RequestsHttpConnection
    from requests_aws4auth import AWS4Auth

    AWS_ELASTICSEARCH_ACCESS_KEY_ID = '<YOUR_ACCESS_ID>'
    AWS_ELASTICSEARCH_SECRET_ACCESS_KEY = '<YOUR_SECRET_ACCESS_KEY>'

    WAGTAILSEARCH_BACKENDS = {
        'default': {
            'BACKEND': 'wagtail.wagtailsearch.backends.elasticsearch2',
            'INDEX': 'wagtail',
            'TIMEOUT': 5,
            'HOSTS': [{
                'host': '<AWS_ES_ENDPOINT>',
                'port': 443,
                'use_ssl': True,
                'verify_certs': True,
                'http_auth': AWS4Auth(AWS_ELASTICSEARCH_ACCESS_KEY_ID, AWS_ELASTICSEARCH_SECRET_ACCESS_KEY, '<AWS_ES_REGION>', 'es'),
            }],
            'connection_class': RequestsHttpConnection
        }
    }
- Run ./manage.py update_index --verbosity=3
- You will get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 17345: ordinal not in range(128)
Full traceback is here.
Notes
- Position 17345, where it breaks, is the index of the special character:
In [1]: message_body[17344:17351]
Out[1]: 'C\xc3\xa9line'
or specifically:
In [2]: message_body[17345]
Out[2]: '\xc3'
- The settings in step 6 are not documented. Note how they use HOSTS instead of URLS etc.
- Interestingly, the issue ultimately traces down to Python’s httplib.py:880, where it tries to do msg += message_body, where msg is unicode but message_body is a byte string containing our special character.
See the gist with the variables msg and message_body. Note that some parameters in msg (AWS_ES_ENDPOINT, SHA, AWS_ELASTICSEARCH_ACCESS_KEY_ID, SIGNATURE) are hidden since they contain sensitive information.
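The failing concatenation can be sketched in isolation. On Python 2, adding a unicode string and a byte string implicitly decodes the byte string with the ascii codec; the sketch below performs that decode explicitly so it behaves the same on Python 3. The values of msg and message_body are hypothetical stand-ins for the ones in the gist, and first_non_ascii and concat_like_python2 are illustrative helpers, not part of httplib, Wagtail or elasticsearch-py.

```python
def first_non_ascii(body):
    """Return the offset of the first byte >= 0x80, or -1 if body is pure ASCII."""
    for offset, byte in enumerate(bytearray(body)):
        if byte >= 0x80:
            return offset
    return -1


def concat_like_python2(msg, message_body):
    # Python 2 evaluates `unicode + str` by decoding the str operand with the
    # ascii codec first; doing it explicitly reproduces the same error.
    return msg + message_body.decode('ascii')


# Hypothetical stand-ins: unicode headers and a UTF-8 encoded request body.
msg = u'POST /_bulk HTTP/1.1\r\n\r\n'
message_body = u'{"title": "C\xe9line"}'.encode('utf-8')

try:
    concat_like_python2(msg, message_body)
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start  # offset of the byte the ascii codec rejected
    print(exc)

# The reported position is the offset of the first non-ASCII byte in the body.
print(failed_at == first_non_ascii(message_body))
```

This only illustrates why the error position points at the special character. It does not show where msg becomes unicode in the first place, which is the part that would need fixing.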
Technical details
- Python version: Python 2.7.13
- Django version: Django 1.9.6
- Wagtail version: Wagtail 1.8.1
Top GitHub Comments
p.s. @StriveForBest thank you for your exemplary issue report!
@gasman, makes sense. I will try a similar setup with Elasticsearch 5 and latest Wagtail soon and will open another issue if necessary. I think it’s important to maintain Python 2 support.