UTF-8 serialization in Python 2
I’m running Python 2.7, connecting to the AWS Elasticsearch service using the 2.2 release of elasticsearch-py. To connect I use requests_aws4auth, as recommended in your docs (thanks for integrating that!).
When writing to Elasticsearch (bulk upload, creating a doc, etc.) I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2102: ordinal not in range(128)
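The byte 0xe2 is the first byte of many UTF-8 multi-byte sequences (curly quotes, dashes, and similar), so this error shows up whenever such bytes get pushed through Python 2's implicit ascii codec. A stdlib-only illustration of the failure mode (runs on Python 3 as well, where the decode just has to be explicit):

```python
# A curly-quoted string encoded as UTF-8 contains bytes >= 0x80,
# including 0xe2, the lead byte of the quote characters.
raw = u"they said \u201chello\u201d".encode("utf-8")

print(raw.decode("utf-8"))   # decodes fine with the right codec

try:
    # This is what Python 2 attempts implicitly when mixing
    # byte strings and unicode strings.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print(exc)  # 'ascii' codec can't decode byte 0xe2 ...
```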
I know the library was changed a couple of months ago to stop hiding unicode errors, but that change coincides with the introduction of support for requests_aws4auth: both first appear in the 2.2 release, so downgrading is not an option for me. Handling unicode conversion myself, piecemeal, is non-trivial, and upgrading to Python 3 is not yet an option given other dependencies.
Therefore, I have come up with a workaround for now using a custom serializer that essentially reverts the unicode change made earlier to this codebase:
    import json

    from elasticsearch import Elasticsearch, RequestsHttpConnection, serializer, compat, exceptions


    class JSONSerializerPython2(serializer.JSONSerializer):
        """Override the elasticsearch library serializer so that json.dumps
        escapes non-ASCII characters (ensure_ascii=True) during serialization.

        See the original at:
        https://github.com/elastic/elasticsearch-py/blob/master/elasticsearch/serializer.py#L42
        How ensure_ascii escapes unicode characters so they can be sent across
        the wire as ASCII is described here:
        https://docs.python.org/2/library/json.html#basic-usage
        """

        def dumps(self, data):
            # don't serialize strings
            if isinstance(data, compat.string_types):
                return data
            try:
                return json.dumps(data, default=self.default, ensure_ascii=True)
            except (ValueError, TypeError) as e:
                raise exceptions.SerializationError(data, e)
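The effect of ensure_ascii=True is easy to see directly: every non-ASCII character in the serialized payload becomes a \uXXXX escape, so the output contains only ASCII bytes and round-trips losslessly. A stdlib-only illustration:

```python
import json

doc = {"note": u"caf\u00e9 on the r\u00e9sum\u00e9"}  # non-ASCII text

escaped = json.dumps(doc, ensure_ascii=True)
print(escaped)

# Every character in the serialized payload is plain ASCII ...
assert all(ord(ch) < 128 for ch in escaped)
# ... and the original unicode text still round-trips.
assert json.loads(escaped) == doc
```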
I hope this helps anyone else that runs into this issue.
Issue Analytics
- Created: 8 years ago
- Reactions: 19
- Comments: 6 (3 by maintainers)
Top GitHub Comments
@vinitkumar you should never need to fork the repo - you can pass in your own serializer very simply:
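A minimal sketch of what the maintainer is describing, assuming the serializer keyword argument accepted by the Elasticsearch client in elasticsearch-py 2.x, and a hypothetical AWS host (substitute your own endpoint and an AWS4Auth instance from requests_aws4auth). The import is guarded so the sketch degrades gracefully where a compatible elasticsearch-py is not installed:

```python
import json

try:
    from elasticsearch import Elasticsearch, RequestsHttpConnection, serializer

    class JSONSerializerPython2(serializer.JSONSerializer):
        """Trimmed version of the serializer from the issue body:
        force ensure_ascii=True so the payload is pure ASCII."""

        def dumps(self, data):
            if isinstance(data, str):  # use compat.string_types on Python 2
                return data
            return json.dumps(data, default=self.default, ensure_ascii=True)

    # Hypothetical AWS Elasticsearch endpoint; pass your AWS4Auth
    # instance as http_auth when connecting for real.
    es = Elasticsearch(
        hosts=[{"host": "search-mydomain.us-east-1.es.amazonaws.com", "port": 443}],
        connection_class=RequestsHttpConnection,
        use_ssl=True,
        serializer=JSONSerializerPython2(),
    )
except ImportError:
    es = None  # elasticsearch-py with the 2.x-7.x API not installed here
```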
@LucasBerbesson you’d have to convert everything to unicode before passing the data into super(), otherwise you’d get the same issue.