UTF-8 serialization in Python 2
I’m running Python 2.7, connecting to the AWS Elasticsearch service using the 2.2 release of elasticsearch-py. To connect I use requests_aws4auth, as recommended in your docs (thanks for integrating that!).
When writing to Elasticsearch (bulk upload, creating a doc, etc.) I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2102: ordinal not in range(128)
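The byte 0xe2 is the first byte of many UTF-8 multi-byte sequences (curly quotes, dashes, and similar), so this error shows up whenever such bytes get pushed through Python 2's implicit ascii codec. A stdlib-only illustration of the failure mode (runs on Python 3 as well, where the decode just has to be explicit):

```python
# A curly-quoted string encoded as UTF-8 contains bytes >= 0x80,
# including 0xe2, the lead byte of the quote characters.
raw = u"they said \u201chello\u201d".encode("utf-8")

print(raw.decode("utf-8"))   # decodes fine with the right codec

try:
    # This is what Python 2 attempts implicitly when mixing
    # byte strings and unicode strings.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print(exc)  # 'ascii' codec can't decode byte 0xe2 ...
```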
I know the library was changed a couple of months ago to stop hiding unicode errors, but that change coincides with the introduction of support for requests_aws4auth: both first appear in the 2.2 release, so downgrading is not an option for me. Handling unicode conversion myself, piecemeal, is non-trivial, and upgrading to Python 3 is not yet an option given other dependencies.
Therefore, I have come up with a workaround for now using a custom serializer that essentially reverts the unicode change made earlier to this codebase:
    import json

    from elasticsearch import Elasticsearch, RequestsHttpConnection, serializer, compat, exceptions


    class JSONSerializerPython2(serializer.JSONSerializer):
        """Override the elasticsearch library serializer so that json.dumps
        escapes non-ASCII characters (ensure_ascii=True) during serialization.

        See the original at:
        https://github.com/elastic/elasticsearch-py/blob/master/elasticsearch/serializer.py#L42
        How ensure_ascii escapes unicode characters so they can be sent across
        the wire as ASCII is described here:
        https://docs.python.org/2/library/json.html#basic-usage
        """

        def dumps(self, data):
            # don't serialize strings
            if isinstance(data, compat.string_types):
                return data
            try:
                return json.dumps(data, default=self.default, ensure_ascii=True)
            except (ValueError, TypeError) as e:
                raise exceptions.SerializationError(data, e)
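The effect of ensure_ascii=True is easy to see directly: every non-ASCII character in the serialized payload becomes a \uXXXX escape, so the output contains only ASCII bytes and round-trips losslessly. A stdlib-only illustration:

```python
import json

doc = {"note": u"caf\u00e9 on the r\u00e9sum\u00e9"}  # non-ASCII text

escaped = json.dumps(doc, ensure_ascii=True)
print(escaped)

# Every character in the serialized payload is plain ASCII ...
assert all(ord(ch) < 128 for ch in escaped)
# ... and the original unicode text still round-trips.
assert json.loads(escaped) == doc
```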
I hope this helps anyone else that runs into this issue.
Issue Analytics
- Created: 8 years ago
- Reactions: 19
- Comments: 6 (3 by maintainers)
Top GitHub Comments
@vinitkumar you should never need to fork the repo - you can pass in your own serializer very simply:
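A minimal sketch of what the maintainer is describing, assuming the serializer keyword argument accepted by the Elasticsearch client in elasticsearch-py 2.x, and a hypothetical AWS host (substitute your own endpoint and an AWS4Auth instance from requests_aws4auth). The import is guarded so the sketch degrades gracefully where a compatible elasticsearch-py is not installed:

```python
import json

try:
    from elasticsearch import Elasticsearch, RequestsHttpConnection, serializer

    class JSONSerializerPython2(serializer.JSONSerializer):
        """Trimmed version of the serializer from the issue body:
        force ensure_ascii=True so the payload is pure ASCII."""

        def dumps(self, data):
            if isinstance(data, str):  # use compat.string_types on Python 2
                return data
            return json.dumps(data, default=self.default, ensure_ascii=True)

    # Hypothetical AWS Elasticsearch endpoint; pass your AWS4Auth
    # instance as http_auth when connecting for real.
    es = Elasticsearch(
        hosts=[{"host": "search-mydomain.us-east-1.es.amazonaws.com", "port": 443}],
        connection_class=RequestsHttpConnection,
        use_ssl=True,
        serializer=JSONSerializerPython2(),
    )
except ImportError:
    es = None  # elasticsearch-py with the 2.x-7.x API not installed here
```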
@LucasBerbesson you’d have to convert everything to unicode before passing the data into super(), otherwise you’d get the same issue.