Performance regression in JSONSerializer.default() in v7.15.0

Elasticsearch version (bin/elasticsearch --version): 7.13.1

elasticsearch-py version (elasticsearch.__versionstr__): 7.15.0

Description of the problem including expected versus actual behavior: We’re a bit puzzled by this, but we’ve narrowed it down to our upgrade of the Elasticsearch package from 7.14.0 to 7.15.0.

After upgrading, the latency of all our Elasticsearch calls rose significantly, to the point where it cascaded across all of our systems. It took us a few days to figure out what was going on, but in the end we simply downgraded the package to 7.14.0. As the graph here shows (taken from Elastic APM), the downgrade clearly had an effect:

[Image: Screenshot 2021-10-07 at 14.11.49]

Steps to reproduce: Honestly, I’m not sure how to describe a reproducible flow. I’ll be very happy to help debug this against our production environment in any way possible, if someone has ideas on what to look for.

To others who might be experiencing the same issue: We’ve pinned our projects to 7.14.0 for now, as this effectively solves the issue for us.

Issue Analytics

Created 2 years ago
Comments: 5 (3 by maintainers)

Top GitHub Comments

sethmlarson commented, Oct 8, 2021

I think I’ve figured out the issue. It’s caused by https://github.com/elastic/elasticsearch-py/pull/1716, which moved the attempt to serialize Pandas and Numpy types into JSONSerializer.default(). Since it looks like you’re relying on .default() for all keys, that’s where the problem lies: each call attempts to import numpy and pandas and fails, once per key, which is quite a bit of overhead.

Based on this, and on a guess at what your keys look like (assuming either Promises or str), I wonder if changing your .default() implementation would fix the issue for you immediately:

    def default(self, data):
        if isinstance(data, Promise):
            return force_str(data)
        if isinstance(data, str):
            return data
        return super().default(data)

Either way I’ll fix this issue and it’ll go out in a patch release of 7.15.
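
As a rough illustration of why a failed import on every call adds up, the sketch below compares retrying a missing import on each call with attempting it once and remembering the outcome. This is not the elasticsearch-py code: the module name and all function names are hypothetical stand-ins, and the timings will vary by machine.

```python
import importlib
import timeit

# Hypothetical module name standing in for an optional dependency
# (such as numpy or pandas) that is not installed.
MISSING = "surely_not_installed_module_xyz"


def default_import_every_call(value):
    """Retry the optional import on every call, then give up on the value."""
    try:
        importlib.import_module(MISSING)
    except ImportError:
        pass  # dependency missing; fall through to the error below
    raise TypeError(f"Unable to serialize {value!r}")


optional_module = None
import_attempted = False


def load_optional():
    """Attempt the optional import once and remember the outcome."""
    global optional_module, import_attempted
    if not import_attempted:
        import_attempted = True
        try:
            optional_module = importlib.import_module(MISSING)
        except ImportError:
            optional_module = None
    return optional_module


def default_import_cached(value):
    """Same behavior as above, but the failed import is paid for only once."""
    load_optional()
    raise TypeError(f"Unable to serialize {value!r}")


def call_ignoring_type_error(func, value):
    """Mimic a caller that catches TypeError for every key it encodes."""
    try:
        func(value)
    except TypeError:
        pass


if __name__ == "__main__":
    n = 2000
    slow = timeit.timeit(
        lambda: call_ignoring_type_error(default_import_every_call, "key"), number=n
    )
    fast = timeit.timeit(
        lambda: call_ignoring_type_error(default_import_cached, "key"), number=n
    )
    print(f"import on every call: {slow:.4f}s, cached: {fast:.4f}s ({n} calls each)")
```

Python does not cache a failed import, so each retry searches the import path again; with one call per dictionary key, that cost is multiplied by the number of keys in every request body.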

HenrikOssipoff commented, Oct 8, 2021

@sethmlarson Thanks for getting back to me so quickly! Sorry for the late reply.

I’m not entirely sure how the HTTP Client logic works; we don’t explicitly define one ourselves, so I assume it selects one based on installed packages? I’ve attached a pip freeze - the project doesn’t use async:

amqp==5.0.6
anyio==3.3.2
argon2-cffi==21.1.0
asgiref==3.3.4
awesome-slugify==1.6.5
billiard==3.6.4.0
cachetools==4.2.2
celery==5.1.2
certifi==2020.12.5
cffi==1.14.6
chardet==4.0.0
charset-normalizer==2.0.6
click==7.1.2
click-didyoumean==0.0.3
click-plugins==1.1.1
click-repl==0.1.6
cool==3.1.24
coolshop-search-dsl==2.2.1
Django==3.2.8
django-healthz==0.0.5
django-redis==5.0.0
django-storages==1.11.1
djangorestframework==3.12.4
elastic-apm==6.5.0
elasticsearch==7.15.0
elasticsearch-dsl==7.4.0
google-api-core==2.1.0
google-auth==2.3.0
google-cloud-core==2.1.0
google-cloud-storage==1.42.3
google-crc32c==1.3.0
google-resumable-media==2.0.3
googleapis-common-protos==1.53.0
gunicorn==20.1.0
h11==0.12.0
h2==4.1.0
hpack==4.0.0
httpcore==0.13.7
httpx==0.19.0
hyperframe==6.0.1
idna==2.10
kombu==5.1.0
ldap3==2.5.2
limits==1.5.1
lxml==4.6.3
networkx==2.6.3
prompt-toolkit==3.0.18
protobuf==3.18.1
psycopg2==2.9.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.20
pylogbeat==2.0.0
pyodbc==4.0.32
pyparsing==2.4.7
python-dateutil==2.8.1
python-logstash-async==2.3.0
python-memcached==1.59
pytz==2021.3
redis==3.5.3
regex==2021.4.4
requests==2.25.1
rfc3986==1.5.0
rsa==4.7.2
sentry-sdk==1.4.3
six==1.15.0
sniffio==1.2.0
sqlparse==0.4.1
suds-jurko==0.6
Unidecode==0.4.21
urllib3==1.26.4
vine==5.0.0
wcwidth==0.2.5

There are no deprecation warnings as far as I can see, although we do get this warning: ElasticsearchWarning: The client is unable to verify that the server is Elasticsearch due security privileges on the server side

We get that on 7.14.0 as well, though.

Regarding the APIs, it’s a bunch of somewhat complex search() calls that contain a lot of aggregations.

One thing I’d like to mention is that we use a custom JSONSerializer to account for some weirdness on our end. I’ve included it here just in case:

from django.utils.encoding import force_str
from django.utils.functional import Promise
from elasticsearch.serializer import JSONSerializer


class CoolSearchJSONSerializer(JSONSerializer):
    def default(self, data):
        if isinstance(data, Promise):
            return force_str(data)
        return super().default(data)

    def force_key_encoding(self, data):
        if isinstance(data, dict):

            def yield_key_value(d):
                for key, value in d.items():
                    try:
                        yield self.default(key), self.force_key_encoding(value)
                    except TypeError:
                        yield key, self.force_key_encoding(value)

            return dict(yield_key_value(data))
        else:
            return data

    def dumps(self, data):
        return super().dumps(self.force_key_encoding(data))

We use it like so:

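import os

from elasticsearch_dsl import connections

# ES_HOSTS is assumed to be defined elsewhere (not shown in the original snippet).
connections.configure(
    default={
        "hosts": ES_HOSTS,
        "serializer": CoolSearchJSONSerializer(),
        "retry_on_timeout": True,
        "max_retries": 5,
        "http_auth": (os.getenv("ELASTICSEARCH_USER"), os.getenv("ELASTICSEARCH_PASSWORD")),
    }
)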

We can bisect the commits that went into 7.15.0 if need be, but I hope you’re able to see something I can’t. We’d have to put the code into our production system to get enough traffic to see the results, and the effect is only fully visible in the APM after a few hours, so it’d take a while to find the offending commit. Plus there’s the impact on our production systems, of course.

What’s really weird is that we use Elasticsearch across multiple projects, but this particular project is the only one where we’ve seen the issue. It is the only one of the projects with this massive amount of traffic, though, which might be the reason.

We’re very puzzled.
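
For the key-encoding step in the serializer above, one possible way to keep plain string keys away from .default() entirely is to coerce only the lazy keys and pass everything else through untouched. The following is a minimal, dependency-free sketch under that assumption: LazyText is a hypothetical stand-in for Django's Promise, coerce_key is an illustrative helper, and unlike the original force_key_encoding this version also descends into lists.

```python
import json


class LazyText:
    """Hypothetical stand-in for a lazily evaluated string (e.g. Django's Promise)."""

    def __init__(self, func):
        self._func = func

    def __str__(self):
        return self._func()


def coerce_key(key):
    """Turn lazy keys into plain strings; leave every other key untouched."""
    if isinstance(key, LazyText):
        return str(key)
    return key


def force_key_encoding(data):
    """Recursively rebuild containers so that lazy dict keys become str."""
    if isinstance(data, dict):
        return {coerce_key(k): force_key_encoding(v) for k, v in data.items()}
    if isinstance(data, (list, tuple)):
        # Assumption: nested lists may also contain dicts with lazy keys.
        return [force_key_encoding(item) for item in data]
    return data


if __name__ == "__main__":
    body = {
        LazyText(lambda: "name"): {"tags": [{LazyText(lambda: "id"): 1}]},
        "plain": 2,
    }
    print(json.dumps(force_key_encoding(body)))
```

Because ordinary keys never reach a fallback encoder here, the cost per key is a single isinstance check regardless of what the base serializer's default() does.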