Excessive memory usage on multithreading

I have been trying to debug a “memory leak” in my newly upgraded boto3 application. I am moving from the original boto 2.49.

My application starts a pool of 100 threads; every request is queued and dispatched to one of them. Memory usage over the lifetime of the application used to be about 1 GB, with peaks of 1.5 GB depending on the operation.

After the upgrade I added one boto3.Session per thread, and from this session I create multiple resources and clients that are reused throughout the code. In the previous code I had one boto connection of each kind per thread (I use several services such as S3, DynamoDB, SES, SQS, Mturk, SimpleDB), so it is pretty much the same setup.

Except that each boto3.Session alone increases memory usage immensely, and my application now runs at 3 GB of memory instead.

How do I know it is the boto3 Session, you ask? I created two demo programs with the same 100 threads; the only difference between them is that one uses boto3 and the other does not.

Program 1: https://pastebin.com/Urkh3TDU
Program 2: https://pastebin.com/eDWPcS8C (same thing, with the 5 lines regarding boto commented out)

Output program 1 (each print happens 5 seconds after the last one):

Process Memory: 39.4 MB
Process Memory: 261.7 MB
Process Memory: 518.7 MB
Process Memory: 788.2 MB
Process Memory: 944.5 MB
Process Memory: 940.1 MB
Process Memory: 944.4 MB
Process Memory: 948.7 MB
Process Memory: 959.1 MB
Process Memory: 957.4 MB
Process Memory: 958.0 MB
Process Memory: 959.5 MB

Now with plain multiple threads and no AWS access. Output program 2 (each print happens 5 seconds after the last one):

Process Memory: 23.5 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
Process Memory: 58.7 MB
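
A comparison like the one above can be approximated with a small harness. This is only a sketch, not the code from the pastebin links: `measure_held_memory` and `factory` are hypothetical names, and it uses the standard library's `tracemalloc`, which counts Python-level allocations rather than the process memory printed above, so absolute numbers will differ.

```python
import threading
import tracemalloc


def measure_held_memory(factory, n_threads=100):
    """Return the bytes of traced memory retained while each thread holds one object."""
    ready = threading.Barrier(n_threads + 1)  # workers + the measuring thread
    release = threading.Event()

    def worker():
        obj = factory()   # assumption: e.g. boto3.session.Session
        ready.wait()      # signal that this thread's object exists
        release.wait()    # keep the object alive until the measurement is done
        del obj

    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    ready.wait()          # every worker now holds its object
    after, _ = tracemalloc.get_traced_memory()
    release.set()
    for t in threads:
        t.join()
    tracemalloc.stop()
    return after - before


if __name__ == "__main__":
    # Stand-in factory; swap in boto3.session.Session to test the claim in this issue.
    held = measure_held_memory(lambda: bytearray(100_000), n_threads=10)
    print(f"held by 10 threads: {held / 1e6:.1f} MB")
```

Running it once with a trivial factory and once with the Session factory gives the same kind of A/B comparison as the two programs, under the caveat about what `tracemalloc` can see.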

The boto3 Session object alone is retaining roughly 10 MB per thread, about 1 GB in total. This is not acceptable for an object that should do little more than send requests to the AWS servers. It means the Session is keeping a lot of unwanted information.

You might wonder whether it is the resource, rather than the Session, that keeps the memory alive. If you move the resource creation inside the for loop, the program still hits 1 GB in exactly the same 15 to 20 seconds.

At first I tried forcing garbage collection of cyclic references, but it was futile: memory dropped by only a couple of megabytes.

I’ve seen people on the botocore project complaining about something similar (maybe not!), so it might be a shared issue. https://github.com/boto/botocore/issues/805
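
The per-thread figure can be checked against the two outputs above. A minimal worked calculation, taking the last reading of each run as the steady state (that choice is an assumption):

```python
# Steady-state readings taken from the two outputs above (MB).
with_boto3_mb = 959.5
baseline_mb = 58.7
threads = 100

# Memory attributable to boto3, spread evenly over the threads.
per_thread_mb = (with_boto3_mb - baseline_mb) / threads
print(f"{per_thread_mb:.1f} MB per thread")  # prints "9.0 MB per thread"
```

That is about 9 MB per thread, consistent with the rough 10 MB estimate.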
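
For context, `gc.collect()` only reclaims objects that are unreachable but kept alive by reference cycles. The sketch below shows that case with a hypothetical `Node` class; if a forced collection frees very little, as reported here, one plausible reading is that the memory is still referenced from somewhere rather than stranded in cycles.

```python
import gc


class Node:
    """Hypothetical object that references itself, forming a reference cycle."""

    def __init__(self):
        self.me = self


def count_cyclic_garbage(n=1000):
    """Create n unreachable cycles and return how many objects gc.collect() finds."""
    was_enabled = gc.isenabled()
    gc.collect()      # start from a clean state
    gc.disable()      # keep the automatic collector from running mid-loop
    try:
        for _ in range(n):
            Node()    # unreachable right away, but its cycle keeps it alive
        return gc.collect()  # number of unreachable objects found
    finally:
        if was_enabled:
            gc.enable()


if __name__ == "__main__":
    print("unreachable objects found:", count_cyclic_garbage())
```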

Issue Analytics

  • State: open
  • Created 5 years ago
  • Reactions: 14
  • Comments: 31 (3 by maintainers)

Top GitHub Comments

7 reactions
maybeshewill commented, Nov 1, 2018

Confirmed: simply creating a boto3.session in threads/async handlers leads to excessive memory usage that is not freed at all (gc.collect() doesn’t help either).

6 reactions
jbvsmo commented, Jul 17, 2020

@cschloer @longbowrocks I created this issue 2 years ago and the situation has not changed since. My solution at the time, which runs today on the hundreds of servers I have deployed, is exactly that: a local cache that I attach to the current thread object.

Below is the code I use (slightly edited) to replace the boto3 resource and client functions. It is thread safe, you do not need to create sessions explicitly, and your code doesn’t need to be aware that it is running inside a separate thread. You might need to do some cleanup to avoid open file warnings when terminating threads.

There are limitations to this and I offer no guarantees. Use with caution.

import json
import hashlib
import time
import threading
import boto3.session

DEFAULT_REGION = 'us-east-1'
KEY = None
SECRET = None


class AWSConnection(object):
    def __init__(self, function, name, **kw):
        assert function in ('resource', 'client')
        self._function = function
        self._name = name
        self._params = kw

        if not self._params:
            self._identifier = self._name
        else:
            self._identifier = self._name + hash_dict(self._params)

    def get_connection(self):
        thread = threading.current_thread()  # currentThread() is a deprecated alias

        if not hasattr(thread, '_aws_metadata_'):
            thread._aws_metadata_ = {
                'age': time.time(),
                'session': boto3.session.Session(),
                'resource': {},
                'client': {}
            }

        try:
            connection = thread._aws_metadata_[self._function][self._identifier]
        except KeyError:
            connection = create_connection_object(
                self._function, self._name, session=thread._aws_metadata_['session'], **self._params
            )
            thread._aws_metadata_[self._function][self._identifier] = connection

        return connection

    def __repr__(self):
        return 'AWS {0._function} <{0._name}> {0._params}'.format(self)

    def __getattr__(self, item):
        connection = self.get_connection()
        return getattr(connection, item)


def create_connection_object(function, name, session=None, region=None, **kw):
    assert function in ('resource', 'client')
    if session is None:
        session = boto3.session.Session()

    if region is None:
        region = DEFAULT_REGION

    key, secret = KEY, SECRET

    # Do not set these variables unless they were configured on parameters file
    # If they are not present, boto3 will try to load them from other means
    if key and secret:
        kw['aws_access_key_id'] = key
        kw['aws_secret_access_key'] = secret

    return getattr(session, function)(name, region_name=region, **kw)


def hash_dict(dictionary):
    """ This function will hash a dictionary based on JSON encoding, so changes in
        list order do matter and will affect result.
        Also this is a hex output, so not size optimized
    """
    json_string = json.dumps(dictionary, sort_keys=True, indent=None)
    return hashlib.sha1(json_string.encode('utf-8')).hexdigest()


def resource(name, **kw):
    return AWSConnection('resource', name, **kw)


def client(name, **kw):
    return AWSConnection('client', name, **kw)
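
The same per-thread caching idea can also be sketched with the standard library's `threading.local` instead of attributes on the thread object. This is an alternative illustration, not the code above: `get_cached` and `factory` are hypothetical names, and the boto3 call in the comment is an assumption about how it might be wired in.

```python
import threading

_local = threading.local()  # each thread sees its own attributes on this object


def get_cached(key, factory):
    """Return this thread's object for key, creating it once per thread."""
    cache = getattr(_local, "cache", None)
    if cache is None:
        cache = _local.cache = {}
    if key not in cache:
        # Assumption: with boto3 this could be
        # factory = lambda: boto3.session.Session().client("s3")
        cache[key] = factory()
    return cache[key]


if __name__ == "__main__":
    first = get_cached("s3", object)
    again = get_cached("s3", object)
    print("same object within a thread:", first is again)

    seen = []
    t = threading.Thread(target=lambda: seen.append(get_cached("s3", object)))
    t.start()
    t.join()
    print("distinct object in another thread:", seen[0] is not first)
```

Either way the cache lives as long as the thread does, so long-lived pool threads keep their connections while short-lived threads pay the creation cost each time.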