Python's elasticsearch bulk helper triggers OOM when inserting a large number of documents
See original GitHub issue

import time
import sys

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# Python 2: force the default string encoding to UTF-8
reload(sys)
sys.setdefaultencoding('utf-8')


def set_mapping(es, index_name="content_engine", doc_type_name="en"):
    my_mapping = {
        "en": {
            "properties": {
                "a": {
                    "type": "string"
                },
                "b": {
                    "type": "string"
                }
            }
        }
    }
    create_index = es.indices.create(index=index_name, body=my_mapping)
    mapping_index = es.indices.put_mapping(index=index_name, doc_type=doc_type_name, body=my_mapping)
    if create_index["acknowledged"] != True or mapping_index["acknowledged"] != True:
        print "Index creation failed..."


def set_data(es, input_file, index_name="content_engine", doc_type_name="en"):
    i = 0
    count = 0
    ACTIONS = []
    for line in open(input_file):
        fields = line.replace("\r\n", "").replace("\n", "").split("----")
        if len(fields) == 2:
            a, b = fields
        else:
            continue
        action = {
            "_index": index_name,
            "_type": doc_type_name,
            "_source": {
                "a": a,
                "b": b,
            }
        }
        i += 1
        ACTIONS.append(action)
        # flush a batch once 500,000 actions have been buffered
        if i == 500000:
            success, _ = bulk(es, ACTIONS, index=index_name, raise_on_error=True)
            count += success
            i = 0
            ACTIONS = []
    # flush whatever is left in the buffer
    success, _ = bulk(es, ACTIONS, index=index_name, raise_on_error=True)
    count += success
    print("insert %s lines" % count)


if __name__ == '__main__':
    es = Elasticsearch(hosts=["127.0.0.1:9200"], timeout=5000)
    set_mapping(es)
    set_data(es, sys.argv[1])
My machine has 24 GB of memory and the data is about 5 GB. After inserting roughly 16 million (1600w) lines, memory starts soaring and eventually triggers an OOM; it looks like a memory leak.
Issue Analytics
- Created 7 years ago
- Comments:5 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
You are accumulating 500000 documents in memory, which is quite a lot. What I would recommend is to just use a generator: the bulk helper already does the chunking of the data and sends it into Elasticsearch, so there is no need for you to duplicate that logic with the ACTIONS list. Hope this helps
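
For illustration, a minimal sketch of the generator-based approach described above, assuming the same "----"-delimited input format and reusing the es client and sys.argv[1] from the script in the report (gen_actions is just an illustrative name, not part of elasticsearch-py):

def gen_actions(input_file, index_name="content_engine", doc_type_name="en"):
    # Yield one action at a time instead of accumulating a 500,000-element list;
    # the bulk() helper batches the stream itself (500 actions per chunk by default).
    for line in open(input_file):
        fields = line.rstrip("\r\n").split("----")
        if len(fields) != 2:
            continue
        a, b = fields
        yield {
            "_index": index_name,
            "_type": doc_type_name,
            "_source": {"a": a, "b": b},
        }

success, _ = bulk(es, gen_actions(sys.argv[1]), raise_on_error=True)
print("inserted %s documents" % success)

Because only one chunk of actions is held in memory at a time, peak memory no longer grows with the size of the input file.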
I understand, but there is nontrivial overhead that Python adds to each document, plus the bulk helper adds more on top of that when it creates the batches for Elasticsearch. Either way, there is absolutely no benefit in batching the documents yourself, and it consumes memory for no effect. Please change your code to use a generator to verify whether it is really elasticsearch-py that causes the memory use, thank you!
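
If you want per-document feedback while verifying the memory behaviour, a sketch using the streaming_bulk helper could look like the following; it pulls actions from the generator lazily, and the chunk_size of 1000 is only an illustrative value (gen_actions is the hypothetical generator from the sketch above):

from elasticsearch.helpers import streaming_bulk

ok_count = 0
# streaming_bulk yields an (ok, result) tuple per document and only ever
# holds one chunk of actions in memory at a time.
for ok, result in streaming_bulk(es, gen_actions(sys.argv[1]), chunk_size=1000, raise_on_error=True):
    if ok:
        ok_count += 1
print("inserted %s documents" % ok_count)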