
When inserting a large amount of data, Python's Elasticsearch bulk API triggers OOM

See original GitHub issue
import sys
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# Python 2-era setup: make UTF-8 the default encoding for str/unicode conversion
reload(sys)
sys.setdefaultencoding('utf-8')

def set_mapping(es, index_name = "content_engine", doc_type_name = "en"):
    my_mapping = {
        "en": {
            "properties": {
                "a": {
                    "type": "string"
                 },
                 "b": {
                    "type": "string"
                 }
            }
        }
    }
    create_index = es.indices.create(index = index_name,body = my_mapping)
    mapping_index = es.indices.put_mapping(index = index_name, doc_type = doc_type_name, body = my_mapping)
    if create_index["acknowledged"] != True or mapping_index["acknowledged"] != True:
        print "Index creation failed..."

def set_data(es, input_file, index_name = "content_engine", doc_type_name="en"):
    i = 0
    count = 0
    ACTIONS = []
    for line in open(input_file):
        fields = line.replace("\r\n", "").replace("\n", "").split("----")
        if len(fields) == 2:
            a, b = fields
        else:
            continue
        action = {
            "_index": index_name,
            "_type": doc_type_name,
            "_source": {
                  "a": a,
                  "b": b, 
            }
        }
        i += 1
        ACTIONS.append(action)
        # flush a bulk request once 500,000 actions have been buffered in memory
        if i == 500000:
            success, _ = bulk(es, ACTIONS, index = index_name, raise_on_error = True)
            count += success
            i = 0
            ACTIONS = []

    success, _ = bulk(es, ACTIONS, index = index_name, raise_on_error=True)
    count += success
    print("insert %s lines" % count)


if __name__ == '__main__':
    es = Elasticsearch(hosts=["127.0.0.1:9200"], timeout=5000)
    set_mapping(es)
    set_data(es,sys.argv[1])

My machine has 24 GB of memory and the data is about 5 GB. After inserting roughly 16 million lines, memory usage began soaring and finally triggered an OOM. Where is the memory leaking?

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

35 reactions
honzakral commented, Dec 22, 2016

You are accumulating 500,000 documents in memory, which is quite a lot. What I would recommend is to just use a generator:

def set_data(input_file, index_name = "content_engine", doc_type_name="en"):
    for line in open(input_file):
        fields = line.replace("\r\n", "").replace("\n", "").split("----")
        if len(fields) == 2:
            a, b = fields
        else:
            continue
        yield {
            "_index": index_name,
            "_type": doc_type_name,
            "_source": {
                  "a": a,
                  "b": b, 
            }
        }
def load(es, input_file, **kwargs):
    success, _ = bulk(es, set_data(input_file, **kwargs))

The bulk helper already does the chunking of the data and sends it to Elasticsearch, so there is no need for you to duplicate that logic with ACTIONS.

Hope this helps
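For reference, here is a minimal sketch of the generator-based approach wired up end to end, reusing the set_data() generator from the comment above. It swaps bulk() for streaming_bulk() from the same elasticsearch.helpers module so the batching stays entirely inside the library; the chunk_size and max_chunk_bytes values are illustrative assumptions rather than recommendations from the thread, and load_streaming is a made-up name.

import sys
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

def load_streaming(es, input_file, **kwargs):
    count = 0
    # streaming_bulk batches the generator's output itself, so memory stays
    # bounded by roughly one chunk of actions at a time.
    for ok, _ in streaming_bulk(es, set_data(input_file, **kwargs),
                                chunk_size=1000,
                                max_chunk_bytes=10 * 1024 * 1024):
        count += ok
    print("insert %s lines" % count)

if __name__ == '__main__':
    es = Elasticsearch(hosts=["127.0.0.1:9200"], timeout=5000)
    load_streaming(es, sys.argv[1])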

4 reactions
honzakral commented, Dec 22, 2016

I understand, but there is nontrivial overhead that Python adds to each document, and the bulk helper adds more on top of that when it creates the batches for Elasticsearch. Either way, there is absolutely no benefit in batching the documents yourself; it only consumes memory to no effect.

Please change your code to use a generator to verify whether it is really elasticsearch-py that causes the memory use, thank you!
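One way to do that check without even talking to an Elasticsearch cluster is the small, self-contained sketch below. It assumes Python 3 (for tracemalloc) and uses made-up field values, so the absolute numbers are only indicative; it compares the peak allocation of buffering 500,000 action dicts in a list, as the original script does, against consuming the same actions lazily from a generator.

import tracemalloc

def make_action(i):
    return {"_index": "content_engine", "_type": "en",
            "_source": {"a": "field-a-%d" % i, "b": "field-b-%d" % i}}

def buffered(n):
    # What the original script does: hold every action in a list first.
    return len([make_action(i) for i in range(n)])

def generated(n):
    # What the generator approach does: only one action is alive at a time.
    return sum(1 for _ in (make_action(i) for i in range(n)))

for label, fn in (("list buffer", buffered), ("generator", generated)):
    tracemalloc.start()
    fn(500000)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print("%s peak: %.1f MB" % (label, peak / (1024.0 * 1024.0)))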

Read more comments on GitHub >

Top Results From Across the Web

when insert Large number of data Python's elasticsearch bulk ...
My machine's memory is 24 GB and the data's size is 5 GB; when inserting about 16 million lines, memory began soaring and finally triggered OOM. Where is the memory leak?
Read more >
How to Index Elasticsearch Documents with the Bulk API in ...
In this tutorial, we will demonstrate how to index Elasticsearch documents from a CSV file with simple Python code.
Read more >
Using Asyncio with Elasticsearch
Helper for the bulk() api that provides a more human friendly interface - it consumes an iterator of actions and sends them to...
Read more >
Elasticsearch DSL Documentation - Read the Docs
It exposes the whole range of the DSL from Python, either directly using defined classes or queryset-like expressions. It also provides an ......
Read more >
Connection types and options for ETL in AWS Glue
For exporting a large table, we recommend switching your DynamoDB table to ... and print the number of partitions from an AWS Glue...
Read more >
