Python's elasticsearch bulk helper triggers OOM when inserting a large number of documents
See original GitHub issue

import time
import sys

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

# Python 2: force the default string encoding to UTF-8
reload(sys)
sys.setdefaultencoding('utf-8')


def set_mapping(es, index_name="content_engine", doc_type_name="en"):
    my_mapping = {
        "en": {
            "properties": {
                "a": {
                    "type": "string"
                },
                "b": {
                    "type": "string"
                }
            }
        }
    }
    create_index = es.indices.create(index=index_name, body=my_mapping)
    mapping_index = es.indices.put_mapping(index=index_name, doc_type=doc_type_name, body=my_mapping)
    if create_index["acknowledged"] != True or mapping_index["acknowledged"] != True:
        print "Index creation failed..."


def set_data(es, input_file, index_name="content_engine", doc_type_name="en"):
    i = 0
    count = 0
    ACTIONS = []
    for line in open(input_file):
        fields = line.replace("\r\n", "").replace("\n", "").split("----")
        if len(fields) == 2:
            a, b = fields
        else:
            continue
        action = {
            "_index": index_name,
            "_type": doc_type_name,
            "_source": {
                "a": a,
                "b": b,
            }
        }
        i += 1
        ACTIONS.append(action)
        # flush a batch once 500,000 actions have been buffered
        if i == 500000:
            success, _ = bulk(es, ACTIONS, index=index_name, raise_on_error=True)
            count += success
            i = 0
            ACTIONS = []
    # flush whatever is left in the buffer
    success, _ = bulk(es, ACTIONS, index=index_name, raise_on_error=True)
    count += success
    print("insert %s lines" % count)


if __name__ == '__main__':
    es = Elasticsearch(hosts=["127.0.0.1:9200"], timeout=5000)
    set_mapping(es)
    set_data(es, sys.argv[1])
My machine has 24 GB of memory and the data is about 5 GB. After inserting roughly 16 million (1600w) lines, memory starts soaring and eventually triggers an OOM; it looks like a memory leak.
Issue Analytics
- Created 7 years ago
- Comments:5 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
You are accumulating 500000 documents in memory, which is quite a lot. What I would recommend is to just use a generator: the bulk helper already does the chunking of the data and sends it into Elasticsearch, so there is no need for you to duplicate that logic with the ACTIONS list. Hope this helps
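
For illustration, a minimal sketch of the generator-based approach described above, assuming the same "----"-delimited input format and reusing the es client and sys.argv[1] from the script in the report (gen_actions is just an illustrative name, not part of elasticsearch-py):

def gen_actions(input_file, index_name="content_engine", doc_type_name="en"):
    # Yield one action at a time instead of accumulating a 500,000-element list;
    # the bulk() helper batches the stream itself (500 actions per chunk by default).
    for line in open(input_file):
        fields = line.rstrip("\r\n").split("----")
        if len(fields) != 2:
            continue
        a, b = fields
        yield {
            "_index": index_name,
            "_type": doc_type_name,
            "_source": {"a": a, "b": b},
        }

success, _ = bulk(es, gen_actions(sys.argv[1]), raise_on_error=True)
print("inserted %s documents" % success)

Because only one chunk of actions is held in memory at a time, peak memory no longer grows with the size of the input file.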
I understand, but there is nontrivial overhead that Python adds to each document, plus the bulk helper adds more on top of that when it creates the batches for Elasticsearch. Either way, there is absolutely no benefit in batching the documents yourself, and it consumes memory for no effect. Please change your code to use a generator to verify whether it is really elasticsearch-py that causes the memory use, thank you!
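
If you want per-document feedback while verifying the memory behaviour, a sketch using the streaming_bulk helper could look like the following; it pulls actions from the generator lazily, and the chunk_size of 1000 is only an illustrative value (gen_actions is the hypothetical generator from the sketch above):

from elasticsearch.helpers import streaming_bulk

ok_count = 0
# streaming_bulk yields an (ok, result) tuple per document and only ever
# holds one chunk of actions in memory at a time.
for ok, result in streaming_bulk(es, gen_actions(sys.argv[1]), chunk_size=1000, raise_on_error=True):
    if ok:
        ok_count += 1
print("inserted %s documents" % ok_count)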