ElasticSearch Document Store: add a parallel bulk write method
I have been playing with the library today, and I need to insert around 1 million documents into Elasticsearch.
I found that we have a `write_documents` method that supports bulk writing, and it takes around 25 minutes to insert my documents into Elasticsearch.
By searching, I found that the Elasticsearch Python client has a `parallel_bulk` helper that does bulk inserts in parallel, and I was thinking it could be a good idea to have it implemented here.
The function is similar to `write_documents`, but it hands the actions to `parallel_bulk` instead of the plain `bulk` helper.
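For reference, here is a minimal standalone sketch of how `elasticsearch.helpers.parallel_bulk` is consumed (the client URL, index name, and documents are placeholders I made up for illustration):

```python
from collections import deque

from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

client = Elasticsearch("http://localhost:9200")

# Hypothetical actions: one "index" action per document
actions = (
    {"_op_type": "index", "_index": "my-index", "_id": str(i), "content": f"doc {i}"}
    for i in range(100_000)
)

# parallel_bulk returns a lazy generator of (ok, info) tuples and only sends
# requests as it is consumed, so we exhaust it; deque(..., maxlen=0) discards
# the per-item results without keeping them in memory.
deque(parallel_bulk(client, actions, thread_count=8, chunk_size=10_000), maxlen=0)
```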
Here is what I am using:
```python
# Required imports (at module level):
#   from collections import deque
#   from typing import Any, Dict, List, Optional, Union
#   import numpy as np
#   from tqdm import tqdm
#   from elasticsearch.helpers import parallel_bulk

def write_documents_parallel(
    self,
    documents: Union[List[dict], List[Document]],
    index: Optional[str] = None,
    batch_size: int = 10_000,
    duplicate_documents: Optional[str] = None,
    headers: Optional[Dict[str, str]] = None,
):
    """
    Indexes documents for later queries in Elasticsearch. A variant of write_documents
    that uses the parallel_bulk helper.

    Behaviour if a document with the same ID already exists in Elasticsearch:

    a) (Default) Throw Elastic's standard error message for duplicate IDs.
    b) If `self.update_existing_documents=True` for DocumentStore: Overwrite existing documents.
       (This is only relevant if you pass your own ID when initializing a `Document`.
       If you don't set custom IDs for your Documents or just pass a list of dictionaries here,
       they will automatically get UUIDs assigned. See the `Document` class for details.)

    :param documents: A list of Python dictionaries or a list of Haystack Document objects.
                      For documents as dictionaries, the format is {"content": "<the-actual-text>"}.
                      Optionally: Include metadata via {"content": "<the-actual-text>",
                      "meta": {"name": "<some-document-name>", "author": "somebody", ...}}.
                      It can be used for filtering and is accessible in the responses of the Finder.
                      Advanced: If you are using your own Elasticsearch mapping, change the key names
                      in the dictionary to what you have set for self.content_field and self.name_field.
    :param index: Elasticsearch index where the documents should be indexed. If not supplied, self.index will be used.
    :param batch_size: Number of documents that are passed to Elasticsearch's bulk function at a time.
    :param duplicate_documents: Handle duplicate documents based on parameter options ('skip', 'overwrite', 'fail').
                                skip: Ignore duplicate documents.
                                overwrite: Update any existing documents with the same ID when adding documents.
                                fail: Raise an error if the ID of the document being added already exists.
    :param headers: Custom HTTP headers to pass to the Elasticsearch client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}).
                    Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
    :raises DuplicateDocumentError: Exception triggered on duplicate documents.
    :return: None
    """
    if index and not self.client.indices.exists(index=index, headers=headers):
        self._create_document_index(index, headers=headers)

    if index is None:
        index = self.index

    duplicate_documents = duplicate_documents or self.duplicate_documents
    assert (
        duplicate_documents in self.duplicate_documents_options
    ), f"duplicate_documents parameter must be {', '.join(self.duplicate_documents_options)}"

    field_map = self._create_document_field_map()
    document_objects = [Document.from_dict(d, field_map=field_map) if isinstance(d, dict) else d for d in documents]
    document_objects = self._handle_duplicate_documents(
        documents=document_objects, index=index, duplicate_documents=duplicate_documents, headers=headers
    )

    documents_to_index = []
    for doc in tqdm(document_objects):
        _doc = {
            "_op_type": "index" if duplicate_documents == "overwrite" else "create",
            "_index": index,
            **doc.to_dict(field_map=field_map),
        }  # type: Dict[str, Any]

        # cast embedding type as ES cannot deal with np.array
        if _doc.get(self.embedding_field) is not None:
            if isinstance(_doc[self.embedding_field], np.ndarray):
                _doc[self.embedding_field] = _doc[self.embedding_field].tolist()

        # rename id for elastic
        _doc["_id"] = str(_doc.pop("id"))

        # don't index query score and empty fields
        _ = _doc.pop("score", None)
        _doc = {k: v for k, v in _doc.items() if v is not None}

        # In order to have a flat structure in elastic + similar behaviour to the other DocumentStores,
        # we "unnest" all values within "meta"
        if "meta" in _doc.keys():
            for k, v in _doc["meta"].items():
                _doc[k] = v
            _doc.pop("meta")
        documents_to_index.append(_doc)

        # Pass batch_size number of documents to parallel_bulk at a time
        if len(documents_to_index) % batch_size == 0:
            # parallel_bulk is lazy: exhausting the generator with deque(..., maxlen=0)
            # sends the requests and discards the per-item results
            pb_ = parallel_bulk(
                self.client,
                documents_to_index,
                chunk_size=batch_size,  # was hard-coded to 10000; use batch_size for consistency
                thread_count=8,
                queue_size=8,
                refresh=self.refresh_type,
                headers=headers,
            )
            deque(pb_, maxlen=0)
            documents_to_index = []

    if documents_to_index:
        pb_ = parallel_bulk(
            self.client,
            documents_to_index,
            chunk_size=batch_size,
            thread_count=8,
            queue_size=8,
            refresh=self.refresh_type,
            headers=headers,
        )
        deque(pb_, maxlen=0)
```
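If this lands, usage would look something like the sketch below (the host, index name, and documents are placeholders; `write_documents_parallel` is the proposed method, not an existing one):

```python
from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host="localhost", index="documents")

# One million small demo documents
docs = [{"content": f"document number {i}", "meta": {"source": "demo"}} for i in range(1_000_000)]

# Proposed call: batch_size controls how many documents are handed
# to parallel_bulk per flush
document_store.write_documents_parallel(docs, batch_size=10_000)
```

One design note: `parallel_bulk` raises on the first failing action by default (`raise_on_error=True`), so exhausting the generator with `deque(..., maxlen=0)` still surfaces indexing errors while discarding the per-item success results.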
Let me know what you guys think.
If it seems okay, I can find some time over the weekend to submit a PR and support the hard work you are doing, but I may need someone to help with testing this.
Top GitHub Comments
Thanks @sjrl for the clarification.
Closing this issue now, feel free to re-open if this is still an issue. 😃