
ElasticSearch Document Store add Bulk parallel write method

See original GitHub issue

I have been playing with the library today, and I need to insert around 1 million documents into Elasticsearch.

I found that we have a `write_documents` method that supports bulk writing, and it takes around 25 minutes to insert my documents into Elasticsearch.

By searching, I found that the Elasticsearch Python client has a `parallel_bulk` helper that does bulk inserts in parallel. I was thinking it could be a good idea to have it implemented here.
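For context, here is a minimal standalone sketch of that helper (the connection URL, index name, and documents are just placeholders, not part of the proposal):

from collections import deque

from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

client = Elasticsearch("http://localhost:9200")  # placeholder URL

# A generator of bulk actions; "my-index" and the documents are made up
actions = (
    {"_op_type": "index", "_index": "my-index", "_source": {"content": f"doc {i}"}}
    for i in range(100_000)
)

# parallel_bulk returns a lazy generator of (ok, info) result tuples;
# draining it with deque(..., maxlen=0) is what actually drives the worker threads
deque(
    parallel_bulk(client, actions, chunk_size=10_000, thread_count=8, queue_size=8),
    maxlen=0,
)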

The function is similar to `write_documents`, but it uses the `parallel_bulk` helper instead of the plain `bulk` call.

Here is what I am using:

# Module-level imports needed by this method:
from collections import deque
from typing import Any, Dict, List, Optional, Union

import numpy as np
from tqdm import tqdm
from elasticsearch.helpers import parallel_bulk

from haystack.schema import Document


def write_documents_parallel(
        self,
        documents: Union[List[dict], List[Document]],
        index: Optional[str] = None,
        batch_size: int = 10_000,
        duplicate_documents: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
    ):
        """
        Indexes documents for later queries in Elasticsearch. This is a variant of `write_documents` that uses the `parallel_bulk` helper.

        Behaviour if a document with the same ID already exists in ElasticSearch:
        a) (Default) Throw Elastic's standard error message for duplicate IDs.
        b) If `self.update_existing_documents=True` for DocumentStore: Overwrite existing documents.
        (This is only relevant if you pass your own ID when initializing a `Document`.
        If you don't set custom IDs for your Documents or just pass a list of dictionaries here,
        they will automatically get UUIDs assigned. See the `Document` class for details)

        :param documents: a list of Python dictionaries or a list of Haystack Document objects.
                          For documents as dictionaries, the format is {"content": "<the-actual-text>"}.
                          Optionally: Include metadata via {"content": "<the-actual-text>",
                          "meta": {"name": "<some-document-name>", "author": "somebody", ...}}
                          It can be used for filtering and is accessible in the responses of the Finder.
                          Advanced: If you are using your own Elasticsearch mapping, the key names in the dictionary
                          should be changed to what you have set for self.content_field and self.name_field.
        :param index: Elasticsearch index where the documents should be indexed. If not supplied, self.index will be used.
        :param batch_size: Number of documents that are passed to Elasticsearch's bulk function at a time.
        :param duplicate_documents: Handle duplicate documents based on parameter options.
                                    Parameter options: ('skip', 'overwrite', 'fail')
                                    skip: Ignore duplicate documents
                                    overwrite: Update any existing documents with the same ID when adding documents.
                                    fail: an error is raised if the document ID of the document being added already
                                    exists.
        :param headers: Custom HTTP headers to pass to elasticsearch client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='})
                Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
        :raises DuplicateDocumentError: Exception raised on duplicate documents when `duplicate_documents='fail'`
        :return: None
        """

        if index and not self.client.indices.exists(index=index, headers=headers):
            self._create_document_index(index, headers=headers)

        if index is None:
            index = self.index
        duplicate_documents = duplicate_documents or self.duplicate_documents
        assert (
            duplicate_documents in self.duplicate_documents_options
        ), f"duplicate_documents parameter must be {', '.join(self.duplicate_documents_options)}"

        field_map = self._create_document_field_map()
        document_objects = [Document.from_dict(d, field_map=field_map) if isinstance(d, dict) else d for d in documents]
        document_objects = self._handle_duplicate_documents(
            documents=document_objects, index=index, duplicate_documents=duplicate_documents, headers=headers
        )
        documents_to_index = []
        for doc in tqdm(document_objects):
            _doc = {
                "_op_type": "index" if duplicate_documents == "overwrite" else "create",
                "_index": index,
                **doc.to_dict(field_map=field_map),  # reuse the field_map computed above
            }  # type: Dict[str, Any]

            # cast embedding type, as Elasticsearch cannot deal with np.ndarray
            if isinstance(_doc.get(self.embedding_field), np.ndarray):
                _doc[self.embedding_field] = _doc[self.embedding_field].tolist()

            # rename id for elastic
            _doc["_id"] = str(_doc.pop("id"))

            # don't index query score and empty fields
            _ = _doc.pop("score", None)
            _doc = {k: v for k, v in _doc.items() if v is not None}

            # In order to have a flat structure in Elasticsearch + similar behaviour to the other DocumentStores,
            # we "unnest" all values within "meta"
            if "meta" in _doc.keys():
                for k, v in _doc["meta"].items():
                    _doc[k] = v
                _doc.pop("meta")
            documents_to_index.append(_doc)

            # Pass batch_size number of documents to parallel_bulk at a time
            if len(documents_to_index) % batch_size == 0:
                pb_ = parallel_bulk(
                    self.client,
                    documents_to_index,
                    chunk_size=batch_size,  # was hardcoded to 10_000; reuse the batch_size parameter
                    thread_count=8,
                    queue_size=8,
                    refresh=self.refresh_type,
                    headers=headers,
                )
                # parallel_bulk returns a lazy generator; draining it with
                # deque(..., maxlen=0) is what actually sends the requests
                deque(pb_, maxlen=0)
                documents_to_index = []

        if documents_to_index:
            # flush the final, possibly partial batch
            pb_ = parallel_bulk(
                self.client,
                documents_to_index,
                chunk_size=batch_size,
                thread_count=8,
                queue_size=8,
                refresh=self.refresh_type,
                headers=headers,
            )
            deque(pb_, maxlen=0)
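For illustration, here is how the method could be called once added to the document store class (the host, index name, and document contents are assumptions for the example):

from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host="localhost", index="document")
docs = [{"content": f"document number {i}"} for i in range(1_000_000)]

# With batch_size=10_000, this flushes 100 batches through parallel_bulk
document_store.write_documents_parallel(docs, batch_size=10_000)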

Let me know what you guys think.

If it seems okay, I can find some time over the weekend to submit a PR and support the hard work you are doing. But I may need someone to help with testing this.
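As a starting point for that testing, here is a rough sketch of how the batching logic could be unit-tested by mocking out `parallel_bulk` (the fixture and the patch target are assumptions; the patch target supposes `parallel_bulk` gets imported in `haystack.document_stores.elasticsearch`):

from unittest.mock import patch

def test_write_documents_parallel_batches(document_store):
    # `document_store` is a hypothetical fixture providing an
    # ElasticsearchDocumentStore with the method above patched in
    docs = [{"content": f"doc {i}"} for i in range(25)]
    with patch("haystack.document_stores.elasticsearch.parallel_bulk") as mock_pb:
        mock_pb.return_value = iter([])  # parallel_bulk yields (ok, info) tuples
        document_store.write_documents_parallel(docs, batch_size=10)
    # 25 docs with batch_size=10 -> two full batches + one final flush
    assert mock_pb.call_count == 3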

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

1 reaction
Squishy-33 commented, Jul 22, 2022

Thanks @sjrl for the clarification.

0 reactions
bogdankostic commented, Nov 3, 2022

Closing this issue now, feel free to re-open if this is still an issue. 😃

Read more comments on GitHub >

