
parallel searching in multi-gpu setting using faiss


I noticed that add_faiss_index supports assigning multiple GPUs, but I am still confused about how it works.

Does the search_batch function automatically parallelize the input queries across the different GPUs? https://github.com/huggingface/datasets/blob/d76599bdd4d186b2e7c4f468b05766016055a0a5/src/datasets/search.py#L360
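To make the question concrete, here is a minimal sketch of what "parallelizing the input queries" over several copies of an index would look like: the query batch is split into chunks, each chunk is searched against its own replica, and the per-chunk results are concatenated in order. This is only a conceptual illustration in pure Python. The names brute_force_search, split_evenly and replicated_search are hypothetical, threads stand in for GPUs, and nothing here is taken from the datasets or faiss source.

```python
from concurrent.futures import ThreadPoolExecutor


def brute_force_search(index_vectors, queries, k):
    """Return, for each query, the ids of its k nearest index vectors (squared L2)."""
    results = []
    for query in queries:
        scored = [
            (sum((a - b) ** 2 for a, b in zip(query, vector)), idx)
            for idx, vector in enumerate(index_vectors)
        ]
        scored.sort()
        results.append([idx for _, idx in scored[:k]])
    return results


def split_evenly(items, n_parts):
    """Split a list into n_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for part in range(n_parts):
        size = base + (1 if part < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def replicated_search(index_vectors, queries, k, n_replicas):
    """Search one query chunk per (hypothetical) index replica and merge in order."""
    chunks = split_evenly(queries, n_replicas)
    with ThreadPoolExecutor(max_workers=n_replicas) as pool:
        # Each worker stands in for one device holding a full copy of the index.
        partial_results = list(
            pool.map(lambda chunk: brute_force_search(index_vectors, chunk, k), chunks)
        )
    merged = []
    for part in partial_results:
        merged.extend(part)
    return merged
```

If queries are split this way, the merged result should match a single-device search exactly, so any benefit would show up only as reduced wall-clock time, not as different neighbors.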

Issue Analytics

State:
Created a year ago
Comments:25 (25 by maintainers)

Top GitHub Comments

1 reaction

xwwwwww commented, Aug 2, 2022

Have you tried passing gpu=-1 and checking whether there is a speedup?

Yes, there is a speedup when using the GPU compared with the CPU.

0 reactions

xwwwwww commented, Aug 27, 2022

Here is a runnable script. Multi-GPU searching still does not work in my experiments.

import os
from tqdm import tqdm
import numpy as np
import datasets
from datasets import Dataset

class DPRSelector:

    def __init__(self, source, target, index_name, gpu=None):
        self.source = source
        self.target = target
        self.index_name = index_name

        cache_path = 'embedding.faiss'

        if not os.path.exists(cache_path):
            self.source.add_faiss_index(
                column="embedding",
                index_name=index_name,
                device=gpu,
            )
            self.source.save_faiss_index(index_name, cache_path)
        else:
            self.source.load_faiss_index(
                index_name,
                cache_path,
                device=gpu
            )
        print('index built!')

    def build_dataset(self, top_k, batch_size):
        print('start search')

        for start in tqdm(range(0, len(self.target), batch_size)):
            # The last batch may be shorter than batch_size.
            end = min(start + batch_size, len(self.target))
            batched_queries = self.target[start:end]

            # Stack the per-row embeddings into one (batch, dim) array.
            batched_query_embeddings = np.stack(batched_queries['embedding'], axis=0)
            search_res = self.source.get_nearest_examples_batch(
                self.index_name,
                batched_query_embeddings,
                k=top_k
            )

        print('finish search')


def get_pseudo_dataset():
    pseudo_dict = {"embedding": np.zeros((1000000, 768), dtype=np.float32)}
    print('generate pseudo data')

    dataset = Dataset.from_dict(pseudo_dict)
    def list_to_array(data):
        return {"embedding": [np.array(vector, dtype=np.float32) for vector in data["embedding"]]} 
    dataset.set_transform(list_to_array, columns='embedding', output_all_columns=True)

    print('build dataset')
    return dataset



if __name__=="__main__":

    np.random.seed(42)


    source_dataset = get_pseudo_dataset()
    target_dataset = get_pseudo_dataset()

    gpu = [0,1,2,3,4,5,6,7]
    selector = DPRSelector(source_dataset, target_dataset, "embedding",  gpu=gpu)

    selector.build_dataset(top_k=20, batch_size=32)

By the way, have you run this toy example and replicated my experiment results? I think it is a more direct way to figure this out 😃
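One way to compare configurations on the toy example is to time the same search loop under different device lists and batch sizes. The helper below is a small sketch for that. measure_throughput is a hypothetical name, and search_fn is assumed to be any callable that takes one batch of queries, for example a wrapper around the get_nearest_examples_batch call in the script above.

```python
import time


def measure_throughput(search_fn, queries, batch_size, warmup_batches=1):
    """Return queries per second for search_fn called on consecutive batches."""
    batches = [queries[i:i + batch_size] for i in range(0, len(queries), batch_size)]

    # Untimed warm-up calls, so one-time setup cost is not counted.
    for batch in batches[:warmup_batches]:
        search_fn(batch)

    start = time.perf_counter()
    for batch in batches:
        search_fn(batch)
    elapsed = time.perf_counter() - start

    return len(queries) / elapsed if elapsed > 0 else float("inf")
```

Running it once with gpu=[0] and once with the full device list, and again with a batch size well above 32, would show whether throughput depends on the number of devices at all. With only 32 queries per call, each of eight devices may have too little work for a difference to be visible, though that is a guess and not something established in this thread.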
