Batch cos_sim for community_detection?

I’ve been experimenting with the community_detection method, but I quickly run into OOM errors when the set of embeddings is too large.

Since it uses cos_sim to compute all pairwise embedding similarities, do you think it would make sense to offer an option for batching? I expect there will be other bottlenecks when iterating over the entries, but at least it would complete on larger sets of embeddings.
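
To see why the full similarity matrix runs out of memory while a batched one does not, a rough back-of-the-envelope estimate helps. This is only a sketch: the corpus size and batch size are made-up numbers, the helper name similarity_matrix_bytes is hypothetical, and float32 scores (4 bytes each) are assumed.

```python
import math


def similarity_matrix_bytes(n_rows, n_cols, bytes_per_value=4):
    """Bytes needed to hold an n_rows x n_cols score matrix (float32 assumed)."""
    return n_rows * n_cols * bytes_per_value


n_embeddings = 100_000  # hypothetical corpus size
batch_size = 1_000      # hypothetical number of rows scored per batch

# All pairs at once: n x n scores
full = similarity_matrix_bytes(n_embeddings, n_embeddings)
# One batch at a time: batch_size x n scores
batched = similarity_matrix_bytes(batch_size, n_embeddings)

print(f"full matrix: {full / 1e9:.1f} GB")
print(f"one batch:   {batched / 1e9:.1f} GB")
print(f"batches:     {math.ceil(n_embeddings / batch_size)}")
```

Under these assumptions the peak memory for the scores drops by a factor of n_embeddings / batch_size, at the cost of looping over the batches.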

Issue Analytics

Created 2 years ago
Comments: 13 (3 by maintainers)

Top GitHub Comments

nreimers commented, Dec 13, 2021

Yes. When the models return PyTorch tensors (convert_to_tensor=True), you can move them to the CPU like this:

my_embeddings = my_embeddings.to('cpu')

The computation will then be done on the CPU.
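
As a minimal sketch of what that looks like end to end, the snippet below uses a random tensor as a stand-in for real model output (an assumption for illustration only) and checks where the resulting scores live. It uses only standard PyTorch calls.

```python
import torch

# Stand-in for embeddings returned by a model; random values for illustration only
my_embeddings = torch.randn(8, 16)

# If a GPU is present, pretend the model returned the embeddings there
if torch.cuda.is_available():
    my_embeddings = my_embeddings.to('cuda')

# Move the embeddings to the CPU; later operations on them run on the CPU
my_embeddings = my_embeddings.to('cpu')

# Pairwise dot-product scores, computed in main memory rather than GPU memory
scores = torch.mm(my_embeddings, my_embeddings.t())
print(scores.device)
```

This trades GPU speed for the usually larger main memory, so it helps with GPU OOM errors but does not shrink the size of the score matrix itself.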

nreimers commented, Jun 23, 2021

Hi @mmaybeno, @yjernite has created this batched version. Sadly, I have not yet had time to review and test it, but I hope to do so soon and integrate it into sentence-transformers.

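import math
import torch
from tqdm import tqdm
def community_detection(embeddings, threshold=0.75, min_community_size=1, init_max_size=1000):
    # Assumes the rows of `embeddings` are L2-normalized, so that the dot
    # product computed by torch.mm equals the cosine similarity
    # topk cannot return more entries per row than there are embeddings
    k = min(init_max_size, len(embeddings))
    # Create the accumulators with the same dtype/device as the embeddings
    top_val = embeddings.new_empty((0, k))
    top_idx = torch.empty((0, k), dtype=torch.long, device=embeddings.device)
    # Compute cosine similarity scores, init_max_size rows at a time
    n_batches = math.ceil(len(embeddings) / init_max_size)
    print("computing scores")
    for b in tqdm(range(n_batches)):
        cos_scores = torch.mm(embeddings[b*init_max_size:(b+1)*init_max_size], embeddings.t())
        # Keep only the k highest scores (and their indices) for each row
        top_val_large, top_idx_large = cos_scores.topk(k=k, dim=-1, largest=True)
        top_val = torch.cat([top_val, top_val_large], dim=0)
        top_idx = torch.cat([top_idx, top_idx_large], dim=0)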
    print("done computing scores")
    print()
    # Minimum size for a community
    top_k_values = top_val[:,:min_community_size]
    # Filter for rows >= min_threshold
    print("clustering")
    extracted_communities = []
    for i in tqdm(range(len(top_k_values))):
        if top_k_values[i][-1] >= threshold:
            new_cluster = []
            # Only check top k most similar entries
            top_idx_large = top_idx[i].tolist()
            top_val_large = top_val[i].tolist()
            if top_val_large[-1] < threshold:
                for idx, val in zip(top_idx_large, top_val_large):
                    if val < threshold:
                        break
                    new_cluster.append(idx)
            else:
                # Iterate over all entries (slow)
                cos_scores = torch.mv(embeddings, embeddings[i])
                for idx, val in enumerate(cos_scores.tolist()):
                    if val >= threshold:
                        new_cluster.append(idx)
            extracted_communities.append(new_cluster)
    # Largest cluster first
    extracted_communities = sorted(extracted_communities, key=len, reverse=True)
    # Step 2) Remove overlapping communities
    unique_communities = []
    extracted_ids = set()
    for community in extracted_communities:
        add_cluster = True
        for idx in community:
            if idx in extracted_ids:
                add_cluster = False
                break
        if add_cluster:
            unique_communities.append(community)
            for idx in community:
                extracted_ids.add(idx)
    return unique_communities
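
Note that the function above scores pairs with torch.mm, which is a plain dot product; it matches cosine similarity only when every embedding has unit length. A minimal sketch of normalizing beforehand, assuming random stand-in embeddings, and comparing against PyTorch's own cosine similarity:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-in embeddings; random values for illustration only
embeddings = torch.randn(5, 4)

# Scale every row to unit L2 norm
normalized = F.normalize(embeddings, p=2, dim=1)

# Dot products of unit-length rows are cosine similarities
dot_scores = torch.mm(normalized, normalized.t())

# Reference: pairwise cosine similarity of the original rows via broadcasting
reference = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)

print(torch.allclose(dot_scores, reference, atol=1e-5))
```

Passing the normalized tensor to community_detection keeps the threshold interpretable as a cosine similarity.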
