Batch cos_sim for community_detection?

I’ve been experimenting with the community_detection method, but I quickly get OOM errors when I pass in too many embeddings.

Since it uses cos_sim to compute all pairwise embedding similarities, do you think it would make sense to add an option for batching? I expect you will find other bottlenecks when iterating over the entries, but at least it would complete on larger sets of embeddings.
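
To make the memory argument concrete, here is a rough back-of-the-envelope sketch. The embedding count and batch size are hypothetical, and the estimate assumes float32 scores and counts only the score matrix itself, not the embeddings or any top-k results kept afterwards.

```python
def similarity_matrix_bytes(num_rows, num_embeddings, bytes_per_value=4):
    """Bytes needed for a (num_rows x num_embeddings) score matrix (float32 by default)."""
    return num_rows * num_embeddings * bytes_per_value


n = 100_000         # hypothetical number of embeddings
batch_size = 1_000  # hypothetical number of rows scored per batch

full = similarity_matrix_bytes(n, n)               # all pairs at once
batched = similarity_matrix_bytes(batch_size, n)   # one batch of rows at a time

print(f"full matrix: {full / 1e9:.1f} GB")
print(f"one batch:   {batched / 1e9:.1f} GB")
```

The full matrix grows quadratically with the number of embeddings, while a single batch grows only linearly, which is why batching lets the computation finish on larger inputs.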

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 13 (3 by maintainers)

Top GitHub Comments

2 reactions
nreimers commented, Dec 13, 2021

Yes. When the models return PyTorch tensors (convert_to_tensor=True), you can move them to the CPU like this:

my_embeddings = my_embeddings.to('cpu')

The computation will then be done on the CPU.

2 reactions
nreimers commented, Jun 23, 2021

Hi @mmaybeno, @yjernite has created this batched version. Sadly, I have not yet had time to review and test it, but I hope to do so soon and integrate it into sentence transformers.

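import math
import torch
from tqdm import tqdm

def community_detection(embeddings, threshold=0.75, min_community_size=1, init_max_size=1000):
    # torch.mm below equals cosine similarity only if the rows of `embeddings` are L2-normalized
    # topk cannot return more entries than there are embeddings
    init_max_size = max(1, min(init_max_size, len(embeddings)))
    # Allocate the accumulators with the same dtype and device as the embeddings
    top_val = torch.empty(0, init_max_size, dtype=embeddings.dtype, device=embeddings.device)
    top_idx = torch.empty(0, init_max_size, dtype=torch.long, device=embeddings.device)
    # Compute cosine similarity scores one batch of rows at a time
    n_batches = math.ceil(len(embeddings) / init_max_size)
    print("computing scores")
    for b in tqdm(range(n_batches)):
        cos_scores = torch.mm(embeddings[b*init_max_size:(b+1)*init_max_size], embeddings.t())
        # Keep only the init_max_size highest scores (and their indices) for each row
        top_val_large, top_idx_large = cos_scores.topk(k=init_max_size, dim=-1, largest=True)
        top_val = torch.cat([top_val, top_val_large], dim=0)
        top_idx = torch.cat([top_idx, top_idx_large], dim=0)
    print("done computing scores")
    print()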
    # Minimum size for a community
    top_k_values = top_val[:,:min_community_size]
    # Filter for rows >= min_threshold
    print("clustering")
    extracted_communities = []
    for i in tqdm(range(len(top_k_values))):
        if top_k_values[i][-1] >= threshold:
            new_cluster = []
            # Only check top k most similar entries
            top_idx_large = top_idx[i].tolist()
            top_val_large = top_val[i].tolist()
            if top_val_large[-1] < threshold:
                for idx, val in zip(top_idx_large, top_val_large):
                    if val < threshold:
                        break
                    new_cluster.append(idx)
            else:
                # Iterate over all entries (slow)
                cos_scores = torch.mv(embeddings, embeddings[i])
                for idx, val in enumerate(cos_scores.tolist()):
                    if val >= threshold:
                        new_cluster.append(idx)
            extracted_communities.append(new_cluster)
    # Largest cluster first
    extracted_communities = sorted(extracted_communities, key=len, reverse=True)
    # Step 2) Remove overlapping communities
    unique_communities = []
    extracted_ids = set()
    for community in extracted_communities:
        add_cluster = True
        for idx in community:
            if idx in extracted_ids:
                add_cluster = False
                break
        if add_cluster:
            unique_communities.append(community)
            for idx in community:
                extracted_ids.add(idx)
    return unique_communities
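
One point the thread does not spell out: the batched function scores pairs with torch.mm, and a plain matrix product equals cosine similarity only when every embedding has unit length. The sketch below, which uses random tensors as a hypothetical stand-in for real embeddings, shows one way to normalize the rows first and checks the result against torch.nn.functional.cosine_similarity. Whether the original author assumed pre-normalized input is not stated, so treat this as an assumption.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
embeddings = torch.randn(5, 8)  # hypothetical stand-in for real sentence embeddings

# Scale every row to unit L2 length so that a dot product is a cosine similarity.
normalized = F.normalize(embeddings, p=2, dim=1)

# Dot products between unit-length rows.
dot_scores = torch.mm(normalized, normalized.t())

# Reference: pairwise cosine similarity computed directly on the unnormalized rows.
cos_scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)

print(torch.allclose(dot_scores, cos_scores, atol=1e-5))
```

Passing the normalized tensor to community_detection would then make the threshold comparable to a cosine similarity value.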