Clustering for domain-specific content
Hi! Great work on sentence-transformers, mate!
I would like to use sentence-transformers for a clustering task on a domain-specific topic, but I'm a bit lost on how to proceed. Right now, I have a Swedish BERT model that has been fine-tuned through MLM on a domain-specific dataset. I have tried both the multilingual model from SBERT and my own fine-tuned Swedish BERT model for clustering, where I load the models directly and use UMAP + HDBSCAN.
However, this is where I get confused. When I run the code below to add a pooling layer and a dense layer:
```python
from sentence_transformers import SentenceTransformer, models
from torch import nn

# model_path: name or path of the fine-tuned Swedish BERT checkpoint
word_embedding_model = models.Transformer(model_path, max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())  # mean pooling
dense_model = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=256, activation_function=nn.Tanh())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
```
Should I just start clustering directly from this point with my own model? But the pretrained SBERT models are fine-tuned on labelled data to help the model obtain more accurate sentence representations, right?
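For reference, the clustering step I run on top of the embeddings looks roughly like this (a minimal sketch; the UMAP/HDBSCAN parameters are illustrative, not tuned):

```python
import umap
import hdbscan

sentences = ["..."]  # the domain-specific texts to cluster

# Sentence embeddings from the model assembled above
embeddings = model.encode(sentences, show_progress_bar=True)

# Reduce dimensionality with UMAP, then cluster by density with HDBSCAN
reduced = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine").fit_transform(embeddings)
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(reduced)  # -1 marks noise points
```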
@Borg93 No.
I talked about this here: https://www.youtube.com/watch?v=0RV-q0--NLs
The most suitable loss is MultipleNegativesRankingLoss, for which you need pairs (anchor, positive) of two texts that are similar. These can, for example, be (question, answer) or (question, duplicate_question) pairs.
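A minimal training sketch with this loss, assuming `model` is a SentenceTransformer (for example the one assembled above); the example pairs here are purely illustrative:

```python
from sentence_transformers import InputExample, losses
from torch.utils.data import DataLoader

# Illustrative (anchor, positive) pairs; in practice you want thousands of in-domain pairs
train_examples = [
    InputExample(texts=["How do I reset my password?", "Click 'Forgot password' on the login page."]),
    InputExample(texts=["What is the return policy?", "Can I return an item after 30 days?"]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
# The other in-batch positives serve as negatives, so larger batch sizes tend to help
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```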
You see good performance gains the more (diverse) training data you add.
STSb is a horrible dataset: it is extremely small, and the sentences are extremely simple (like "A man is eating pasta"). The resulting model will struggle to understand more specialized sentences; e.g., it has no idea how TensorFlow and PyTorch are connected. Are these two similar concepts or dissimilar concepts?
When you train on a large, diverse corpus, the model has the chance to learn many concepts from various domains (tech, medicine, biology, chemistry, gaming, sports, politics, economy, …).
Language is just too diverse and complex to be captured by ~5k simple sentence pairs from the STSb train file.
Thanks for the explanation and the videos! You just got a new subscriber on YouTube 👍