
Clustering for domain specific content

Hi! Massive, great work on sentence-transformers, mate!

I would like to use sentence-transformers for a clustering task on a domain-specific topic, but I’m a bit lost on how to proceed. Right now, I have a Swedish BERT model that is fine-tuned through MLM on a domain-specific dataset. I have tried using both the multilingual model from SBERT and my own fine-tuned Swedish BERT model for clustering, where I load the models directly and use UMAP + HDBSCAN.

However, this is where I get confused. When I run the code below to add a pooling layer and a dense layer:

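from torch import nn
from sentence_transformers import SentenceTransformer, models

# `model` holds the name or path of the MLM fine-tuned Swedish BERT checkpoint
word_embedding_model = models.Transformer(model, max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
# The dense layer is newly initialised, so it is untrained at this point
dense_model = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=256, activation_function=nn.Tanh())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])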

Should I just start clustering directly from this point with my own model? But the pretrained models on SBERT are fine-tuned on labelled data to help the model obtain more accurate sentence representations, right?

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
nreimers commented, Jun 23, 2021

@Borg93 No.

I talked about this here: https://www.youtube.com/watch?v=0RV-q0--NLs

The most suitable loss is the MultipleNegativesRankingLoss, where you need pairs (anchor, positive) of two texts that are similar. This can for example be (question, answer) or (question, duplicate_question).

Here you see good performance gains the more (diverse) training data you add.
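As a rough illustration of training on (anchor, positive) pairs with this loss, a minimal sketch could look like the one below. The helper names `build_pairs` and `train_with_mnrl`, as well as the batch size, epoch count and warmup steps, are assumptions made for this example and are not taken from the thread. The library imports are deferred into the training function so that the pair-cleaning helper can be used on its own.

```python
def build_pairs(records):
    """Clean (anchor, positive) text records: strip whitespace, drop empty
    entries and exact duplicate pairs.

    Duplicates are dropped because this loss treats the other positives in a
    batch as negatives, so a repeated pair could end up as a false negative.
    """
    seen = set()
    pairs = []
    for anchor, positive in records:
        anchor, positive = anchor.strip(), positive.strip()
        if not anchor or not positive or (anchor, positive) in seen:
            continue
        seen.add((anchor, positive))
        pairs.append((anchor, positive))
    return pairs


def train_with_mnrl(model, pairs, batch_size=32, epochs=1, warmup_steps=100):
    """Fine-tune a SentenceTransformer on (anchor, positive) pairs.

    Hypothetical helper; the hyperparameter defaults are placeholders.
    """
    # Imported here so build_pairs works without these packages installed.
    from torch.utils.data import DataLoader
    from sentence_transformers import InputExample, losses

    examples = [InputExample(texts=[anchor, positive]) for anchor, positive in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=batch_size)
    loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=epochs, warmup_steps=warmup_steps)
    return model
```

With a `SentenceTransformer` instance such as the one built from the Transformer, Pooling and Dense modules in the question, this would be called as `train_with_mnrl(model, build_pairs(records))`, and the resulting embeddings could then be passed to UMAP + HDBSCAN.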

STSb is a horrible dataset, it is extremely small and the sentences are extremely simple (like “A man is eating pasta”). The resulting model will struggle to understand more specialized sentences, e.g. it has no idea how tensorflow and pytorch are connected. Are these two similar concepts or dissimilar concepts?

When you train on a large, diverse corpus, there is a chance that the model will learn many concepts from various domains (tech, medicine, biology, chemistry, gaming, sports, politics, economy, …).

Language is just too diverse and complex to be captured by ~5k simple sentence pairs from the STSb train file.

0 reactions
Borg93 commented, Jun 23, 2021

Thanks for the explanation and videos! You just got a new subscriber on youtube 👍
