Clustering for domain-specific content
Hi! Great work on sentence-transformers, mate!
I would like to use sentence-transformers for a clustering task on a domain-specific topic, but I'm a bit lost on how to proceed. Right now, I have a Swedish BERT model that has been fine-tuned through MLM on a domain-specific dataset. I have tried both the multilingual model from SBERT and my own fine-tuned Swedish BERT model for clustering, where I load the models directly and use UMAP + HDBSCAN.
However, this is where I get confused. When I run the code below to add a pooling layer and a dense layer:
```python
from sentence_transformers import SentenceTransformer, models
from torch import nn

# model_path: name or path of the fine-tuned Swedish BERT checkpoint
word_embedding_model = models.Transformer(model_path, max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())  # mean pooling
dense_model = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=256, activation_function=nn.Tanh())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
```
Should I just start clustering directly from this point with my own model? But the pretrained SBERT models are fine-tuned on labelled data to help the model obtain more accurate sentence representations, right?
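For reference, the clustering step I run on top of the embeddings looks roughly like this (a minimal sketch; the UMAP/HDBSCAN parameters are illustrative, not tuned):

```python
import umap
import hdbscan

sentences = ["..."]  # the domain-specific texts to cluster

# Sentence embeddings from the model assembled above
embeddings = model.encode(sentences, show_progress_bar=True)

# Reduce dimensionality with UMAP, then cluster by density with HDBSCAN
reduced = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine").fit_transform(embeddings)
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(reduced)  # -1 marks noise points
```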
@Borg93 No.
I talked about this here: https://www.youtube.com/watch?v=0RV-q0--NLs
The most suitable loss is MultipleNegativesRankingLoss, for which you need pairs (anchor, positive) of two texts that are similar. These can, for example, be (question, answer) or (question, duplicate_question) pairs.
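A minimal training sketch with this loss, assuming `model` is a SentenceTransformer (for example the one assembled above); the example pairs here are purely illustrative:

```python
from sentence_transformers import InputExample, losses
from torch.utils.data import DataLoader

# Illustrative (anchor, positive) pairs; in practice you want thousands of in-domain pairs
train_examples = [
    InputExample(texts=["How do I reset my password?", "Click 'Forgot password' on the login page."]),
    InputExample(texts=["What is the return policy?", "Can I return an item after 30 days?"]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
# The other in-batch positives serve as negatives, so larger batch sizes tend to help
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```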
You see good performance gains the more (diverse) training data you add.
STSb is a horrible dataset: it is extremely small, and the sentences are extremely simple (like "A man is eating pasta"). The resulting model will struggle to understand more specialized sentences; e.g., it has no idea how TensorFlow and PyTorch are connected. Are these two similar concepts or dissimilar concepts?
When you train on a large, diverse corpus, the model has the chance to learn many concepts from various domains (tech, medicine, biology, chemistry, gaming, sports, politics, economy, …).
Language is just too diverse and complex to be captured by ~5k simple sentence pairs from the STSb train file.
Thanks for the explanation and the videos! You just got a new subscriber on YouTube 👍