Poor evaluation results on a dataset
Hello @nreimers,
Thank you for the amazingly simple-to-use code!
I'm trying to fine-tune the 'bert-base-nli-mean-tokens' model to match user searches to job titles.
My training dataset consists of 934,791 sentence pairs with a score for each pair, so I use the example for continuing fine-tuning on the STS benchmark (https://github.com/UKPLab/sentence-transformers/blob/master/examples/training_stsbenchmark_continue_training.py).
I train with the parameters from the example (4 epochs with batch size 16); a sketch of my setup is below, after the results. The evaluation results I'm getting after training are the following:
2020-01-20 15:20:03 - Cosine-Similarity : Pearson: 0.0460 Spearman: 0.1820
2020-01-20 15:20:03 - Manhattan-Distance: Pearson: -0.0294 Spearman: 0.0167
2020-01-20 15:20:03 - Euclidean-Distance: Pearson: -0.0295 Spearman: 0.0169
2020-01-20 15:20:03 - Dot-Product-Similarity: Pearson: 0.0468 Spearman: 0.1853
0.18530780992075702
I believe this means that the model has not learned useful embeddings.
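Concretely, my setup roughly follows the linked example (a sketch; the file names, reader column indices, and output path are placeholders for my actual data):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, SentencesDataset, losses
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import STSDataReader

# Continue training the NLI model on (search phrase, job title, score) pairs.
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Placeholder folder/columns: TSV files with query, title, score.
reader = STSDataReader('data/', s1_col_idx=0, s2_col_idx=1, score_col_idx=2)

train_data = SentencesDataset(examples=reader.get_examples('train.tsv'), model=model)
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model=model)

dev_data = SentencesDataset(examples=reader.get_examples('dev.tsv'), model=model)
dev_dataloader = DataLoader(dev_data, shuffle=False, batch_size=16)
evaluator = EmbeddingSimilarityEvaluator(dev_dataloader)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          evaluator=evaluator,
          epochs=4,
          evaluation_steps=1000,
          warmup_steps=100,
          output_path='output/search-to-title')
```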
Here is what my dataset looks like for one search phrase:
The distribution of the score column is:
So I would consider this a balanced dataset.
What would you recommend as the next steps to improve the results?
- Continue the training until the similarity metrics reach 0.85, as in the STS example?
- Modify the model by adding a layer for search_input encoding (as you recommend here: https://github.com/UKPLab/sentence-transformers/issues/96#issuecomment-574051231)?
Any other advice would be helpful.
Thank you!
Hi @anatoly-khomenko, I'm afraid that creating an asymmetric structure is not straightforward, as the architecture was designed more for symmetric network structures.
What you can do is create a new layer derived from the models.Dense module (let's call it AsymmetricDense). Your architecture will look like this: Input -> BERT -> mean pooling -> AsymmetricDense
In AsymmetricDense, in the forward method, you add a special routine that depends on a flag in your input features:
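A minimal sketch of what that could look like (the flag name and handling are my own assumptions; the exact plumbing depends on how the feature dictionaries are built):

```python
from sentence_transformers import models


class AsymmetricDense(models.Dense):
    """Dense layer that is applied only to 'document' inputs.

    Sketch: assumes each feature dictionary carries an 'input_type' flag
    ('query' or 'document') added by a custom reader / batching step.
    """

    def forward(self, features):
        if features.get('input_type') == 'document':
            # Usual Dense behaviour: linear layer + activation on the
            # sentence embedding.
            return super().forward(features)
        # Queries pass through unchanged.
        return features
```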
Then you need a special reader. For your queries, you set feature['input_type'] = 'query'; for your documents (your titles), you set feature['input_type'] = 'document'.
The dense layer will then only be applied to input texts with input_type == 'document'.
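A rough sketch of such a reader (hypothetical class name and file format; note that getting the flag from the examples into the feature dictionaries that reach forward() may also require a custom batching/collate step):

```python
import csv
import os

from sentence_transformers.readers import InputExample


class QueryDocumentReader:
    """Hypothetical reader for TSV files with (query, job title, score) rows."""

    def __init__(self, dataset_folder, delimiter='\t'):
        self.dataset_folder = dataset_folder
        self.delimiter = delimiter

    def get_examples(self, filename):
        examples = []
        with open(os.path.join(self.dataset_folder, filename), encoding='utf-8') as f:
            for i, row in enumerate(csv.reader(f, delimiter=self.delimiter)):
                query, title, score = row[0], row[1], float(row[2])
                # texts[0] is the query side, texts[1] is the document side;
                # their feature dicts later get input_type 'query' / 'document'.
                examples.append(InputExample(guid=str(i), texts=[query, title], label=score))
        return examples
```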
Hi @anatoly-khomenko, some notes:
The STS dataset has scores between 0 and 5. Hence, the STS reader normalizes the scores by dividing them by 5, so that you get scores between 0 and 1. If you haven't disabled this (you can pass False as a parameter), your scores would be normalized to the range 0 - 0.1 (I think 0.5 is your highest score?).
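If your scores are already in the 0-1 range, the normalization should be turned off (or the min/max adjusted), roughly like this (column indices are placeholders):

```python
from sentence_transformers.readers import STSDataReader

# Scores in the file are already in [0, 1], so don't divide them by 5 again.
reader = STSDataReader('data/', s1_col_idx=0, s2_col_idx=1, score_col_idx=2,
                       normalize_scores=False)
```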
You have an asymmetric use case: it makes a difference which text is the query and which text is the response, i.e., swapping the two would change the meaning in your case. The models here are optimized for the symmetric use case, i.e., sim(A, B) = sim(B, A).
For your task, using an asymmetric structure could be helpful. You add one (or more) dense layers to one part of the network, for example: A -> BERT -> Mean-Pooling -> Output; B -> BERT -> Mean-Pooling -> Dense -> Output.
Even if A and B are identical, B would get a different sentence embedding because one is the query and the other is the document.
Here, 'cat' in the document and 'cat' in the search query would get different vector representations, making it more challenging to match them. Non-contextualized word embeddings like GloVe are easier to use in this case, as 'cat' is always mapped to the same point in vector space.
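To make the asymmetric setup concrete, such a model could be assembled with the modules API roughly like this (a sketch on my part, reusing the AsymmetricDense idea from the previous comment; the base model name and dimensions are illustrative):

```python
from torch import nn
from sentence_transformers import SentenceTransformer, models

# Shared BERT encoder and mean pooling for both the query (A) and document (B) side.
word_embedding_model = models.BERT('bert-base-uncased')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

dim = pooling_model.get_sentence_embedding_dimension()

# Extra dense layer that should only fire for documents (see AsymmetricDense above).
asym_dense = AsymmetricDense(in_features=dim, out_features=dim,
                             activation_function=nn.Tanh())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, asym_dense])
```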