Normalize before using LogisticRegression


Hi,

As far as I can see, setfit applies LogisticRegression on top of the output of the sentence transformer model. See here:

https://github.com/huggingface/setfit/blob/7735e8e3b208edb8dfb549beb16e585453c5f44e/src/setfit/modeling.py#L49-L55

The problem I see is that the output is not normalized by default. Cosine similarity, which is used to compare the embeddings, ignores the length of the vectors, so unnormalized output is fine there. IMO it is not fine when LogisticRegression is applied, because a linear model does depend on vector length.
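To illustrate the point about vector length, here is a minimal sketch using only NumPy. The vectors, the weights `w`, and the bias are random stand-ins invented for this example, not values from setfit. It shows that rescaling a vector leaves its cosine similarity unchanged while the linear score fed into a logistic regression scales with it.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)

# Hypothetical weights and bias of a linear classifier head.
w = rng.normal(size=8)
bias = 0.5

def logit(x):
    """Linear score w.x + bias that a logistic regression passes to the sigmoid."""
    return float(np.dot(w, x) + bias)

scaled_a = 10.0 * a  # same direction, ten times the length

cos_before = cosine_similarity(a, b)
cos_after = cosine_similarity(scaled_a, b)
logit_before = logit(a)
logit_after = logit(scaled_a)

print(f"cosine: {cos_before:.4f} -> {cos_after:.4f}")
print(f"logit:  {logit_before:.4f} -> {logit_after:.4f}")
```

The cosine value is identical for both lengths, whereas the part of the logit that comes from `w.x` is multiplied by ten.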

IMO the embeddings should be normalized to unit length before LogisticRegression is applied. That would be done by passing normalize_embeddings=True to the encode function. See here:

https://github.com/UKPLab/sentence-transformers/blob/0422a5e07a5a998948721dea435235b342a9f610/sentence_transformers/SentenceTransformer.py#L111-L118
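As a rough sketch of the proposal, the same unit-length normalization can also be applied on the scikit-learn side. The example below uses random arrays as stand-ins for sentence embeddings and a `Normalizer` step in front of `LogisticRegression`; it is an illustration under those assumptions, not the code setfit uses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

rng = np.random.default_rng(0)

# Stand-in for sentence embeddings: 40 samples, 16 dimensions, rows of varying length.
embeddings = rng.normal(size=(40, 16)) * rng.uniform(0.5, 5.0, size=(40, 1))
labels = np.array([0, 1] * 20)  # arbitrary labels, for illustration only

# Normalizer rescales each row to unit L2 norm before the classifier sees it.
head = make_pipeline(Normalizer(norm="l2"), LogisticRegression(max_iter=1000))
head.fit(embeddings, labels)

# Inspect what the classifier actually receives.
unit_rows = head.named_steps["normalizer"].transform(embeddings)
row_norms = np.linalg.norm(unit_rows, axis=1)
predictions = head.predict(embeddings)

print(row_norms.min(), row_norms.max())
```

With sentence-transformers itself, the issue's suggestion is to pass `normalize_embeddings=True` to `encode` instead, which makes the separate normalization step unnecessary.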

What do you think? I can provide a PR if wanted.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

2 reactions
doubianimehdi commented, Oct 30, 2022

That explains the messages I'm getting when using fit and then trying to predict:

_logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

df_ri_manclassif['predicted'] = model(df_ri_manclassif['global_text'].to_list())

This is the message:

---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_7856/2981536964.py in <module>
----> 1 df_ri_manclassif['predicted']= model(df_ri_manclassif['global_text'].to_list())

c:\Users\doub2420\AppData\Local\Programs\Python\Python39\lib\site-packages\setfit\modeling.py in __call__(self, inputs)
     60     def __call__(self, inputs):
     61         embeddings = self.model_body.encode(inputs)
---> 62         return self.model_head.predict(embeddings)
     63
     64     def _save_pretrained(self, save_directory):

c:\Users\doub2420\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\linear_model\_base.py in predict(self, X)
    445             Vector containing the class labels for each sample.
    446         """
--> 447         scores = self.decision_function(X)
    448         if len(scores.shape) == 1:
    449             indices = (scores > 0).astype(int)

c:\Users\doub2420\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\linear_model\_base.py in decision_function(self, X)
    425             this class would be predicted.
    426         """
--> 427         check_is_fitted(self)
    428
    429         X = self._validate_data(X, accept_sparse="csr", reset=False)
…
-> 1345         raise NotFittedError(msg % {"name": type(estimator).__name__})
   1346
   1347
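The two messages above come from different situations: the ConvergenceWarning is only a warning emitted during fitting, while NotFittedError is what scikit-learn raises when `predict` is called on an estimator that has not been fitted. The sketch below reproduces the second case with plain scikit-learn and random stand-in data (the arrays and the `max_iter=1000` value are assumptions for illustration), and raises `max_iter` above its default of 100 as the warning text suggests.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))   # stand-in for embeddings
y = np.array([0, 1, 2] * 10)   # arbitrary labels, for illustration only

# A higher iteration limit is one of the two remedies named in the warning.
clf = LogisticRegression(max_iter=1000)

# Calling predict before fit raises NotFittedError.
try:
    clf.predict(X)
    raised_before_fit = False
except NotFittedError:
    raised_before_fit = True

# After fitting, predict works as expected.
clf.fit(X, y)
predictions = clf.predict(X)

print(raised_before_fit, predictions.shape)
```

Whether the error in the traceback has this cause cannot be determined from the thread alone; the sketch only shows the condition under which scikit-learn raises it.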

1 reaction
PhilipMay commented, Oct 31, 2022

Well, I made a branch where I can enable normalize.

I did some tests with default settings, a normal BERT model (no pretrained sentence embedding model) and optuna.

Letting optuna also optimize the normalize parameter (True or False) shows that it is definitely better NOT to normalize.

This is strange since it means that the length of the vector seems to encode important information…
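One way to picture how length could carry information is a deliberately extreme toy case, sketched below with NumPy. The data is constructed for this example and says nothing about what actually happens inside setfit: two classes share the same directions and differ only in vector length, so L2 normalization makes them indistinguishable.

```python
import numpy as np

rng = np.random.default_rng(0)
directions = rng.normal(size=(5, 4))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Toy construction: both classes share the same directions and differ only in length.
class_short = 1.0 * directions
class_long = 3.0 * directions

def l2_normalize(x):
    """Rescale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Before normalization the classes are separated by their norms.
norm_gap_before = np.linalg.norm(class_long, axis=1) - np.linalg.norm(class_short, axis=1)

# After normalization the rows of both classes coincide.
max_diff_after = np.abs(l2_normalize(class_long) - l2_normalize(class_short)).max()

print(norm_gap_before, max_diff_after)
```

In such a case no classifier could separate the normalized points, which is consistent with the observation that normalizing can hurt when length is informative.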

See here: [image not preserved]
