Normalize before using LogisticRegression
Hi,
as far as I can see, setfit applies LogisticRegression on top of the output of the sentence transformer model. See here:
The problem I see is that the output is not normalized by default. With cosine similarity the length of the vector does not matter, so unnormalized embeddings are fine for similarity comparisons, but IMO they are not fine as input to LogisticRegression.
IMO the embeddings should be normalized to unit length before LogisticRegression is applied. That would be done by passing normalize_embeddings=True to the encode function. See here:
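What normalize_embeddings=True effectively does can be sketched with NumPy and scikit-learn alone. The random matrix below is only a stand-in for real sentence-transformer output, and the unit-normalization step mimics what encode would do with that flag set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize

# Stand-in for sentence-transformer output: random embeddings whose
# rows have very different lengths (illustrative data, not real setfit output).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 16)) * rng.uniform(0.1, 10.0, size=(100, 1))
labels = rng.integers(0, 2, size=100)

# L2-normalize each row to unit length -- the effect of passing
# normalize_embeddings=True to SentenceTransformer.encode.
unit_embeddings = normalize(embeddings, norm="l2")

# The head is then fitted on unit-length vectors, so dot products equal
# cosine similarities and all features live on a comparable scale.
clf = LogisticRegression().fit(unit_embeddings, labels)
```

After normalization every row has norm 1, so the classifier only ever sees directional information.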
What do you think? I can provide a PR if wanted.
Issue Analytics
- State:
- Created a year ago
- Comments:5 (4 by maintainers)
Top Results From Across the Web
Is standardization needed before fitting logistic regression?
You don't need to standardize unless your regression is regularized. However, it sometimes helps interpretability, and rarely hurts. – alex. Jan ...

Effects of Normalization Techniques on Logistic Regression
Three alternative normalization procedures were used to evaluate the performance of the logistic regression model. Normalizing a dataset is intended to improve ...

Should input data to logistic regression be normalized? - Quora
No, logistic regression does not require any particular distribution for the independent variables. They can be normal, skewed, categorical or whatever. No ...

How, When, and Why Should You Normalize / Standardize ...
“Normalizing” a vector most often means dividing by a norm of the vector. It also often refers to rescaling by the minimum and...

Normalization with logistic function? - ResearchGate
In logistics numeric features should be normalized so that each feature contributes approximately proportionately to the final distance. This can provide ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
That explains those messages I’m getting when using fit and trying to predict:

    _logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
    STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
    Increase the number of iterations (max_iter) or scale the data as shown in:
        https://scikit-learn.org/stable/modules/preprocessing.html
    Please also refer to the documentation for alternative solver options:
        https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
      n_iter_i = _check_optimize_result(
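The warning itself names two remedies: raise max_iter or scale the data. Both can be sketched with scikit-learn alone; the synthetic, badly scaled data here stands in for unnormalized embeddings and is not the setfit pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features blown up in magnitude -- large, uneven scales like
# unnormalized embeddings are what typically trips the lbfgs iteration limit.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X = X * 1000.0

# Remedy 1: give the solver a larger iteration budget.
clf_iter = LogisticRegression(max_iter=5000).fit(X, y)

# Remedy 2: scale the data first, as the warning message suggests.
clf_scaled = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
```

Scaling is usually the better fix, since it addresses the conditioning problem rather than working around it.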
```
df_ri_manclassif['predicted'] = model(df_ri_manclassif['global_text'].to_list())
```

this is the message:

```
---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_7856/2981536964.py in <module>
----> 1 df_ri_manclassif['predicted'] = model(df_ri_manclassif['global_text'].to_list())

c:\Users\doub2420\AppData\Local\Programs\Python\Python39\lib\site-packages\setfit\modeling.py in __call__(self, inputs)
     60     def __call__(self, inputs):
     61         embeddings = self.model_body.encode(inputs)
---> 62         return self.model_head.predict(embeddings)
     63
     64     def _save_pretrained(self, save_directory):

c:\Users\doub2420\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\linear_model\_base.py in predict(self, X)
    445         Vector containing the class labels for each sample.
    446         """
--> 447         scores = self.decision_function(X)
    448         if len(scores.shape) == 1:
    449             indices = (scores > 0).astype(int)

c:\Users\doub2420\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\linear_model\_base.py in decision_function(self, X)
    425         this class would be predicted.
    426         """
--> 427         check_is_fitted(self)
    428
    429         X = self._validate_data(X, accept_sparse="csr", reset=False)
...
-> 1345     raise NotFittedError(msg % {"name": type(estimator).__name__})
   1346
   1347
```
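That NotFittedError means the LogisticRegression head was never fitted, i.e. predict was reached before any training happened — in setfit terms, likely the model was called before the trainer's train step ran. The failure mode can be reproduced with scikit-learn alone (no setfit involved):

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression

head = LogisticRegression()
embeddings = np.zeros((1, 4))  # stand-in for encoded sentences

# Calling predict on an unfitted head raises NotFittedError,
# exactly as in the traceback above.
try:
    head.predict(embeddings)
    raised = False
except NotFittedError:
    raised = True

# Fitting the head first makes the error go away.
head.fit(np.random.default_rng(0).normal(size=(10, 4)), [0, 1] * 5)
prediction = head.predict(embeddings)
```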
Well, I made a branch where I can enable normalization.
I did some tests with default settings, a plain BERT model (no pretrained sentence embedding model) and optuna.
Letting optuna also optimize the normalize parameter (True or False) shows that it is definitely better NOT to normalize.
This is strange, since it means that the length of the vector seems to encode important information…
See here:
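One way that finding could arise: if vector length itself correlates with the class, unit-normalizing throws that signal away. The sketch below uses deliberately contrived synthetic data, chosen so that only the norm is informative; it is an illustration of the mechanism, not a claim about real sentence embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize

# Contrived data: directions are random noise, but class-1 vectors are
# twice as long, so the norm carries essentially all of the label signal.
rng = np.random.default_rng(0)
directions = np.abs(rng.normal(size=(400, 8)))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
labels = rng.integers(0, 2, size=400)
X = directions * np.where(labels == 1, 2.0, 1.0)[:, None]

# On the raw vectors the length signal is available to the classifier.
raw_acc = LogisticRegression().fit(X, labels).score(X, labels)

# After unit-normalization both classes look identical, so accuracy drops.
X_unit = normalize(X, norm="l2")
unit_acc = LogisticRegression().fit(X_unit, labels).score(X_unit, labels)
```

If setfit embeddings behave at all like this, forcing normalization would discard a usable feature, which would be consistent with the optuna result above.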