Normalize before using LogisticRegression
Hi,
as far as I can see, setfit applies LogisticRegression on top of the output of the sentence transformer model. See here:
The problem I see is that the output is not normalized by default. With cosine similarity the length of the vector does not matter, so unnormalized embeddings are fine for similarity comparisons, but IMO they are not fine as input to LogisticRegression.
IMO the embeddings should be normalized to unit length before LogisticRegression is applied. That would be done by passing normalize_embeddings=True to the encode function. See here:
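What normalize_embeddings=True effectively does can be sketched with NumPy and scikit-learn alone. The random matrix below is only a stand-in for real sentence-transformer output, and the unit-normalization step mimics what encode would do with that flag set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize

# Stand-in for sentence-transformer output: random embeddings whose
# rows have very different lengths (illustrative data, not real setfit output).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 16)) * rng.uniform(0.1, 10.0, size=(100, 1))
labels = rng.integers(0, 2, size=100)

# L2-normalize each row to unit length -- the effect of passing
# normalize_embeddings=True to SentenceTransformer.encode.
unit_embeddings = normalize(embeddings, norm="l2")

# The head is then fitted on unit-length vectors, so dot products equal
# cosine similarities and all features live on a comparable scale.
clf = LogisticRegression().fit(unit_embeddings, labels)
```

After normalization every row has norm 1, so the classifier only ever sees directional information.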
What do you think? I can provide a PR if wanted.
Issue Analytics
- State:
- Created a year ago
- Comments:5 (4 by maintainers)
Top Results From Across the Web
Is standardization needed before fitting logistic regression?
You don't need to standardize unless your regression is regularized. However, it sometimes helps interpretability, and rarely hurts. – alex. Jan ...

Effects of Normalization Techniques on Logistic Regression
Three alternative normalization procedures were used to evaluate the performance of the logistic regression model. Normalizing a dataset is intended to improve ...

Should input data to logistic regression be normalized? - Quora
No, logistic regression does not require any particular distribution for the independent variables. They can be normal, skewed, categorical or whatever. No ...

How, When, and Why Should You Normalize / Standardize ...
“Normalizing” a vector most often means dividing by a norm of the vector. It also often refers to rescaling by the minimum and...

Normalization with logistic function? - ResearchGate
In logistics numeric features should be normalized so that each feature contributes approximately proportionately to the final distance. This can provide ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
That explains those messages I’m getting when using fit and trying to predict:

    _logistic.py:444: ConvergenceWarning: lbfgs failed to converge (status=1):
    STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
    Increase the number of iterations (max_iter) or scale the data as shown in:
        https://scikit-learn.org/stable/modules/preprocessing.html
    Please also refer to the documentation for alternative solver options:
        https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
      n_iter_i = _check_optimize_result(
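The warning itself names two remedies: raise max_iter or scale the data. Both can be sketched with scikit-learn alone; the synthetic, badly scaled data here stands in for unnormalized embeddings and is not the setfit pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features blown up in magnitude -- large, uneven scales like
# unnormalized embeddings are what typically trips the lbfgs iteration limit.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X = X * 1000.0

# Remedy 1: give the solver a larger iteration budget.
clf_iter = LogisticRegression(max_iter=5000).fit(X, y)

# Remedy 2: scale the data first, as the warning message suggests.
clf_scaled = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
```

Scaling is usually the better fix, since it addresses the conditioning problem rather than working around it.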
```
df_ri_manclassif['predicted'] = model(df_ri_manclassif['global_text'].to_list())
```

this is the message:

```
---------------------------------------------------------------------------
NotFittedError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_7856/2981536964.py in <module>
----> 1 df_ri_manclassif['predicted'] = model(df_ri_manclassif['global_text'].to_list())

c:\Users\doub2420\AppData\Local\Programs\Python\Python39\lib\site-packages\setfit\modeling.py in __call__(self, inputs)
     60     def __call__(self, inputs):
     61         embeddings = self.model_body.encode(inputs)
---> 62         return self.model_head.predict(embeddings)
     63
     64     def _save_pretrained(self, save_directory):

c:\Users\doub2420\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\linear_model\_base.py in predict(self, X)
    445         Vector containing the class labels for each sample.
    446         """
--> 447         scores = self.decision_function(X)
    448         if len(scores.shape) == 1:
    449             indices = (scores > 0).astype(int)

c:\Users\doub2420\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\linear_model\_base.py in decision_function(self, X)
    425         this class would be predicted.
    426         """
--> 427         check_is_fitted(self)
    428
    429         X = self._validate_data(X, accept_sparse="csr", reset=False)
...
-> 1345     raise NotFittedError(msg % {"name": type(estimator).__name__})
   1346
   1347
```
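That NotFittedError means the LogisticRegression head was never fitted, i.e. predict was reached before any training happened — in setfit terms, likely the model was called before the trainer's train step ran. The failure mode can be reproduced with scikit-learn alone (no setfit involved):

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression

head = LogisticRegression()
embeddings = np.zeros((1, 4))  # stand-in for encoded sentences

# Calling predict on an unfitted head raises NotFittedError,
# exactly as in the traceback above.
try:
    head.predict(embeddings)
    raised = False
except NotFittedError:
    raised = True

# Fitting the head first makes the error go away.
head.fit(np.random.default_rng(0).normal(size=(10, 4)), [0, 1] * 5)
prediction = head.predict(embeddings)
```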
Well, I made a branch where I can enable normalization.
I did some tests with default settings, a plain BERT model (no pretrained sentence embedding model) and optuna.
Letting optuna also optimize the normalize parameter (True or False) shows that it is definitely better NOT to normalize.
This is strange, since it means that the length of the vector seems to encode important information…
See here:
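One way that finding could arise: if vector length itself correlates with the class, unit-normalizing throws that signal away. The sketch below uses deliberately contrived synthetic data, chosen so that only the norm is informative; it is an illustration of the mechanism, not a claim about real sentence embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import normalize

# Contrived data: directions are random noise, but class-1 vectors are
# twice as long, so the norm carries essentially all of the label signal.
rng = np.random.default_rng(0)
directions = np.abs(rng.normal(size=(400, 8)))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
labels = rng.integers(0, 2, size=400)
X = directions * np.where(labels == 1, 2.0, 1.0)[:, None]

# On the raw vectors the length signal is available to the classifier.
raw_acc = LogisticRegression().fit(X, labels).score(X, labels)

# After unit-normalization both classes look identical, so accuracy drops.
X_unit = normalize(X, norm="l2")
unit_acc = LogisticRegression().fit(X_unit, labels).score(X_unit, labels)
```

If setfit embeddings behave at all like this, forcing normalization would discard a usable feature, which would be consistent with the optuna result above.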