Cannot specify max_length more than 512
Hello,
I’ve tried to use a max_length greater than 512 to featurize text:
import finetune

model = finetune.Classifier()
trn_X_q_vecs = model.featurize(trn_X_q, max_length=1000)
But I got the following exception:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-d3d9e8b820e5> in <module>()
----> 1 trn_X_q_vecs = model.featurize(trn_X_q, max_length=1000)
/opt/conda/lib/python3.6/site-packages/finetune/classifier.py in featurize(self, X, max_length)
24 :returns: np.array of features of shape (n_examples, embedding_size).
25 """
---> 26 return super().featurize(X, max_length=max_length)
27
28 def predict(self, X, max_length=None):
/opt/conda/lib/python3.6/site-packages/finetune/base.py in featurize(self, *args, **kwargs)
386 These features are the same that are fed into the target_model.
387 """
--> 388 return self._featurize(*args, **kwargs)
389
390 @classmethod
/opt/conda/lib/python3.6/site-packages/finetune/base.py in _featurize(self, Xs, max_length)
371 warnings.filterwarnings("ignore")
372 max_length = max_length or self.config.max_length
--> 373 for xmb, mmb in self._infer_prep(Xs, max_length=max_length):
374 feature_batch = self.sess.run(self.features, {
375 self.X: xmb,
/opt/conda/lib/python3.6/site-packages/finetune/base.py in _infer_prep(self, Xs, max_length)
400 def _infer_prep(self, Xs, max_length=None):
401 max_length = max_length or self.config.max_length
--> 402 arr_encoded = self._text_to_ids(Xs, max_length=max_length)
403 n_batch_train = self.config.batch_size * max(len(self.config.visible_gpus), 1)
404 self._build_model(n_updates_total=0, target_dim=self.target_dim, train=False)
/opt/conda/lib/python3.6/site-packages/finetune/base.py in _text_to_ids(self, Xs, Y, max_length)
156 else:
157 encoder_out = self.encoder.encode_multi_input(Xs, Y=Y, max_length=max_length)
--> 158 return self._array_format(encoder_out)
159
160
/opt/conda/lib/python3.6/site-packages/finetune/base.py in _array_format(self, encoded_output)
421 for i, seq_length in enumerate(seq_lengths):
422 # BPE embedding
--> 423 x[i, :seq_length, 0] = encoded_output.token_ids[i]
424 # masking: value of 1 means "consider this in cross-entropy LM loss"
425 mask[i, 1:seq_length] = 1
ValueError: cannot copy sequence with size 667 to array axis with dimension 512
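The traceback shows where the limit bites: _array_format allocates the model input with a sequence axis of 512 positions (the positional-embedding limit inherited from the original OpenAI transformer, as the maintainers note below), so the 667-token encoding produced under max_length=1000 cannot be copied into it. Until the model itself supports longer contexts, one workaround is to keep each input under the limit, for example by splitting long documents into chunks and pooling the chunk features. The sketch below is illustrative only: featurize_long, the 300-word chunk size, and mean pooling are assumptions, not part of the finetune API, and the word-count heuristic does not guarantee the BPE encoding stays under 512 tokens.

import numpy as np
import finetune

def featurize_long(model, texts, words_per_chunk=300):
    # Hypothetical helper: split each document into word-level chunks that are
    # likely to encode to fewer than 512 BPE tokens, featurize the chunks, and
    # mean-pool the chunk vectors into a single document vector.
    doc_vectors = []
    for text in texts:
        words = text.split()
        chunks = [
            " ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)
        ] or [text]
        chunk_feats = model.featurize(chunks)  # (n_chunks, embedding_size)
        doc_vectors.append(chunk_feats.mean(axis=0))
    return np.stack(doc_vectors)

model = finetune.Classifier()
trn_X_q_vecs = featurize_long(model, trn_X_q)

Mean pooling is only the simplest aggregation; max pooling or keeping just the first chunk are equally easy swaps depending on the task.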
@thinline72 also thanks for the recommendation to take a peek at sentencepiece! Might have to use that the next time we’re training a model from scratch – tokenization + de-tokenization has been a huge pain in this repository; perhaps a project like sentencepiece could help clean that up.

@thinline72 correct, this was a necessity inherited from the original OpenAI repository. We’re keeping our handling of tokenization as close as possible to the source implementation to prevent performance regressions.
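For reference, the kind of lossless tokenize/de-tokenize round trip being discussed looks roughly like the following with sentencepiece. The corpus file corpus.txt, the model prefix m, and the vocabulary size are placeholder values, and none of this is part of finetune.

import sentencepiece as spm

# Train a small subword model on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.Train("--input=corpus.txt --model_prefix=m --vocab_size=8000")

sp = spm.SentencePieceProcessor()
sp.Load("m.model")

pieces = sp.EncodeAsPieces("Cannot specify max_length more than 512")
ids = sp.EncodeAsIds("Cannot specify max_length more than 512")
text = sp.DecodePieces(pieces)  # de-tokenization recovers the input (modulo normalization)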