Custom tokenizer layer

See original GitHub issue

Hi, I would like to incorporate the tokenization process into a model that uses a BERT layer. Here is my custom layer:

class TokenizationLayer(tf.keras.layers.Layer):
    def __init__(self, vocab_path, max_length, **kwargs):
        self.vocab_path = vocab_path
        self.length = max_length
        self.tokenizer = bert.bert_tokenization.FullTokenizer(vocab_path, do_lower_case=False)
        super(TokenizationLayer, self).__init__(**kwargs)

    def call(self,inputs):
        tokens = self.tokenizer.tokenize(inputs)
        ids = self.tokenizer.convert_tokens_to_ids(tokens)
        ids += [self.tokenizer.vocab['[PAD]']] * (self.length-len(ids))
        return ids

And here is my code to test the custom layer within a dummy model:

inputs = tf.keras.layers.Input(shape=(), dtype='string')
tokenization_layer = TokenizationLayer(vocab_path, 10, dtype=tf.string)
outputs = tokenization_layer(inputs)
model = tf.keras.models.Model(inputs=inputs, outputs=outputs)

I get the following traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-68-8df4885e5c7a> in <module>
      1 inputs = tf.keras.layers.Input(shape=(), dtype='string')
      2 tokenization_layer = TokenizationLayer(vocab_path, 10, dtype=tf.string)
----> 3 outputs = tokenization_layer(inputs)
      4 model = tf.keras.models.Model(inputs=inputs, outputs=outputs)

/s/Demo/venv/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py in __call__(self, *args, **kwargs)
    924     if _in_functional_construction_mode(self, inputs, args, kwargs, input_list):
    925       return self._functional_construction_call(inputs, args, kwargs,
--> 926                                                 input_list)
    927 
    928     # Maintains info about the `Layer.call` stack.

/s/Demo/venv/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py in _functional_construction_call(self, inputs, args, kwargs, input_list)
   1115           try:
   1116             with ops.enable_auto_cast_variables(self._compute_dtype_object):
-> 1117               outputs = call_fn(cast_inputs, *args, **kwargs)
   1118 
   1119           except errors.OperatorNotAllowedInGraphError as e:

/s/Demo/venv/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py in wrapper(*args, **kwargs)
    256       except Exception as e:  # pylint:disable=broad-except
    257         if hasattr(e, 'ag_error_metadata'):
--> 258           raise e.ag_error_metadata.to_exception(e)
    259         else:
    260           raise

ValueError: in user code:

    <ipython-input-60-d6c12f7d1b14>:17 call  *
        tokens = self.tokenizer.tokenize(inputs)
    /s/Demo/venv/lib/python3.7/site-packages/bert/tokenization/bert_tokenization.py:172 tokenize  *
        for token in self.basic_tokenizer.tokenize(text):
    /s/Demo/venv/lib/python3.7/site-packages/bert/tokenization/bert_tokenization.py:198 tokenize  *
        text = convert_to_unicode(text)
    /s/Demo/venv/lib/python3.7/site-packages/bert/tokenization/bert_tokenization.py:86 convert_to_unicode  *
        raise ValueError("Unsupported string type: %s" % (type(text)))

    ValueError: Unsupported string type: <class 'tensorflow.python.framework.ops.Tensor'>

Can you please help me solve this issue? I think the problem is that the tokenizer receives tensors rather than Python strings, which is why it can't tokenize them. But if that is the case, how should I make this work? Thanks
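The diagnosis above matches the last frame of the traceback. Here is a minimal sketch of the failure in plain Python, assuming the tokenizer's string check accepts only `str` and `bytes`, as the error message suggests. `convert_to_unicode_sketch`, `FakeSymbolicTensor` and `try_convert` are hypothetical stand-ins for illustration, not part of bert or TensorFlow.

```python
def convert_to_unicode_sketch(text):
    """Simplified stand-in for the type check implied by the traceback."""
    if isinstance(text, str):
        return text
    if isinstance(text, bytes):
        return text.decode("utf-8", "ignore")
    # Anything else (including a tensor object) is rejected.
    raise ValueError("Unsupported string type: %s" % type(text))


class FakeSymbolicTensor:
    """Hypothetical placeholder for the object Keras passes to call()."""


def try_convert(value):
    """Return the converted text, or the error message if conversion fails."""
    try:
        return convert_to_unicode_sketch(value)
    except ValueError as err:
        return "error: %s" % err


print(try_convert("hello"))               # a Python str passes through
print(try_convert(b"hello"))              # bytes are decoded
print(try_convert(FakeSymbolicTensor()))  # any other type raises
```

Inside a functional Keras model, `call` receives a placeholder tensor rather than the text itself, so a tokenizer written for Python strings has nothing to iterate over at graph-construction time.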

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions
kpe commented, Oct 21, 2020

Hope that helps:

pip install tensorflow-text

and then try something along these lines:

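import os
import tensorflow_text as text

# ckpt_dir and max_seq_len are assumed to be defined by the caller.
tokenizer = text.BertTokenizer(os.path.join(ckpt_dir, 'vocab.txt'))
# Flatten the per-word wordpiece ids, then pad/truncate to (2, max_seq_len).
tok_ids = tokenizer.tokenize(["hello, cruel world!", "abcccccccd"]).merge_dims(-2,-1).to_tensor(shape=(2, max_seq_len))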
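
To make the `merge_dims(-2,-1).to_tensor(shape=...)` step easier to follow, here is a rough pure-Python sketch of the same idea on nested lists. It assumes the tokenizer output is shaped as batch, then words, then wordpiece ids, and that the pad value is 0. `merge_and_pad` is a hypothetical helper and the ids are made up.

```python
def merge_and_pad(batch, max_seq_len, pad_id=0):
    """Flatten per-word wordpiece ids, then pad or truncate each row."""
    rows = []
    for sentence in batch:
        # Merge the word and wordpiece dimensions into one flat list.
        flat = [wp for word in sentence for wp in word]
        # Truncate rows that are too long, then pad short rows.
        flat = flat[:max_seq_len]
        rows.append(flat + [pad_id] * (max_seq_len - len(flat)))
    return rows


# Two sentences: one with two words, one with a single five-piece word.
batch = [[[7], [8, 9]], [[4, 5, 6, 2, 3]]]
print(merge_and_pad(batch, max_seq_len=4))
```

Unlike the padding in the question's layer, this also truncates rows longer than `max_seq_len`, so every row has the same length.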

0 reactions
keeson commented, Jan 26, 2021

It didn't work; it still throws OperatorNotAllowedInGraphError.
