Custom tokenizer layer

Hi, I would like to incorporate the tokenization process into a model that uses a BERT layer. Here is my custom layer:

class TokenizationLayer(tf.keras.layers.Layer):
    def __init__(self, vocab_path, max_length, **kwargs):
        self.vocab_path = vocab_path
        self.length = max_length
        self.tokenizer = bert.bert_tokenization.FullTokenizer(vocab_path, do_lower_case=False)
        super(TokenizationLayer, self).__init__(**kwargs)

    def call(self,inputs):
        tokens = self.tokenizer.tokenize(inputs)
        ids = self.tokenizer.convert_tokens_to_ids(tokens)
        ids += [self.tokenizer.vocab['[PAD]']] * (self.length-len(ids))
        return ids

And here is my code to test the custom layer within a dummy model:

inputs = tf.keras.layers.Input(shape=(), dtype='string')
tokenization_layer = TokenizationLayer(vocab_path, 10, dtype=tf.string)
outputs = tokenization_layer(inputs)
model = tf.keras.models.Model(inputs=inputs, outputs=outputs)

I get the following traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-68-8df4885e5c7a> in <module>
      1 inputs = tf.keras.layers.Input(shape=(), dtype='string')
      2 tokenization_layer = TokenizationLayer(vocab_path, 10, dtype=tf.string)
----> 3 outputs = tokenization_layer(inputs)
      4 model = tf.keras.models.Model(inputs=inputs, outputs=outputs)

/s/Demo/venv/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py in __call__(self, *args, **kwargs)
    924     if _in_functional_construction_mode(self, inputs, args, kwargs, input_list):
    925       return self._functional_construction_call(inputs, args, kwargs,
--> 926                                                 input_list)
    927 
    928     # Maintains info about the `Layer.call` stack.

/s/Demo/venv/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py in _functional_construction_call(self, inputs, args, kwargs, input_list)
   1115           try:
   1116             with ops.enable_auto_cast_variables(self._compute_dtype_object):
-> 1117               outputs = call_fn(cast_inputs, *args, **kwargs)
   1118 
   1119           except errors.OperatorNotAllowedInGraphError as e:

/s/Demo/venv/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py in wrapper(*args, **kwargs)
    256       except Exception as e:  # pylint:disable=broad-except
    257         if hasattr(e, 'ag_error_metadata'):
--> 258           raise e.ag_error_metadata.to_exception(e)
    259         else:
    260           raise

ValueError: in user code:

    <ipython-input-60-d6c12f7d1b14>:17 call  *
        tokens = self.tokenizer.tokenize(inputs)
    /s/Demo/venv/lib/python3.7/site-packages/bert/tokenization/bert_tokenization.py:172 tokenize  *
        for token in self.basic_tokenizer.tokenize(text):
    /s/Demo/venv/lib/python3.7/site-packages/bert/tokenization/bert_tokenization.py:198 tokenize  *
        text = convert_to_unicode(text)
    /s/Demo/venv/lib/python3.7/site-packages/bert/tokenization/bert_tokenization.py:86 convert_to_unicode  *
        raise ValueError("Unsupported string type: %s" % (type(text)))

    ValueError: Unsupported string type: <class 'tensorflow.python.framework.ops.Tensor'>

Can you please help me solve this issue? I think the problem is that the tokenizer receives tensors rather than strings, which is why it can’t tokenize them. If that is the case, how should I make this work? Thanks.
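The diagnosis matches the last frames of the traceback: `convert_to_unicode` accepts only `str` or `bytes` and raises for anything else. Below is a minimal sketch of that check. It is a simplified stand-in, not the library's exact source, and `SymbolicTensor` and `try_convert` are hypothetical names used only to play the role of the tensor that Keras passes to `call()` while building the model.

```python
def convert_to_unicode(text):
    """Simplified stand-in for the type check that raised in the traceback."""
    if isinstance(text, str):
        return text
    if isinstance(text, bytes):
        return text.decode("utf-8", "ignore")
    # Anything that is not str/bytes (for example a tensor) ends up here.
    raise ValueError("Unsupported string type: %s" % (type(text)))


class SymbolicTensor:
    """Hypothetical placeholder for the tensor Keras hands to call()."""


def try_convert(value):
    """Return the converted text, or the error message if the type is rejected."""
    try:
        return convert_to_unicode(value)
    except ValueError as err:
        return "error: %s" % err


plain = try_convert("hello world")         # a Python str is accepted
raw = try_convert(b"hello world")          # bytes are decoded to str
symbolic = try_convert(SymbolicTensor())   # any other type is rejected
```

Because the layer is called on a symbolic input during functional model construction, the pure-Python tokenizer never sees an actual string, so the tokenization has to be expressed with ops that accept tensors.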

Top GitHub Comments

kpe commented, Oct 21, 2020

Hope this helps:

pip install tensorflow-text

and then try something along these lines:

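import os
import tensorflow_text as text

# ckpt_dir and max_seq_len are placeholders: set them to your checkpoint directory and padded sequence length
tokenizer = text.BertTokenizer(os.path.join(ckpt_dir, 'vocab.txt'))
# tokenize() returns a RaggedTensor; merge the word and wordpiece dims, then pad/truncate to a dense tensor
tok_ids = tokenizer.tokenize(["hello, cruel world!", "abcccccccd"]).merge_dims(-2,-1).to_tensor(shape=(2, max_seq_len))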
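To make the shape handling in that one-liner concrete, here is a plain-Python sketch of the two steps that follow `tokenize()`: merging the per-word wordpiece lists into one list per sentence, then padding or truncating each sentence to a fixed length. The ids are made up for illustration, `flatten_and_pad` is a hypothetical helper, and a padding id of 0 is an assumption chosen to mirror the default fill value of `to_tensor`.

```python
def flatten_and_pad(batch, max_seq_len, pad_id=0):
    """Flatten [sentence][word][wordpiece] ids and pad/truncate each sentence."""
    rows = []
    for words in batch:
        # Equivalent in spirit to merge_dims(-2, -1): one flat id list per sentence.
        flat = [piece for word in words for piece in word]
        # Equivalent in spirit to to_tensor(shape=...): truncate, then pad.
        flat = flat[:max_seq_len]
        rows.append(flat + [pad_id] * (max_seq_len - len(flat)))
    return rows


# Made-up wordpiece ids: two sentences with a different number of pieces.
batch = [
    [[101], [102, 103], [104]],          # 4 wordpieces in total
    [[201, 202, 203, 204, 205, 206]],    # 6 wordpieces in a single word
]
padded = flatten_and_pad(batch, max_seq_len=5)
```

Unlike the padding in the original `call()`, this version also truncates sentences that are longer than the target length, so every row comes out with the same size.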

keeson commented, Jan 26, 2021

It didn’t work; it still throws OperatorNotAllowedInGraphError.