
How to properly apply a tokenizer map function to a TensorFlow batched dataset?


Consider the following batched_dataset:

import tensorflow as tf

samples = [{"query": "this is a query 1", "doc": "this is one relevant document regarding query 1"},
           {"query": "this is a query 2", "doc": "this is one relevant document regarding query 2"},
           {"query": "this is a query 3", "doc": "this is one relevant document regarding query 3"},
           {"query": "this is a query 4", "doc": "this is one relevant document regarding query 4"}]

dataset = tf.data.Dataset.from_generator(
    lambda: samples, {"query": tf.string, "doc": tf.string})

batched_dataset = dataset.batch(2)

#{
#'doc': <tf.Tensor: shape=(2,), dtype=string, numpy=array(
#     [b'this is one relevant document regarding query 1',
#      b'this is one relevant document regarding query 2'], dtype=object)>,
# 
#'query': <tf.Tensor: shape=(2,), dtype=string, numpy=array(
#     [b'this is a query 1', 
#      b'this is a query 2'], dtype=object)>
#}

and a map function to tokenize this batched_dataset:

def tokenize(sample):
    tokenized_query = tokenizer.batch_encode_plus(sample["query"].numpy().astype('str'), ...)
    tokenized_doc = tokenizer.batch_encode_plus(sample["doc"].numpy().astype('str'), ...)
    return (tokenized_query, tokenized_doc) 

I could tokenize the entire batched_dataset using a for-loop:

for batch in batched_dataset:
    tokenize(batch)
# (
# {'input_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[  101,  2023,  2003,  1037, 23032,  1015,   102,     0],
#          [  101,  2023,  2003,  1037, 23032,  1016,   102,     0]],
#      dtype=int32)>, 
#  'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[1, 1, 1, 1, 1, 1, 1, 0],
#          [1, 1, 1, 1, 1, 1, 1, 0]], dtype=int32)>}, 

# {'input_ids': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[ 101, 2023, 2003, 2028, 7882, 6254, 4953,  102],
#          [ 101, 2023, 2003, 2028, 7882, 6254, 4953,  102]], dtype=int32)>, 
#  'attention_mask': <tf.Tensor: shape=(2, 8), dtype=int32, numpy=
#   array([[1, 1, 1, 1, 1, 1, 1, 1],
#          [1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)>})
#  ...

However, when using tf.data.Dataset.map, the following error arises:

tokenized_dataset = batched_dataset.map(tokenize)
AttributeError: 'Tensor' object has no attribute 'numpy'

Then, how does one properly apply a tokenizer map function to a batched dataset?

Note: I published a working example on Google Colab.
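
For reference, a common TensorFlow workaround for this error is to wrap the eager tokenization in tf.py_function, which makes .numpy() available inside map. Below is a minimal sketch, not taken from the Colab example, assuming a Hugging Face tokenizer such as bert-base-uncased:

import tensorflow as tf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed tokenizer

def tokenize_eager(query, doc):
    # Runs eagerly inside tf.py_function, so .numpy() is available here.
    q = tokenizer.batch_encode_plus([s.decode("utf-8") for s in query.numpy()],
                                    padding=True, truncation=True, return_tensors="tf")
    d = tokenizer.batch_encode_plus([s.decode("utf-8") for s in doc.numpy()],
                                    padding=True, truncation=True, return_tensors="tf")
    return q["input_ids"], q["attention_mask"], d["input_ids"], d["attention_mask"]

def tokenize_map(batch):
    # tf.py_function only returns flat tensors, so repack them into dicts afterwards.
    q_ids, q_mask, d_ids, d_mask = tf.py_function(
        tokenize_eager,
        inp=[batch["query"], batch["doc"]],
        Tout=[tf.int32, tf.int32, tf.int32, tf.int32])
    return ({"input_ids": q_ids, "attention_mask": q_mask},
            {"input_ids": d_ids, "attention_mask": d_mask})

tokenized_dataset = batched_dataset.map(tokenize_map)

Note that tf.py_function runs the Python code outside the traced graph, so it trades some of tf.data's performance for the convenience of calling the tokenizer directly.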

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 12 (4 by maintainers)

Top GitHub Comments

3 reactions
celsofranssa commented, Apr 24, 2020

The ideal would be to follow the pipeline (read from the file >> generate batches >> tokenize >> train >> evaluate). It is the most efficient approach, as pointed out in the TensorFlow tutorial.

When dealing with text, TensorFlow generates string tensors that are stored as byte strings:

<tf.Tensor: shape=(2,), dtype=string, numpy=array(
     [b'Thê first utf-8 string of the batçh.',
      b'Thê secônd utf-8 string of the batçh.'], dtype=object)>

However, I didn’t find an efficient way to decode this kind of tensor into a list of strings. It’s even worse if the byte strings contain non-ASCII characters.

What I really need is one of these two options:

  1. a tokenizer that is able to accept the aforementioned byte string tensor as input; or
  2. a vectorized approach to transforming a byte string tensor into a string list (a rough sketch of a workaround follows this comment).

Thank you very much for all your help.
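
Regarding option 2, here is a rough sketch of one way to turn a byte string tensor into a Python list of str. It decodes element by element via UTF-8 rather than being truly vectorized, but it avoids the non-ASCII problem mentioned above:

import tensorflow as tf

batch = tf.constant(["Thê first utf-8 string of the batçh.",
                     "Thê secônd utf-8 string of the batçh."])

# .numpy() yields an array of bytes objects; decode each one explicitly as UTF-8.
strings = [b.decode("utf-8") for b in batch.numpy()]
print(strings)  # ['Thê first utf-8 string of the batçh.', ...]

This still requires an eager tensor, so inside Dataset.map it would again need the tf.py_function wrapping sketched earlier.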

2 reactions
LysandreJik commented, Apr 23, 2020

This seems like more of a TF-related question than a Transformers-related question. The issue seems to stem from your code trying to get the value of a tensor which is not eager, using numpy. I believe the tf.data.Dataset.map method must trace its inputs, resulting in the tensors not being eager.

Couldn’t you build the tf.data.Dataset with already tokenized inputs instead?
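
A minimal sketch of that suggestion, assuming a Hugging Face tokenizer and reusing the samples list from the question: tokenize the plain Python strings up front, then build the tf.data.Dataset from the already tokenized tensors.

import tensorflow as tf
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed tokenizer

queries = [s["query"] for s in samples]  # plain Python strings, before any tf.data step
docs = [s["doc"] for s in samples]

# Tokenize up front, while the text is still ordinary Python data.
q_enc = tokenizer(queries, padding=True, truncation=True, return_tensors="tf")
d_enc = tokenizer(docs, padding=True, truncation=True, return_tensors="tf")

# Build the dataset from the already tokenized tensors, then batch as before.
dataset = tf.data.Dataset.from_tensor_slices((dict(q_enc), dict(d_enc)))
batched_dataset = dataset.batch(2)

This keeps the whole pipeline traceable by tf.data, at the cost of tokenizing the full dataset in memory before training.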
