Example 02_transformers_tagger_bert.ipynb is broken.
Hi, the example 02_transformers_tagger_bert.ipynb is currently broken at model.initialize(...). Here is the stack trace:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-33-5f258b534bc1> in <module>()
5 dev_Y = list(map(model.ops.asarray, dev_Y)) # convert to cupy if needed
6
----> 7 model.initialize(X=train_X[:5], Y=train_Y[:5])
5 frames
/usr/local/lib/python3.6/dist-packages/thinc/model.py in initialize(self, X, Y)
277 validate_fwd_input_output(self.name, self._func, X, Y)
278 if self._init is not None:
--> 279 self._init(self, X=X, Y=Y)
280 return self
281
/usr/local/lib/python3.6/dist-packages/thinc/layers/chain.py in init(model, X, Y)
78 layer.initialize(X=curr_input)
79 if curr_input is not None:
---> 80 curr_input = layer.predict(curr_input)
81 if model.layers[0].has_dim("nI"):
82 model.set_dim("nI", model.layers[0].get_dim("nI"))
/usr/local/lib/python3.6/dist-packages/thinc/model.py in predict(self, X)
293 only the output, instead of the `(output, callback)` tuple.
294 """
--> 295 return self._func(self, X, is_train=False)[0]
296
297 def finish_update(self, optimizer: Optimizer) -> None:
<ipython-input-12-3b8e8a6a55e1> in forward(model, texts, is_train)
14 return_attention_masks=True,
15 return_input_lengths=True,
---> 16 return_tensors="pt",
17 )
18 return TokensPlus(**token_data), lambda d_tokens: []
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, max_length, stride, truncation_strategy, pad_to_max_length, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, **kwargs)
1725 tokens = [tokens]
1726 else:
-> 1727 tokens = self._tokenizer.encode_batch(batch_text_or_text_pairs)
1728
1729 # Convert encoding to dict
/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in encode_batch(self, sequences)
152 A list of Encoding
153 """
--> 154 return self._tokenizer.encode_batch(sequences)
155
156 def decode(self, ids: List[int], skip_special_tokens: Optional[bool] = True) -> str:
Exception: Input must be a list[str] or list[(str, str)]
I believe something changed in the HuggingFace tokenizer, but I'm not sure.
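To make the error concrete: the fast tokenizer's batch encoder only accepts plain strings or string pairs, while the notebook passes pre-tokenized word lists. Something like the rough sketch below goes through (the checkpoint name is an assumption; use whatever the notebook configures), but it throws away the original token boundaries, which is exactly why the alignment discussion further down matters.

from transformers import AutoTokenizer

# Assumed checkpoint; substitute the model the notebook actually configures.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# train_X holds lists of words (pre-tokenized sentences), as in the notebook.
texts = [" ".join(words) for words in train_X[:5]]

token_data = tokenizer.batch_encode_plus(
    texts,                      # plain strings, which encode_batch accepts
    add_special_tokens=True,
    pad_to_max_length=True,
    return_attention_mask=True,
    return_tensors="pt",
)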
Looking at this again, I’m not 100% sure how the original code was supposed to work. I think there was actually a mistake and it was acting as though each token in the corpus would receive exactly one wordpiece? This obviously won’t produce decent results. We need to calculate some sort of alignment.
We can use @tamuhey’s tokenizations library to get the alignment; that’s no problem. But I want a more elegant way to actually apply it…
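For reference, a rough sketch of what getting that alignment looks like with the tokenizations package (pytokenizations); get_alignments is, as far as I know, the function it exposes, returning for each token on one side the indices of the overlapping tokens on the other side:

import tokenizations  # pytokenizations, @tamuhey's library

words = ["Obama", "visited", "Washington"]
wordpieces = ["obama", "visited", "washing", "##ton"]

# word2wp[i] lists the wordpiece indices overlapping word i; wp2word is the reverse.
word2wp, wp2word = tokenizations.get_alignments(words, wordpieces)
# word2wp should come out as [[0], [1], [2, 3]]: the last word maps to two
# wordpieces, which is the case the original notebook code did not handle.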
But even after years of numpy I still get incredibly frustrated trying to get the right magic for indexing. Honestly it's the unhappiest I ever am while programming; I wish I could just write the stupid loop in Cython. The logic is always completely trivial. Here's what we want (the forward and backward passes are spelled out further down).
In Cython this would be something like the plain loop below:
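(Shown in ordinary Python rather than Cython; the array names are assumptions: tensor is the padded transformer output of shape (batch, seq, dim), mask is a boolean (batch, seq) array marking rows that belong to real tokens, and lengths gives the number of real tokens per sequence.)

import numpy

def gather_token_vectors(tensor, mask, lengths):
    # Forward direction: pull the unmasked rows out of the padded tensor,
    # producing one Floats2d per original sequence.
    outputs = []
    for i, n_tokens in enumerate(lengths):
        rows = numpy.zeros((n_tokens, tensor.shape[2]), dtype=tensor.dtype)
        k = 0
        for j in range(tensor.shape[1]):
            if mask[i, j]:
                rows[k] = tensor[i, j]
                k += 1
        outputs.append(rows)
    return outputs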
Anyway, here’s where I got to. I figured if I calculated a 2d mask there would be some way to set the rows in one go. But I can’t seem to figure out how to do it (╯°□°)╯︵ ┻━┻
We want to take the 3d tensor produced by the transformer and select a 3d subtensor according to the mask. We then want to split it using the lengths of the original inputs, so that we get a List[Floats2d] that matches up to the labels. Then we need to do the opposite in the backward pass: take the List[Floats2d] of the gradients, pad it, and then use the mask to create an expanded tensor that matches the transformer output.

Thanks for the report, I'll look into it
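For what it's worth, here is a minimal numpy sketch of the mask-and-split idea described above (an illustration with assumed array names, not the implementation that ended up in thinc): boolean indexing with the 2d mask pulls out all the unmasked rows as a single 2d array, numpy.split cuts that back into per-sequence pieces, and the backward pass assigns through the same mask.

import numpy

def split_by_mask(tensor, mask, lengths):
    # tensor: (batch, seq, dim); mask: (batch, seq) bool; lengths: real tokens per sequence.
    selected = tensor[mask]                    # shape (sum(lengths), dim)
    boundaries = numpy.cumsum(lengths)[:-1]
    return numpy.split(selected, boundaries)   # List[Floats2d], aligned with the labels

def expand_by_mask(d_outputs, mask, tensor_shape):
    # Backward pass: scatter the per-sequence gradients back into a padded
    # tensor with the same shape as the transformer output.
    d_tensor = numpy.zeros(tensor_shape, dtype=d_outputs[0].dtype)
    d_tensor[mask] = numpy.concatenate(d_outputs)
    return d_tensor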