Example 02_transformers_tagger_bert.ipynb is broken.
Hi, the example 02_transformers_tagger_bert.ipynb is currently broken at model.initialize(...). Here is the stack trace:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-33-5f258b534bc1> in <module>()
5 dev_Y = list(map(model.ops.asarray, dev_Y)) # convert to cupy if needed
6
----> 7 model.initialize(X=train_X[:5], Y=train_Y[:5])
5 frames
/usr/local/lib/python3.6/dist-packages/thinc/model.py in initialize(self, X, Y)
277 validate_fwd_input_output(self.name, self._func, X, Y)
278 if self._init is not None:
--> 279 self._init(self, X=X, Y=Y)
280 return self
281
/usr/local/lib/python3.6/dist-packages/thinc/layers/chain.py in init(model, X, Y)
78 layer.initialize(X=curr_input)
79 if curr_input is not None:
---> 80 curr_input = layer.predict(curr_input)
81 if model.layers[0].has_dim("nI"):
82 model.set_dim("nI", model.layers[0].get_dim("nI"))
/usr/local/lib/python3.6/dist-packages/thinc/model.py in predict(self, X)
293 only the output, instead of the `(output, callback)` tuple.
294 """
--> 295 return self._func(self, X, is_train=False)[0]
296
297 def finish_update(self, optimizer: Optimizer) -> None:
<ipython-input-12-3b8e8a6a55e1> in forward(model, texts, is_train)
14 return_attention_masks=True,
15 return_input_lengths=True,
---> 16 return_tensors="pt",
17 )
18 return TokensPlus(**token_data), lambda d_tokens: []
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, max_length, stride, truncation_strategy, pad_to_max_length, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, **kwargs)
1725 tokens = [tokens]
1726 else:
-> 1727 tokens = self._tokenizer.encode_batch(batch_text_or_text_pairs)
1728
1729 # Convert encoding to dict
/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py in encode_batch(self, sequences)
152 A list of Encoding
153 """
--> 154 return self._tokenizer.encode_batch(sequences)
155
156 def decode(self, ids: List[int], skip_special_tokens: Optional[bool] = True) -> str:
Exception: Input must be a list[str] or list[(str, str)]
I believe something changed in the HuggingFace tokenizer, but I'm not sure.
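To make the error concrete: the fast tokenizer's batch encoder only accepts plain strings or string pairs, while the notebook passes pre-tokenized word lists. Something like the rough sketch below goes through (the checkpoint name is an assumption; use whatever the notebook configures), but it throws away the original token boundaries, which is exactly why the alignment discussion further down matters.

from transformers import AutoTokenizer

# Assumed checkpoint; substitute the model the notebook actually configures.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# train_X holds lists of words (pre-tokenized sentences), as in the notebook.
texts = [" ".join(words) for words in train_X[:5]]

token_data = tokenizer.batch_encode_plus(
    texts,                      # plain strings, which encode_batch accepts
    add_special_tokens=True,
    pad_to_max_length=True,
    return_attention_mask=True,
    return_tensors="pt",
)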
Looking at this again, I’m not 100% sure how the original code was supposed to work. I think there was actually a mistake and it was acting as though each token in the corpus would receive exactly one wordpiece? This obviously won’t produce decent results. We need to calculate some sort of alignment.
We can use @tamuhey’s tokenizations library to get the alignment; that’s no problem. But I want a more elegant way to actually apply it…
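For reference, a rough sketch of what getting that alignment looks like with the tokenizations package (pytokenizations); get_alignments is, as far as I know, the function it exposes, returning for each token on one side the indices of the overlapping tokens on the other side:

import tokenizations  # pytokenizations, @tamuhey's library

words = ["Obama", "visited", "Washington"]
wordpieces = ["obama", "visited", "washing", "##ton"]

# word2wp[i] lists the wordpiece indices overlapping word i; wp2word is the reverse.
word2wp, wp2word = tokenizations.get_alignments(words, wordpieces)
# word2wp should come out as [[0], [1], [2, 3]]: the last word maps to two
# wordpieces, which is the case the original notebook code did not handle.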
But even after years of numpy I still get incredibly frustrated trying to get the right magic for indexing. Honestly it's the unhappiest I ever am while programming; I wish I could just write the stupid loop in Cython. The logic is always completely trivial. Here's what we want (the forward and backward passes are spelled out further down).
In Cython this would be something like the plain loop below:
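(Shown in ordinary Python rather than Cython; the array names are assumptions: tensor is the padded transformer output of shape (batch, seq, dim), mask is a boolean (batch, seq) array marking rows that belong to real tokens, and lengths gives the number of real tokens per sequence.)

import numpy

def gather_token_vectors(tensor, mask, lengths):
    # Forward direction: pull the unmasked rows out of the padded tensor,
    # producing one Floats2d per original sequence.
    outputs = []
    for i, n_tokens in enumerate(lengths):
        rows = numpy.zeros((n_tokens, tensor.shape[2]), dtype=tensor.dtype)
        k = 0
        for j in range(tensor.shape[1]):
            if mask[i, j]:
                rows[k] = tensor[i, j]
                k += 1
        outputs.append(rows)
    return outputs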
Anyway, here’s where I got to. I figured if I calculated a 2d mask there would be some way to set the rows in one go. But I can’t seem to figure out how to do it (╯°□°)╯︵ ┻━┻
We want to take the 3d tensor produced by the transformer and select a 3d subtensor according to the mask. We then want to split it using the lengths of the original inputs, so that we get a List[Floats2d] that matches up to the labels. Then we need to do the opposite in the backward pass: take the List[Floats2d] of the gradients, pad it, and then use the mask to create an expanded tensor that matches the transformer output.

Thanks for the report, I'll look into it
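For what it's worth, here is a minimal numpy sketch of the mask-and-split idea described above (an illustration with assumed array names, not the implementation that ended up in thinc): boolean indexing with the 2d mask pulls out all the unmasked rows as a single 2d array, numpy.split cuts that back into per-sequence pieces, and the backward pass assigns through the same mask.

import numpy

def split_by_mask(tensor, mask, lengths):
    # tensor: (batch, seq, dim); mask: (batch, seq) bool; lengths: real tokens per sequence.
    selected = tensor[mask]                    # shape (sum(lengths), dim)
    boundaries = numpy.cumsum(lengths)[:-1]
    return numpy.split(selected, boundaries)   # List[Floats2d], aligned with the labels

def expand_by_mask(d_outputs, mask, tensor_shape):
    # Backward pass: scatter the per-sequence gradients back into a padded
    # tensor with the same shape as the transformer output.
    d_tensor = numpy.zeros(tensor_shape, dtype=d_outputs[0].dtype)
    d_tensor[mask] = numpy.concatenate(d_outputs)
    return d_tensor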