
Batch does not carry index


Use case: replace_unk. Most strategies for replacing <unk> tokens rely on aligning with the source sequence as it was before numericalization.

Problem: With the Batch object, you cannot retrieve the original text as it was before padding and numericalization. No indexes are stored with the batch, so there is no way to look up the original text in the dataset.

Quick workaround: Define an ‘index’ field in the dataset. While building your dataset, pass in an index for each item.

Batch will then allow you to look up an index attribute.
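
The workaround above can be sketched in plain Python. This is only an illustration of the idea, not torchtext code: Example, Batch, build_vocab, make_batch and restore_unk are hypothetical stand-ins, and the vocabulary is deliberately tiny so that some tokens become <unk>.

```python
from dataclasses import dataclass
from typing import Dict, List

UNK, PAD = "<unk>", "<pad>"


@dataclass
class Example:
    index: int         # position of this item in the dataset
    tokens: List[str]  # original text, before numericalization


@dataclass
class Batch:
    index: List[int]      # dataset indexes carried alongside the ids
    ids: List[List[int]]  # padded, numericalized rows


def build_vocab(examples: List[Example], max_size: int) -> List[str]:
    # Keep the first max_size distinct tokens; the rest will map to <unk>.
    itos = [UNK, PAD]
    for ex in examples:
        for tok in ex.tokens:
            if tok not in itos and len(itos) < max_size + 2:
                itos.append(tok)
    return itos


def make_batch(examples: List[Example], stoi: Dict[str, int]) -> Batch:
    # Numericalize and right-pad each row, keeping each item's index.
    width = max(len(ex.tokens) for ex in examples)
    ids = [
        [stoi.get(t, stoi[UNK]) for t in ex.tokens]
        + [stoi[PAD]] * (width - len(ex.tokens))
        for ex in examples
    ]
    return Batch(index=[ex.index for ex in examples], ids=ids)


def restore_unk(batch: Batch, dataset: List[Example], itos: List[str]) -> List[List[str]]:
    # Use batch.index to find the original tokens and fill in each <unk>.
    restored = []
    for idx, row in zip(batch.index, batch.ids):
        original = dataset[idx].tokens
        restored.append([
            original[pos] if itos[i] == UNK else itos[i]
            for pos, i in enumerate(row)
            if itos[i] != PAD
        ])
    return restored


dataset = [Example(0, ["the", "cat", "sat"]), Example(1, ["a", "zebra"])]
itos = build_vocab(dataset, max_size=3)
stoi = {tok: i for i, tok in enumerate(itos)}

# Batches are usually shuffled, so the order differs from the dataset order.
batch = make_batch([dataset[1], dataset[0]], stoi)
restored = restore_unk(batch, dataset, itos)
```

Here the lookup is applied to the source side only. In a translation setting the same batch.index lookup would supply the source tokens that an alignment or attention step points to.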

Issue Analytics

Created 6 years ago
Comments: 5 (3 by maintainers)

Top GitHub Comments

honnibal commented, Aug 7, 2017

We could only do this if all info in a Doc object is uniquely determined by the vocabulary index, which would somewhat defeat the purpose of the Doc here.

Tokenization is fully reversible if you have (orth_id, has_space) pairs. If you wanted a single sequence of ints, you would double the number of entries in the vocab in theory. Of course the extra bit introduces little extra entropy given the word ID.

So, spaCy’s tokenizers are already fully reversible. You could use them as an internal mechanism to solve this, if you like 😃. It doesn’t have to change your user-facing API, I don’t think.
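
As a rough illustration of the (orth_id, has_space) idea, here is a minimal sketch in plain Python. It is not spaCy's tokenizer: tokenize, detokenize, encode and decode are hypothetical helpers, and the round trip only holds for text with single spaces and no leading whitespace.

```python
import re
from typing import List, Tuple

Pair = Tuple[str, bool]


def tokenize(text: str) -> List[Pair]:
    # Each token is a word or a single punctuation mark, plus a flag
    # saying whether exactly one space followed it in the input.
    return [
        (m.group(1), bool(m.group(2)))
        for m in re.finditer(r"(\w+|[^\w\s])( ?)", text)
    ]


def detokenize(pairs: List[Pair]) -> str:
    return "".join(orth + (" " if has_space else "") for orth, has_space in pairs)


def encode(pairs: List[Pair], stoi: dict) -> List[int]:
    # Fold the has_space bit into the id, which doubles the id space.
    return [2 * stoi[orth] + int(has_space) for orth, has_space in pairs]


def decode(codes: List[int], itos: List[str]) -> List[Pair]:
    return [(itos[code // 2], bool(code % 2)) for code in codes]


text = "Hello, world! It works."
pairs = tokenize(text)

itos = sorted({orth for orth, _ in pairs})
stoi = {orth: i for i, orth in enumerate(itos)}

codes = encode(pairs, stoi)
round_trip = detokenize(decode(codes, itos))
```

The single-int encoding is the "double the number of entries in the vocab" option from the comment: every orth_id gets two codes, one with and one without a trailing space.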

In general integrating more with the spaCy core APIs is not a bad idea, but it would also make for a lot of breaking changes just as torchtext is picking up a little adoption. We could go the other way and add PyTorch support to spaCy, for the (likely few?) situations where thinc doesn’t work, though!

I’m planning to add PyTorch tensors as a back-end option for thinc, in addition to Cupy. I also need to write examples of hooking PyTorch models into spaCy.

While I’m here: is it easy to pass a gradient back to a PyTorch model? Most libraries seem to communicate by loss, which makes it harder to compose them with models outside the library.

jekbradbury commented, Aug 7, 2017

For passing a gradient back to PyTorch, var.backward has an optional grad_output argument that allows you to inject a gradient in a specific place in the computation graph. If you want to inject several gradients, you can use torch.autograd.backward((var_1, var_2), (grad_1, grad_2)), I believe.
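
A minimal sketch of both calls, assuming a recent PyTorch release where tensors have replaced Variable. In the versions I am aware of, the keyword on backward is spelled gradient rather than grad_output, so that name is used below. The model, shapes and gradient values are made up for illustration.

```python
import torch

# Hypothetical setup: a small model whose output is consumed by code
# outside PyTorch, which hands back d(loss)/d(output) instead of a loss.
model = torch.nn.Linear(3, 2)
x = torch.ones(4, 3)
out = model(x)  # shape (4, 2), part of the autograd graph

# Stand-in for a gradient computed elsewhere; it must match out's shape.
external_grad = torch.full_like(out, 0.5)

# Inject the gradient at `out`; autograd propagates it to the parameters.
out.backward(gradient=external_grad)
single_weight_grad = model.weight.grad.clone()

# Injecting several gradients at once with torch.autograd.backward.
model.zero_grad()
h = model(x)
out_a = h * 2          # shape (4, 2)
out_b = h.sum(dim=1)   # shape (4,)
torch.autograd.backward(
    (out_a, out_b),
    (torch.ones_like(out_a), torch.ones_like(out_b)),
)
multi_weight_grad = model.weight.grad.clone()
```

No scalar loss is ever formed on the PyTorch side, which is the composition pattern the question asks about.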