Batch does not carry index

Use case: replace_unk. Most strategies for replacing <unk> tokens rely on aligning the output with the source sequence as it was before numericalization.

Problem: From a Batch object, you cannot retrieve the original text as it was before padding and numericalization. No indexes are stored with the batch, so there is no way to look up the corresponding examples in the dataset.

Quick workaround: Define an ‘index’ field in the dataset. While building the dataset, pass in an index for each item.

Batch will then expose an index attribute that you can use to look up the original examples.
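
The workaround can be sketched without torchtext at all. The snippet below is a minimal illustration, not torchtext's API: the `Example` and `Batch` classes, the `numericalize` and `restore_tokens` helpers, and the toy vocabulary are all hypothetical names made up for this sketch. It also assumes that an example's index equals its position in the dataset list, and it uses a trivial position-by-position alignment in place of a real replace_unk strategy.

```python
from dataclasses import dataclass
from typing import Dict, List

UNK, PAD = "<unk>", "<pad>"


@dataclass
class Example:
    index: int          # position of this example in the dataset
    tokens: List[str]   # original text, before numericalization


@dataclass
class Batch:
    indices: List[int]      # carried along so the source text can be recovered
    ids: List[List[int]]    # padded, numericalized tokens


def numericalize(examples: List[Example], vocab: Dict[str, int]) -> Batch:
    """Map tokens to ids, pad to the longest example, and keep the indices."""
    max_len = max(len(e.tokens) for e in examples)
    rows = []
    for e in examples:
        row = [vocab.get(tok, vocab[UNK]) for tok in e.tokens]
        row += [vocab[PAD]] * (max_len - len(row))
        rows.append(row)
    return Batch(indices=[e.index for e in examples], ids=rows)


def restore_tokens(batch: Batch, dataset: List[Example],
                   vocab: Dict[str, int]) -> List[List[str]]:
    """Replace each <unk> with the source token at the same position."""
    itos = {i: tok for tok, i in vocab.items()}
    restored = []
    for index, row in zip(batch.indices, batch.ids):
        source = dataset[index].tokens  # assumes index == list position
        out = []
        for pos, tok_id in enumerate(row):
            if tok_id == vocab[PAD]:
                break  # padding only appears at the end of a row
            tok = itos[tok_id]
            out.append(source[pos] if tok == UNK else tok)
        restored.append(out)
    return restored


vocab = {PAD: 0, UNK: 1, "the": 2, "a": 3, "cat": 4, "sleeps": 5}
dataset = [
    Example(0, ["the", "zyzzyva", "sleeps"]),  # "zyzzyva" is out of vocabulary
    Example(1, ["a", "cat"]),
]

# Batches rarely preserve dataset order, so reverse it here on purpose.
batch = numericalize(dataset[::-1], vocab)
restored = restore_tokens(batch, dataset, vocab)
```

Because the batch carries `indices`, the out-of-vocabulary token can be recovered even though the batch order differs from the dataset order.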

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
honnibal commented, Aug 7, 2017

We could only do this if all info in a Doc object is uniquely determined by the vocabulary index, which would somewhat defeat the purpose of the Doc here.

Tokenization is fully reversible if you have (orth_id, has_space) pairs. If you wanted a single sequence of ints, you would double the number of entries in the vocab in theory. Of course the extra bit introduces little extra entropy given the word ID.

So, spaCy’s tokenizers are already fully reversible. You could use them as an internal mechanism to solve this, if you like 😃. It doesn’t have to change your user-facing API, I don’t think.
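
To make the (orth_id, has_space) idea concrete, here is a small sketch. It is not spaCy's tokenizer: the regex-based `tokenize`, the `detokenize`, `encode` and `decode` helpers, and the sample sentence are all assumptions made for illustration. It also assumes the text has no leading whitespace and uses only single spaces between tokens. The single-integer encoding uses `2 * orth_id + has_space`, which is one way to read "double the number of entries in the vocab".

```python
import re
from typing import Dict, List, Tuple

Pair = Tuple[str, bool]  # (orth, has_space)


def tokenize(text: str) -> List[Pair]:
    """Split into words and punctuation, recording whether a space follows."""
    pairs = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        end = match.end()
        has_space = end < len(text) and text[end] == " "
        pairs.append((match.group(), has_space))
    return pairs


def detokenize(pairs: List[Pair]) -> str:
    """Rebuild the original string from (orth, has_space) pairs."""
    return "".join(orth + (" " if has_space else "") for orth, has_space in pairs)


def encode(pairs: List[Pair], orth_to_id: Dict[str, int]) -> List[int]:
    """Fold the has_space bit into the id: two codes per vocabulary entry."""
    return [2 * orth_to_id[orth] + int(has_space) for orth, has_space in pairs]


def decode(codes: List[int], id_to_orth: Dict[int, str]) -> List[Pair]:
    """Invert encode(): the low bit is has_space, the rest is the orth id."""
    return [(id_to_orth[code // 2], bool(code % 2)) for code in codes]


text = "Hello, world! It works."
pairs = tokenize(text)

orth_to_id = {orth: i for i, orth in enumerate(sorted({o for o, _ in pairs}))}
id_to_orth = {i: orth for orth, i in orth_to_id.items()}

codes = encode(pairs, orth_to_id)
round_trip = detokenize(decode(codes, id_to_orth))
```

Here `round_trip` equals `text`, so the single sequence of integers is enough to reconstruct the original string under the stated whitespace assumptions.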

In general integrating more with the spaCy core APIs is not a bad idea, but it would also make for a lot of breaking changes just as torchtext is picking up a little adoption. We could go the other way and add PyTorch support to spaCy, for the (likely few?) situations where thinc doesn’t work, though!

I’m planning to add PyTorch tensors as a back-end option for thinc, in addition to Cupy. I also need to write examples of hooking PyTorch models into spaCy.

While I’m here: is it easy to pass a gradient back to a PyTorch model? Most libraries seem to communicate by loss, which makes it harder to compose them with models outside the library.

0 reactions
jekbradbury commented, Aug 7, 2017

For passing a gradient back to PyTorch, var.backward has an optional grad_output argument that lets you inject a gradient at a specific place in the computation graph. If you want to inject several gradients, you can use torch.autograd.backward((var_1, var_2), (grad_1, grad_2)), I believe.
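
A minimal sketch of both calls follows, assuming a recent PyTorch release where tensors have replaced `Variable`. In the releases I am aware of, the optional argument to `backward` is named `gradient` rather than `grad_output`. The tensor names and values are made up for illustration.

```python
import torch

# A leaf tensor standing in for a model parameter.
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# A non-scalar intermediate result somewhere in the computation graph.
y = w * 2.0

# Gradient of some external loss with respect to y, computed outside PyTorch.
external_grad = torch.tensor([0.1, 0.2, 0.3])

# Inject the gradient at y; autograd applies the chain rule back to w,
# so w.grad becomes 2.0 * external_grad.
y.backward(gradient=external_grad)

# Injecting several gradients at once with torch.autograd.backward.
a = torch.tensor([1.0, 1.0], requires_grad=True)
out_1 = a * 3.0
out_2 = a * a
grad_1 = torch.tensor([1.0, 1.0])
grad_2 = torch.tensor([0.5, 0.5])

# Contributions from both outputs accumulate in a.grad:
# 3.0 * grad_1 + 2.0 * a * grad_2.
torch.autograd.backward((out_1, out_2), (grad_1, grad_2))
```

This lets a model outside PyTorch hand back a gradient directly, without both sides having to agree on a shared scalar loss.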
