Associate custom data to Doc objects to be used when running custom components
Hi Matthew and Ines,
I checked the documentation and the code, but I'm not able to find a way to associate data with Doc objects that are created using either nlp(my_text) or Doc(...).
In a nutshell, I have many paragraphs, each with a lot of data that I know must be associated with them. I also have a custom component that needs this data to perform computations on the associated text. However, it seems this data cannot be attached to Doc objects at creation time.
This is my pseudocode to exemplify the use case:
    # Register Doc extensions (metadata_a, metadata_b, ... with defaults)
    set_doc_metadata_extensions(resources)

    for document in documents:
        for paragraph in document:
            metadata = get_metadata(paragraph)
            doc = nlp(paragraph)  # I want to associate the metadata here, since this is the point at which all components are run
            # Associate their values to be used in a custom component
            doc._.metadata_a = value
            ...
            doc._.metadata_b = value
Obviously, accessing the various metadata_* attributes from the custom component gives me the default value for each metadata field, because the values are only assigned after the pipeline has already run. I cannot set them beforehand, since the Doc object doesn't exist yet. I also tried the user_data parameter of the Doc object, but it leads to the same result.
Is there a workaround to get the expected result? If not, this could be an interesting feature request, since the use case seems very common in my experience.
Thank you!
Your Environment
- Operating System: macOS Sierra
- Python Version Used: 2.6
- spaCy Version Used: 2.0.10
Issue Analytics
- Created: 5 years ago
- Comments: 5 (1 by maintainers)
Hi @SandeepNaidu, thank you for your reply. I solved it using your first suggestion: after registering the custom attributes, I created a Doc object using the make_doc function, assigned the custom data to it, and finally ran each pipeline component myself. This way every component can access the "a priori" data, even the first component in the pipeline. For anyone who faces the same problem in the future, that workflow is the solution.
It would be great if this scenario were well documented somewhere. Thank you very much indeed!
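A minimal sketch of that workflow, assuming an illustrative extension name (metadata_a), a blank English pipeline, and a hypothetical component function (the original snippet was not preserved in this archive):

```python
import spacy
from spacy.tokens import Doc

# Register the extension once at startup; "metadata_a" is an illustrative name
if not Doc.has_extension("metadata_a"):
    Doc.set_extension("metadata_a", default=None)

def metadata_component(doc):
    # By the time this component runs, the metadata is already on the Doc
    print("component sees:", doc._.metadata_a)
    return doc

nlp = spacy.blank("en")

# 1. Create the Doc without running any pipeline components
doc = nlp.make_doc("Some paragraph text.")
# 2. Attach the metadata first
doc._.metadata_a = {"source": "chapter-1"}
# 3. Then run each pipeline component by hand, plus the custom one
for name, proc in nlp.pipeline:
    doc = proc(doc)
doc = metadata_component(doc)
```

Because make_doc only tokenizes, the extension values are in place before any component executes, so even the first component in the pipeline can read them.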
Hi Alan,
Did you try using nlp.make_doc and then sending the result through the pipeline? If you can propagate the data, you can write a custom pipeline component and position it after sent or sbd so that you can assign the properties you want there. Otherwise, before you call the pipeline, create a Doc, and then send that Doc object through the pipeline yourself.