Associate custom data to Doc objects to be used when running custom components
Hi Matthew and Ines,
I checked the documentation and the code, but I'm not able to find a way to associate data with Doc objects that are created using either nlp(my_text) or Doc(...).
In a nutshell, I have many paragraphs, each with a lot of data that I know must be associated with them. I also have a custom component that needs this data to perform computations on the associated text. However, it seems this data cannot be attached to Doc objects at creation time.
This is my pseudocode to exemplify the use case:
    # Register Doc extensions (metadata_a, metadata_b, ... with defaults)
    set_doc_metadata_extensions(resources)

    for document in documents:
        for paragraph in document:
            metadata = get_metadata(paragraph)
            doc = nlp(paragraph)  # I want to associate the metadata here, since this is the point at which all components are run
            # Associate their values to be used in a custom component
            doc._.metadata_a = value
            ...
            doc._.metadata_b = value
Obviously, accessing the various metadata_* attributes from the custom component gives me the default value for each metadata field, because the values are only assigned after the pipeline has already run. I cannot set them beforehand, since the Doc object doesn't exist yet. I also tried the user_data parameter of the Doc object, but it leads to the same result.
Is there a workaround to get the expected result? If not, this could be an interesting feature request, since the use case seems very common in my experience.
Thank you!
Your Environment
- Operating System: macOS Sierra
- Python Version Used: 2.6
- spaCy Version Used: 2.0.10
Issue Analytics
- Created: 5 years ago
- Comments: 5 (1 by maintainers)
Hi @SandeepNaidu, thank you for your reply. I solved it using your first suggestion: after registering the custom attributes, I created a Doc object using the make_doc function, assigned the custom data to it, and finally ran each pipeline component myself. This way every component can access the "a priori" data, even the first component in the pipeline. For anyone who faces the same problem in the future, that workflow is the solution.
It would be great if this scenario were well documented somewhere. Thank you very much indeed!
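A minimal sketch of that workflow, assuming an illustrative extension name (metadata_a), a blank English pipeline, and a hypothetical component function (the original snippet was not preserved in this archive):

```python
import spacy
from spacy.tokens import Doc

# Register the extension once at startup; "metadata_a" is an illustrative name
if not Doc.has_extension("metadata_a"):
    Doc.set_extension("metadata_a", default=None)

def metadata_component(doc):
    # By the time this component runs, the metadata is already on the Doc
    print("component sees:", doc._.metadata_a)
    return doc

nlp = spacy.blank("en")

# 1. Create the Doc without running any pipeline components
doc = nlp.make_doc("Some paragraph text.")
# 2. Attach the metadata first
doc._.metadata_a = {"source": "chapter-1"}
# 3. Then run each pipeline component by hand, plus the custom one
for name, proc in nlp.pipeline:
    doc = proc(doc)
doc = metadata_component(doc)
```

Because make_doc only tokenizes, the extension values are in place before any component executes, so even the first component in the pipeline can read them.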
Hi Alan,
Did you try using nlp.make_doc and then sending the result through the pipeline? If you can propagate the data, you can write a custom pipeline component and position it after sent or sbd so that you can assign the properties you want there. Otherwise, before you call the pipeline, create a Doc, and then send that Doc object through the pipeline yourself.