Documents meta data within the component life cycle
Apparently, components have no way to act upon text meta data. I think a document, besides the words it consists of, might be highly influenced by its meta data too. It would be useful to be able to build a component’s logic based on meta data.
When does text meta data matter?
For instance, suppose we have news articles. Each article could have tags (politics, lifestyle, world, tech), publication date and titles.
A reasonable task would be to create a component that labels article entities based on the tags of the article. Entities like “core”, “static”, “local” and “dynamic” make more sense for the “tech” tag than for any other one.
Another component might find diseases mentioned in articles, and Coronavirus would make more sense to be found in articles with publication dates after March 2020 (prior to this date, any “Corona” has more chances of describing the beer).
In addition, article titles may also add some importance to their entities.
All of the examples above use data that doesn’t reside inside the text itself, hence the term meta data.
What do we have?
Currently, there doesn’t seem to be an easy way to get any meta data inside the components’ workflow.
The closest we get to working with meta data is the context, which we use with nlp.pipe when setting as_tuples=True. But this happens after all components have already finished their job, which forces us either to write code outside the component convention or to call the components manually (without nlp.add_pipe and all the benefits of disabling pipes).
The component_cfg argument won’t help us pass data per document either. (Moreover, its terminology and usage are component oriented, while meta data should be document oriented.)
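To make the ordering problem concrete, here is a minimal sketch. It does not use spaCy at all: ToyDoc, tagger and toy_pipe_as_tuples are hypothetical stand-ins that only mimic the behavior described above, where the context is paired with the document once every component has already run.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, Iterator, List, Tuple


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a spaCy Doc; not the real class."""
    text: str
    seen_by: List[str] = field(default_factory=list)


def tagger(doc: ToyDoc) -> ToyDoc:
    # A component only receives the doc, so no context is visible here.
    doc.seen_by.append("tagger")
    return doc


def toy_pipe_as_tuples(
    pairs: Iterable[Tuple[str, Any]],
    components: List[Callable[[ToyDoc], ToyDoc]],
) -> Iterator[Tuple[ToyDoc, Any]]:
    """Mimic the ordering described above: components first, context after."""
    for text, context in pairs:
        doc = ToyDoc(text)
        for component in components:
            doc = component(doc)
        # The context is paired with the doc only once processing is done.
        yield doc, context


pairs = [("Corona sales are up", {"tags": ["lifestyle"]})]
results = list(toy_pipe_as_tuples(pairs, [tagger]))
```

In this toy model the caller gets the tags back next to the finished document, but tagger never had a chance to look at them.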
Solution
Please see my fork and compare. The texts_meta option in nlp.pipe assigns the meta data to the Doc right before the components run. That means the components are able to build logic based on the meta data.
Example:
import datetime
article1 = article2 = article3 = article4 = 'This is just a test article'
date1 = date2 = date3 = date4 = datetime.datetime(2020, 7, 2)
texts = [article1, article2, article3, article4]
texts_meta = [
    {'date': date1, 'tags': ['politics']},
    {'date': date2, 'tags': ['lifestyle']},
    {'date': date3, 'tags': ['world']},
    {'date': date4, 'tags': ['tech', 'hardware']},
]
for doc in nlp.pipe(texts, texts_meta=texts_meta):  # now components can access text meta data
    meta = doc.text_meta  # it's also possible to access text meta after the components' life cycle
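To illustrate what a component could do once the meta data is attached up front, here is a toy sketch of the disease example from above. It is not the code from my fork and it does not call spaCy: ToyDoc, corona_labeler and toy_pipe are hypothetical names, and the cutoff of 1 March 2020 is an assumption made only for the illustration.

```python
import datetime
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, Iterator, List


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Doc that carries a text_meta property."""
    text: str
    text_meta: Dict[str, Any] = field(default_factory=dict)
    labels: List[str] = field(default_factory=list)


def corona_labeler(doc: ToyDoc) -> ToyDoc:
    # Assumed rule: "Corona" is a disease only from 1 March 2020 onwards.
    if "Corona" in doc.text:
        date = doc.text_meta.get("date")
        is_recent = date is not None and date >= datetime.datetime(2020, 3, 1)
        doc.labels.append("DISEASE" if is_recent else "BEER")
    return doc


def toy_pipe(
    texts: Iterable[str],
    texts_meta: Iterable[Dict[str, Any]],
    components: List[Callable[[ToyDoc], ToyDoc]],
) -> Iterator[ToyDoc]:
    for text, meta in zip(texts, texts_meta):
        # The meta data is attached before any component runs.
        doc = ToyDoc(text, text_meta=meta)
        for component in components:
            doc = component(doc)
        yield doc


texts = ["Corona cases are rising", "Corona sales are rising"]
texts_meta = [
    {"date": datetime.datetime(2020, 7, 2)},
    {"date": datetime.datetime(2019, 7, 2)},
]
docs = list(toy_pipe(texts, texts_meta, [corona_labeler]))
```

The same word ends up with a different label purely because of the publication date, which is the kind of document-oriented logic the proposal is meant to enable.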
Side notes
- In my solution, I added a text_meta property to the Doc. It is possible to leave the Doc untouched and use the existing user_data property instead, but its usage wasn’t clear to me, and there might be conflicts.
- I wonder whether adding a user_ prefix, as in user_text_meta, would fit the convention better, to denote that the user is in charge of the content of this property (and it is not being automatically populated by the library).
- Do you think context and as_tuples might become redundant because of it, or do they still serve another purpose?
- I’ll be glad to hear your opinions, and whether I’ve missed something on the way.
- If it seems like a good idea, I will be honored to fine-tune the code and make a pull request.
What do you think?
Issue Analytics
- Created 3 years ago
- Comments:10 (6 by maintainers)
Top GitHub Comments
Hello @adrianeboyd, thank you for your response. It’s clever, and it gets the job done.
Don’t you see any benefit in including this meta behavior as an integrated part of the language class, as in my solution? The same goes for as_tuples and the new option I added, use_tuples_as_texts_meta. Just curious to hear your opinion. Thanks!
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.