Documents meta data within the component life cycle

Apparently, components cannot act on text meta data. I think a document, besides the words it consists of, can be strongly influenced by its meta data too. It would be useful to be able to build a component's logic based on meta data.

When does text meta data matter?

For instance, suppose we have news articles. Each article could have tags (politics, lifestyle, world, tech), a publication date and a title.

A reasonable task would be to create a component that labels article entities based on the tags of the article. Entities like “core”, “static”, “local” and “dynamic” make more sense for the “tech” tag than for any other.

Another component might find diseases mentioned in articles, and Coronavirus would make more sense to be found in articles published after March 2020 (prior to this date, any “Corona” has more chances of describing the beer).

In addition, an article's title may add weight to the entities it mentions.

All of the examples above use data that doesn’t reside inside the text itself, hence the term meta data.

What do we have?

Currently, there doesn’t seem to be an easy way to get any meta data inside the components’ workflow.

The closest we get to working with meta data is the context, which we use with nlp.pipe when setting as_tuples=True. But the context only becomes available after all components have already finished their job, which forces us either to write code outside the component convention, or to call the components manually (without nlp.add_pipe and all the benefits of disabling pipes).

The component_cfg argument doesn't help us pass data per document either. (Moreover, its terminology and usage are component oriented, while meta data should be document oriented.)

Solution

Please see my fork and compare. The texts_meta option in nlp.pipe assigns each text's meta data to its Doc (as doc.text_meta) right before the components run. That means the components can build logic based on the meta data.

Example:

import datetime

article1 = article2 = article3 = article4 = 'This is just a test article'
date1 = date2 = date3 = date4 = datetime.datetime(2020, 7, 2)

texts = [article1, article2, article3, article4]
texts_meta = [
    {'date': date1, 'tags': ['politics']},
    {'date': date2, 'tags': ['lifestyle']},
    {'date': date3, 'tags': ['world']},
    {'date': date4, 'tags': ['tech', 'hardware']}
]

for doc in nlp.pipe(texts, texts_meta=texts_meta):  # now components can access text meta data
    meta = doc.text_meta  # it's also possible to access text meta after the components' life cycle
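
The idea can also be shown without spaCy at all. Below is a minimal, self-contained sketch of the mechanism, assuming a toy pipeline: ToyDoc, toy_pipe and disease_or_beer are hypothetical names made up for illustration, and they are neither spaCy's API nor the code in the fork. It only shows why attaching the meta data before the components run lets a component decide between the disease and the beer, using March 1, 2020 as an assumed cut-off.

```python
import datetime
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Doc; not spaCy's class."""
    text: str
    text_meta: dict = field(default_factory=dict)
    labels: dict = field(default_factory=dict)


def disease_or_beer(doc):
    """Toy component: label "Corona" using the publication date, if known."""
    if "Corona" in doc.text:
        date = doc.text_meta.get("date")
        if date is None:
            # No meta data reached the component, so it cannot decide.
            doc.labels["Corona"] = "UNKNOWN"
        elif date >= datetime.datetime(2020, 3, 1):  # assumed cut-off
            doc.labels["Corona"] = "DISEASE"
        else:
            doc.labels["Corona"] = "BEER"
    return doc


def toy_pipe(texts, components, texts_meta=None):
    """Attach each text's meta data to its doc BEFORE any component runs."""
    metas = texts_meta if texts_meta is not None else [{} for _ in texts]
    for text, meta in zip(texts, metas):
        doc = ToyDoc(text, text_meta=dict(meta))
        for component in components:
            doc = component(doc)
        yield doc


texts = ["Corona sales are up", "Corona cases are up"]
texts_meta = [
    {"date": datetime.datetime(2019, 7, 2)},
    {"date": datetime.datetime(2020, 7, 2)},
]

with_meta = [d.labels for d in toy_pipe(texts, [disease_or_beer], texts_meta)]
without_meta = [d.labels for d in toy_pipe(texts, [disease_or_beer])]
```

With the meta data attached up front, the toy component labels the first text as the beer and the second as the disease. Without it, the component has nothing to go on, which is the situation when the context only arrives after the components have finished.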

Side notes

In my solution, I added a text_meta property to the Doc. It is possible to leave the Doc untouched and use the existing user_data property instead, but its usage wasn't clear to me, and there might be conflicts.
I wonder whether adding a user_ prefix, as in user_text_meta, would better follow the convention of denoting that the user is in charge of the content of this property (and that it is not automatically populated by the library).
Do you think context and as_tuples might become redundant because of it, or do they still serve another goal?
I’ll be glad to hear your opinions, and whether I’ve missed something on the way.
If it seems like a good idea, I will be honored to fine tune the code and make a pull request.

What do you think?

Issue Analytics

Created 3 years ago
Comments: 10 (6 by maintainers)

Top GitHub Comments

Getitdan commented, Jul 26, 2020

Hello @adrianeboyd, thank you for your response. It's clever, and it gets the job done.

Don't you see any benefit in including this meta behavior as an integrated part of the Language class, as in my solution? Similar to as_tuples, and to the new option I added: use_tuples_as_texts_meta. Just curious to hear your opinion. Thanks!
