Regex Entity Extractor

Description of Problem: People on the forum ask for a regex entity extractor often enough (the regex featurizer often doesn't help when you need reliable exact-match extraction), and I'm no exception.

example: https://forum.rasa.com/t/unable-to-use-regex-feature/11976/2

Overview of the Solution: I created a custom component to do this, and it seems to be working well. I would like the Rasa core developers' opinion on whether this is useful and common enough to add as a built-in component. If so, I will be more than happy to contribute. Here is the component code.

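# Imports assume the Rasa 1.x module layout current when this issue was filed;
# the paths may differ in other versions.
import os
import re
import warnings
from typing import Any, Dict, List, Optional, Text

import rasa.utils.io
from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.extractors import EntityExtractor
from rasa.nlu.model import Metadata
from rasa.nlu.training_data import Message, TrainingData
from rasa.nlu.utils import write_json_to_file


class RegexEntityExtractor(EntityExtractor):
    # This extractor may be somewhat extreme: it takes the user's message
    # and returns the regex match as an entity.
    # Confidence will be 1.0, just like Duckling.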

    provides = ["entities"]

    def __init__(
        self,
        component_config: Optional[Dict[Text, Any]] = None,
        regex_features: Optional[List[Dict[Text, Text]]] = None
    ) -> None:
        super(RegexEntityExtractor, self).__init__(component_config)

        # List of {"name": ..., "pattern": ...} dicts, as iterated in match_regex
        self.regex_feature = regex_features if regex_features else []

    def train(
        self, training_data: TrainingData, config: RasaNLUModelConfig, **kwargs: Any
    ) -> None:

        self.regex_feature = training_data.regex_features

    @classmethod
    def load(
            cls,
            meta: Dict[Text, Any],
            model_dir: Optional[Text] = None,
            model_metadata: Optional[Metadata] = None,
            cached_component: Optional["RegexEntityExtractor"] = None,
            **kwargs: Any
    ) -> "RegexEntityExtractor":

        file_name = meta.get("file")

        if not file_name:
            regex_features = None
            return cls(meta, regex_features)

        # w/o string cast, mypy will tell me
        # expected "Union[str, _PathLike[str]]"
        regex_pattern_file = os.path.join(str(model_dir), file_name)
        if os.path.isfile(regex_pattern_file):
            regex_features = rasa.utils.io.read_json_file(regex_pattern_file)
        else:
            regex_features = None
            warnings.warn(
                "Failed to load regex pattern file from '{}'".format(regex_pattern_file)
            )
        return cls(meta, regex_features)

    def persist(self, file_name: Text, model_dir: Text) -> Optional[Dict[Text, Any]]:
        """Persist this component to disk for future loading."""
        if self.regex_feature:
            file_name = file_name + ".json"
            regex_feature_file = os.path.join(model_dir, file_name)
            write_json_to_file(
                regex_feature_file,
                self.regex_feature, separators=(",", ": "))
            return {"file": file_name}
        else:
            return {"file": None}

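    def match_regex(self, message):
        """Return an entity for the first match of each regex pattern."""
        extracted = []
        for d in self.regex_feature:
            match = re.search(pattern=d['pattern'], string=message)
            if match:
                entity = {
                    # start()/end() give the span of the matched text;
                    # pos/endpos are only the bounds of the searched region
                    "start": match.start(),
                    "end": match.end(),
                    "value": match.group(),
                    "confidence": 1.0,
                    "entity": d['name'],
                }
                extracted.append(entity)

        extracted = self.add_extractor_name(extracted)
        return extracted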

    def process(self, message: Message, **kwargs: Any) -> None:
        """Process an incoming message."""
        extracted = self.match_regex(message.text)
        message.set(
            "entities", message.get("entities", []) + extracted, add_to_output=True
        )
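
The offsets reported for an entity can be checked in isolation. On a Python match object, `pos` and `endpos` are the bounds of the region that was searched, while `start()` and `end()` give the span of the matched text. The following is a minimal standalone sketch, not part of the component in the issue: the `find_entities` helper and the sample `zipcode` pattern are hypothetical, and it uses `re.finditer` so that repeated occurrences of a pattern are also returned.

```python
import re
from typing import Any, Dict, List


def find_entities(text: str, patterns: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Return one entity dict per regex match, with character offsets."""
    entities = []
    for p in patterns:
        # finditer yields every non-overlapping match of the pattern
        for m in re.finditer(p["pattern"], text):
            entities.append(
                {
                    "start": m.start(),  # offset where the matched text begins
                    "end": m.end(),      # offset just past the matched text
                    "value": m.group(),
                    "confidence": 1.0,
                    "entity": p["name"],
                }
            )
    return entities


# Hypothetical pattern list in the same name/pattern shape the component iterates over
patterns = [{"name": "zipcode", "pattern": r"\b\d{5}\b"}]
text = "ship to 94105 or 10001"

m = re.search(patterns[0]["pattern"], text)
print(m.pos, m.endpos)     # bounds of the searched region
print(m.start(), m.end())  # span of the first match
print(find_entities(text, patterns))
```

Whether to return only the first match per pattern or every match is a design choice; returning every match is closer to what a user usually expects when a message mentions the same kind of entity twice.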

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

naoko commented, Jun 29, 2019 (2 reactions)

Mostly yes, but not for unique patterns such as [a-z]{3}a-\d{5}. Okay, sounds good. Feel free to close if the team decides not to proceed; I can continue to use it as a custom component.

A more detailed analysis of the differences is described here: https://medium.com/@naoko.reeves/rasa-regex-entity-extraction-317f047b28b6

nishant-roambee commented, Nov 5, 2019 (1 reaction)

Exactly what was bothering me. What comes with Rasa by default doesn't cut it for our use case. We needed to detect UUID v4 and IMEI numbers and some other internal hardware-specific identifier patterns, and I expected it to simply work. But even after providing a couple of examples, it only detects exact matches of those examples and fails to generalize. The custom class you provided was just the thing I was thinking of implementing to expand the out-of-the-box regex behavior. Thanks a lot.
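
For identifiers like these, the patterns themselves are short. The following is a rough sketch of what UUID v4 and IMEI patterns could look like in the name/pattern shape used by the component above. The `REGEX_FEATURES` list, the `luhn_valid` and `extract` helpers, and the sample message are all hypothetical and were not posted in the thread; the Luhn check is an optional extra filter, based on the last IMEI digit being a Luhn check digit.

```python
import re
from typing import List, Tuple

# Hypothetical regex features in the name/pattern shape used by the extractor
REGEX_FEATURES = [
    {
        "name": "uuid4",
        # version nibble is 4; variant nibble is one of 8, 9, a, b
        "pattern": r"(?i)\b[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}\b",
    },
    # 15 digits; the checksum is verified separately below
    {"name": "imei", "pattern": r"\b\d{15}\b"},
]


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def extract(text: str) -> List[Tuple[str, str, int, int]]:
    """Return (entity name, value, start, end) for each accepted match."""
    found = []
    for feature in REGEX_FEATURES:
        for m in re.finditer(feature["pattern"], text):
            if feature["name"] == "imei" and not luhn_valid(m.group()):
                continue  # 15-digit runs that fail the checksum are skipped
            found.append((feature["name"], m.group(), m.start(), m.end()))
    return found


message = (
    "device 123E4567-E89B-42D3-A456-426614174000 "
    "imei 490154203237518 bad 490154203237519"
)
for item in extract(message):
    print(item)
```

Because the match is rule-based, any string of the right shape is extracted without needing training examples, which is the behavior the comment above was looking for.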
