Regex Entity Extractor
Description of Problem: I hear often enough in the forum that people need a Regex Entity Extractor (the featurizer often doesn't help when you need reliable exact-match extraction), and I'm no exception.
example: https://forum.rasa.com/t/unable-to-use-regex-feature/11976/2
Overview of the Solution: I created a custom component to do this, and it seems to be working well. I'd like to get the Rasa core developers' opinion on whether this is useful and common enough to add as a built-in component. If so, I will be more than happy to contribute. Here is the component code.
import os
import re
import warnings
from typing import Any, Dict, List, Optional, Text

# Import paths assumed from Rasa 1.x (current when this issue was opened);
# newer releases may have moved these modules.
import rasa.utils.io
from rasa.nlu.config import RasaNLUModelConfig
from rasa.nlu.extractors import EntityExtractor
from rasa.nlu.model import Metadata
from rasa.nlu.training_data import Message, TrainingData
from rasa.nlu.utils import write_json_to_file


class RegexEntityExtractor(EntityExtractor):
    # This extractor may be kind of extreme, as it takes the user's message
    # and returns the regex match.
    # Confidence will be 1.0, just like Duckling.

    provides = ["entities"]

    def __init__(
        self,
        component_config: Optional[Dict[Text, Text]] = None,
        regex_features: Optional[List[Dict[Text, Any]]] = None,
    ) -> None:
        super(RegexEntityExtractor, self).__init__(component_config)
        # A list of {"name": ..., "pattern": ...} dicts, as iterated in match_regex.
        self.regex_feature = regex_features if regex_features else []

    def train(
        self, training_data: TrainingData, config: RasaNLUModelConfig, **kwargs: Any
    ) -> None:
        self.regex_feature = training_data.regex_features

    @classmethod
    def load(
        cls,
        meta: Dict[Text, Any],
        model_dir: Optional[Text] = None,
        model_metadata: Optional[Metadata] = None,
        cached_component: Optional["RegexEntityExtractor"] = None,
        **kwargs: Any
    ) -> "RegexEntityExtractor":
        file_name = meta.get("file")
        if not file_name:
            # Nothing was persisted, so start with no patterns.
            regex_features = None
            return cls(meta, regex_features)

        # w/o string cast, mypy will tell me
        # expected "Union[str, _PathLike[str]]"
        regex_pattern_file = os.path.join(str(model_dir), file_name)
        if os.path.isfile(regex_pattern_file):
            regex_features = rasa.utils.io.read_json_file(regex_pattern_file)
        else:
            regex_features = None
            warnings.warn(
                "Failed to load regex pattern file from '{}'".format(regex_pattern_file)
            )
        return cls(meta, regex_features)

    def persist(self, file_name: Text, model_dir: Text) -> Optional[Dict[Text, Any]]:
        """Persist this component to disk for future loading."""
        if self.regex_feature:
            file_name = file_name + ".json"
            regex_feature_file = os.path.join(model_dir, file_name)
            write_json_to_file(
                regex_feature_file,
                self.regex_feature, separators=(",", ": "))
            return {"file": file_name}
        else:
            return {"file": None}

    def match_regex(self, message: Text) -> List[Dict[Text, Any]]:
        extracted = []
        for d in self.regex_feature:
            # Only the first occurrence of each pattern is extracted.
            match = re.search(pattern=d['pattern'], string=message)
            if match:
                entity = {
                    # start()/end() give the span of the matched text;
                    # match.pos/match.endpos are the search bounds instead.
                    "start": match.start(),
                    "end": match.end(),
                    "value": match.group(),
                    "confidence": 1.0,
                    "entity": d['name'],
                }
                extracted.append(entity)
        extracted = self.add_extractor_name(extracted)
        return extracted

    def process(self, message: Message, **kwargs: Any) -> None:
        """Process an incoming message."""
        extracted = self.match_regex(message.text)
        message.set(
            "entities", message.get("entities", []) + extracted, add_to_output=True
        )
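A note on entity offsets in match_regex: in Python's re module, match.pos and match.endpos report the bounds of the search (the whole string by default), while match.start() and match.end() report where the matched text actually sits. The sketch below shows the difference on a made-up message and a made-up pattern; neither comes from the issue.

```python
import re

# Hypothetical message and pattern, used only to illustrate the offsets.
text = "my order id is abc-12345 thanks"
match = re.search(r"[a-z]{3}-\d{5}", text)

# pos/endpos: the region of the string that was searched.
search_bounds = (match.pos, match.endpos)

# start()/end(): the span of the text that matched.
span = (match.start(), match.end())

print(search_bounds)             # (0, 31), i.e. the whole message
print(span)                      # (15, 24)
print(text[span[0]:span[1]])     # abc-12345
```

Slicing the message with the start()/end() span gives back exactly the extracted value, which is what downstream consumers of "start" and "end" would normally expect.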
Issue Analytics
- State:
- Created 4 years ago
- Reactions:3
- Comments:7 (5 by maintainers)
Top GitHub Comments
Mostly yes, but not for unique patterns like [a-z]{3}a-\d{5} etc. Okay, sounds good. Feel free to close this if the team decides not to proceed. I can continue using it as a custom component.
A more detailed analysis of the differences is described here: https://medium.com/@naoko.reeves/rasa-regex-entity-extraction-317f047b28b6
Exactly what was bothering me. What comes with Rasa by default doesn't cut it for our use case. We needed to detect UUID v4 values, IMEI numbers, and some other internal hardware-specific identifier patterns, and I expected it to simply work, but even after I provided a couple of examples, it only matched those examples and failed to generalize. The custom class you provided was just the thing I was thinking of implementing to expand the out-of-the-box regex behavior. Super thanks.
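For anyone with a similar use case, here is a rough standalone sketch of exact-match extraction for UUID v4 and IMEI-like values. It does not depend on Rasa. The REGEX_FEATURES list, the extract_entities helper, and both patterns are my own assumptions for illustration, written in the same {"name", "pattern"} shape the component reads. It uses re.finditer rather than re.search, so every occurrence is returned, not just the first. The IMEI pattern only checks for 15 digits and does not validate the check digit.

```python
import re
from typing import Any, Dict, List

# Hypothetical patterns, not taken from the issue.
REGEX_FEATURES: List[Dict[str, str]] = [
    {
        "name": "uuid",
        # UUID v4: version nibble is 4, variant nibble is 8, 9, a, or b.
        "pattern": r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-4[0-9a-fA-F]{3}"
                   r"-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b",
    },
    # Exactly 15 digits; no Luhn check-digit validation.
    {"name": "imei", "pattern": r"\b\d{15}\b"},
]


def extract_entities(
    text: str, regex_features: List[Dict[str, str]]
) -> List[Dict[str, Any]]:
    """Return one entity dict per regex match, for every pattern."""
    entities = []
    for feature in regex_features:
        for match in re.finditer(feature["pattern"], text):
            entities.append(
                {
                    "start": match.start(),
                    "end": match.end(),
                    "value": match.group(),
                    "confidence": 1.0,
                    "entity": feature["name"],
                }
            )
    return entities


message = "device 490154203237518 has id 123e4567-e89b-42d3-a456-426614174000"
for entity in extract_entities(message, REGEX_FEATURES):
    print(entity["entity"], entity["value"])
```

Entities come back grouped by pattern in the order the patterns are listed, so sort by "start" if you need them in message order.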