Pre-trained embeddings not used as feature for CRFEntityExtractor
Rasa version: 2.7.1
Rasa SDK version (if used & relevant): 2.7.0
Rasa X version (if used & relevant):
Python version: 3.8.8
Operating system (windows, osx, …): Windows-10-10.0.19041-SP0
Issue: In the docs for CRFEntityExtractor component, it says:
If you want to pass custom features, such as pre-trained word embeddings, to CRFEntityExtractor, you can add any dense featurizer to the pipeline before the CRFEntityExtractor. CRFEntityExtractor automatically finds the additional dense features and checks if the dense features are an iterable of len(tokens), where each entry is a vector. A warning will be shown in case the check fails.
However, I get identical results when using different language models, or even no language model at all. I’m using Rasa NLU only for a simple entity extraction task. This leads me to think that the pre-trained embeddings are not getting passed on to the CRFEntityExtractor, despite LanguageModelFeaturizer generating dense features and no warnings indicating that the pretrained embeddings are not passed.
For example, when training a CRFEntityExtractor using config1/2/3 on the same train data and testing also on the same test set, I get identical precision/recall/f1 results.
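To make the check described in the quoted documentation concrete, here is a minimal Python sketch of what "an iterable of len(tokens), where each entry is a vector" could look like as a validation step. The helper name `dense_features_match_tokens` is hypothetical and is not part of Rasa's API, and the requirement that all vectors share one dimension is an assumption added for illustration; the actual check inside CRFEntityExtractor may differ.

```python
from typing import Sequence


def dense_features_match_tokens(
    tokens: Sequence[str], dense_features: Sequence[Sequence[float]]
) -> bool:
    """Return True if there is exactly one non-empty vector per token.

    Hypothetical helper for illustration only; not Rasa's implementation.
    """
    # One feature vector is expected for every token.
    if len(dense_features) != len(tokens):
        return False
    # Nothing further to check for an empty message.
    if not tokens:
        return True
    # Assumption: all vectors share the same, non-zero dimension.
    dim = len(dense_features[0])
    return dim > 0 and all(len(vec) == dim for vec in dense_features)


tokens = ["book", "a", "table"]
good = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # one 2-d vector per token
bad = [[0.1, 0.2], [0.3, 0.4]]  # one vector is missing

print(dense_features_match_tokens(tokens, good))
print(dense_features_match_tokens(tokens, bad))
```

A check of this kind passing only shows that the features have a compatible shape. It does not show that the extractor actually consumes them, which is the behaviour this issue is about.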
Error (including full traceback):
Command or request that led to error:
rasa train nlu -c config1 (or 2 or 3)
rasa test nlu -c config1 (or 2 or 3)
Content of configuration file (config.yml) (if relevant):
Config 1
language: en
pipeline:
  - name: LanguageModelTokenizer
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor
Config 2
language: en
pipeline:
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor
Config 3
language: en
pipeline:
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
    model_name: "distilbert"
    model_weights: "distilbert-base-uncased"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor
Content of domain file (domain.yml) (if relevant):
Content of train data: I am just using a few utterances from the SNIPS dataset. Here’s a small example of my train data.
version: '2.0'
nlu:
- intent: General
  examples: |
    - find a [restaurant](restaurant_type) for [marylou and i](party_size_description) [within walking distance](spatial_relation) of [my mum s hotel](poi)
    - book a table at a [bar](restaurant_type) in [cambodia](country) that serves [cheese fries](served_dish)
    - i m in [bowling green](poi) please book a [restaurant](restaurant_type) for [1](party_size_number) [close by](spatial_relation)
    - book a [restaurant](restaurant_type) at a [steakhouse](restaurant_type) [around](spatial_relation) [in town](poi) that serves [empanada](served_dish) for [me and my son](party_size_description)
    - book me a table for [me and my nephew](party_size_description) [near](spatial_relation) [my location](poi) at an [indoor](facility) [pub](restaurant_type)
    - book a table for [me and belinda](party_size_description) serving [minestra](served_dish) in a [bar](restaurant_type)
    - i need seating for [ten](party_size_number) people at a [bar](restaurant_type) that serves [czech](cuisine) cuisine
    - book a spot for [connie earline and rose](party_size_description) at an [oyster bar](restaurant_type) that serves [chicken fried bacon](served_dish) in [beauregard](city) [delaware](state)
    - reserve a table for [two](party_size_number) at a [restaurant](restaurant_type) which serves [creole](cuisine) [around](spatial_relation) here in [myanmar](country)
    - take me a [top-rated](sort) [restaurant](restaurant_type) for [nine](party_size_number) [close](spatial_relation) to [westfield](city) [delaware](state)
    - book a [joint restaurant](restaurant_type) for [four](party_size_number) with an [outdoor](facility) [area within the same area](spatial_relation) as [borough de denali](poi)
    - make reservations for [7](party_size_number) people at a [top-rated](sort) [brazilian](cuisine) [pub](restaurant_type) [around](spatial_relation) [rockaway park-beach 116th](poi)
    - need to book a table [downtown](poi) [within walking distance](spatial_relation) of me at [j g melon](restaurant_name)
Definition of done
- Determine if this is only a documentation issue by looking through 1.4, 1.5, and 2.x + asking the research team
- If so, then we should update the docs ~and add warnings~
- ~Otherwise, create another issue for addressing this bug~
- Reviewed by @koernerfelicia
@tttthomasssss I am sure it’s accidental that the documentation lacks information on how to use dense features. We should add it if it’s not already there.
I had another poke at the issue, and it is possible to make CRFEntityExtractor use dense embeddings. The 3 configs below all give different results. It looks like it's more of a documentation issue now, as it's not documented how to configure CRFEntityExtractor to use dense features (so the code in question from the above comment is indeed very much in use). I am also not sure whether this is an intended or an accidental feature (@TyDunn or @dakshvar22 might know more?).
Config 1:
Config 2:
Config 3: