Pre-trained embeddings not used as feature for CRFEntityExtractor

Rasa version: 2.7.1

Rasa SDK version (if used & relevant): 2.7.0

Rasa X version (if used & relevant):

Python version: 3.8.8

Operating system (windows, osx, …): Windows-10-10.0.19041-SP0

Issue: In the docs for the CRFEntityExtractor component, it says:

If you want to pass custom features, such as pre-trained word embeddings, to CRFEntityExtractor, you can add any dense featurizer to the pipeline before the CRFEntityExtractor. CRFEntityExtractor automatically finds the additional dense features and checks if the dense features are an iterable of len(tokens), where each entry is a vector. A warning will be shown in case the check fails.
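
To make the quoted check concrete, here is a minimal sketch of what "an iterable of len(tokens), where each entry is a vector" could mean in practice. It is plain Python and does not call Rasa; the function name `dense_features_match_tokens` and the warning text are made up for illustration and are not the library's implementation.

```python
import warnings
from typing import Any, Sequence


def _is_vector(entry: Any) -> bool:
    """True if entry is a non-empty iterable of numbers."""
    try:
        values = list(entry)
    except TypeError:
        # Scalars such as 0.1 are not iterable, so they are not vectors.
        return False
    return len(values) > 0 and all(isinstance(v, (int, float)) for v in values)


def dense_features_match_tokens(
    tokens: Sequence[str], dense_features: Sequence[Any]
) -> bool:
    """Hypothetical version of the check described in the docs:
    one vector per token, otherwise warn and report failure."""
    if len(dense_features) != len(tokens):
        warnings.warn(
            f"Got {len(dense_features)} dense entries for {len(tokens)} tokens."
        )
        return False
    if not all(_is_vector(entry) for entry in dense_features):
        warnings.warn("Not every dense feature entry is a vector.")
        return False
    return True
```

Note that a check like this only says whether the features are well formed. It says nothing about whether the extractor is configured to consume them.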

However, I get identical results when using different language models, or even no language model at all. I’m using Rasa NLU only for a simple entity extraction task. This leads me to think that the pre-trained embeddings are not getting passed on to the CRFEntityExtractor, despite LanguageModelFeaturizer generating dense features and no warnings indicating that the pretrained embeddings are not passed.

For example, when training a CRFEntityExtractor using config1/2/3 on the same train data and testing also on the same test set, I get identical precision/recall/f1 results.
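
A quick way to confirm that the runs really are identical, rather than just close, is to compare the metrics programmatically. The sketch below assumes you have already collected precision/recall/f1 per config into a dict; the helper name `identical_metrics` and the numbers are placeholders for illustration, not values from this issue.

```python
from typing import Dict

Metrics = Dict[str, float]


def identical_metrics(results: Dict[str, Metrics], tol: float = 1e-9) -> bool:
    """True if every run reports the same value for every metric."""
    runs = list(results.values())
    if len(runs) < 2:
        return True
    first = runs[0]
    for other in runs[1:]:
        if other.keys() != first.keys():
            return False
        if any(abs(other[key] - first[key]) > tol for key in first):
            return False
    return True


# Placeholder numbers, for illustration only.
results = {
    "config1": {"precision": 0.5, "recall": 0.5, "f1": 0.5},
    "config2": {"precision": 0.5, "recall": 0.5, "f1": 0.5},
    "config3": {"precision": 0.5, "recall": 0.5, "f1": 0.5},
}

if identical_metrics(results):
    print("All configs scored identically; the featurizer may have no effect.")
```

Bit-for-bit identical scores across different language models are a strong hint that the extra features never reach the extractor.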

Error (including full traceback):

Command or request that led to error:

rasa train nlu -c config1 (or 2 or 3)
rasa test nlu -c config1  (or 2 or 3)

Content of configuration file (config.yml) (if relevant):

Config 1

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Config 2

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Config 3

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
    model_name: "distilbert"
    model_weights: "distilbert-base-uncased"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Content of domain file (domain.yml) (if relevant):

Content of train data: I am just using a few utterances from the SNIPS dataset. Here’s a small example of my train data.

version: '2.0'

nlu:
- intent: General
  examples: |
    - find a [restaurant](restaurant_type) for [marylou and i](party_size_description) [within walking distance](spatial_relation) of [my mum s hotel](poi)
    - book a table at a [bar](restaurant_type) in [cambodia](country) that serves [cheese fries](served_dish)
    - i m in [bowling green](poi) please book a [restaurant](restaurant_type) for [1](party_size_number) [close by](spatial_relation)
    - book a [restaurant](restaurant_type) at a [steakhouse](restaurant_type) [around](spatial_relation) [in town](poi) that serves [empanada](served_dish) for [me and my son](party_size_description)
    - book me a table for [me and my nephew](party_size_description) [near](spatial_relation) [my location](poi) at an [indoor](facility) [pub](restaurant_type)
    - book a table for [me and belinda](party_size_description) serving [minestra](served_dish) in a [bar](restaurant_type)
    - i need seating for [ten](party_size_number) people at a [bar](restaurant_type) that serves [czech](cuisine) cuisine
    - book a spot for [connie earline and rose](party_size_description) at an [oyster bar](restaurant_type) that serves [chicken fried bacon](served_dish) in [beauregard](city) [delaware](state)
    - reserve a table for [two](party_size_number) at a [restaurant](restaurant_type) which serves [creole](cuisine) [around](spatial_relation) here in [myanmar](country)
    - take me a [top-rated](sort) [restaurant](restaurant_type) for [nine](party_size_number) [close](spatial_relation) to [westfield](city) [delaware](state)
    - book a [joint restaurant](restaurant_type) for [four](party_size_number) with an [outdoor](facility) [area within the same area](spatial_relation) as [borough de denali](poi)
    - make reservations for [7](party_size_number) people at a [top-rated](sort) [brazilian](cuisine) [pub](restaurant_type) [around](spatial_relation) [rockaway park-beach 116th](poi)
    - need to book a table [downtown](poi) [within walking distance](spatial_relation) of me at [j g melon](restaurant_name)

Definition of done

Determine if this is only a documentation issue by looking through 1.4, 1.5, and 2.x + asking the research team
If so, then we should update the docs ~and add warnings~
~Otherwise, create another issue for addressing this bug~
Reviewed by @koernerfelicia

Issue Analytics

State:
Created 2 years ago
Comments:15 (12 by maintainers)

Top GitHub Comments

dakshvar22 commented, Sep 6, 2021

@tttthomasssss I am sure it’s accidental that the documentation lacks information on how to use dense features. We should add it if it’s not already there.

tttthomasssss commented, Aug 31, 2021

I had another poke at the issue, and it is possible to make CRFEntityExtractor use dense embeddings. The 3 configs below all give different results. It now looks like more of a documentation issue, as it’s not documented how to configure CRFEntityExtractor to use dense features (so the code in question from the above comment is indeed very much in use). I am also not sure whether this is an intended or an accidental feature (@TyDunn or @dakshvar22 might know more?).

Config 1:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor
    "features": [["text_dense_features"]]

Config 2:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Config 3:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor
    "features": [["text_dense_features"]]
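
The difference between the three configs above can be turned into a small sanity check on an already-parsed pipeline. This is only a sketch: it assumes, based on the results reported in this thread, that CRFEntityExtractor uses dense embeddings only when "text_dense_features" is listed under its "features" and a dense featurizer runs earlier in the pipeline. The helper names and the DENSE_FEATURIZERS set are made up for illustration; extend the set with whatever dense featurizers you actually use.

```python
from typing import Any, Dict, List

Pipeline = List[Dict[str, Any]]

# Assumption: only the featurizer used in this thread is listed here.
DENSE_FEATURIZERS = {"LanguageModelFeaturizer"}


def has_dense_featurizer_before_crf(pipeline: Pipeline) -> bool:
    """True if a known dense featurizer appears before CRFEntityExtractor."""
    for component in pipeline:
        name = component.get("name")
        if name == "CRFEntityExtractor":
            return False
        if name in DENSE_FEATURIZERS:
            return True
    return False


def crf_requests_dense_features(pipeline: Pipeline) -> bool:
    """True if CRFEntityExtractor lists "text_dense_features" explicitly."""
    for component in pipeline:
        if component.get("name") != "CRFEntityExtractor":
            continue
        # An omitted "features" key is treated as "not requested" (assumption).
        for position in component.get("features", []):
            if "text_dense_features" in position:
                return True
    return False


def dense_embeddings_in_use(pipeline: Pipeline) -> bool:
    return has_dense_featurizer_before_crf(
        pipeline
    ) and crf_requests_dense_features(pipeline)


# Reduced versions of Config 1 and Config 2 from the comment above.
config1 = [
    {"name": "WhitespaceTokenizer"},
    {"name": "LanguageModelFeaturizer", "model_name": "roberta"},
    {"name": "CRFEntityExtractor", "features": [["text_dense_features"]]},
]
config2 = [
    {"name": "WhitespaceTokenizer"},
    {"name": "LanguageModelFeaturizer", "model_name": "roberta"},
    {"name": "CRFEntityExtractor"},
]

print(dense_embeddings_in_use(config1))
print(dense_embeddings_in_use(config2))
```

Under these assumptions, the configs from the original report would all fail the check, since none of them lists "text_dense_features" for CRFEntityExtractor.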