
Pre-trained embeddings not used as feature for CRFEntityExtractor

See original GitHub issue

Rasa version: 2.7.1

Rasa SDK version (if used & relevant): 2.7.0

Rasa X version (if used & relevant):

Python version: 3.8.8

Operating system (windows, osx, …): Windows-10-10.0.19041-SP0

Issue: The docs for the CRFEntityExtractor component say:

If you want to pass custom features, such as pre-trained word embeddings, to CRFEntityExtractor, you can add any dense featurizer to the pipeline before the CRFEntityExtractor. CRFEntityExtractor automatically finds the additional dense features and checks if the dense features are an iterable of len(tokens), where each entry is a vector. A warning will be shown in case the check fails.
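The check the docs describe can be sketched roughly as follows. This is an illustrative snippet with hypothetical names, not Rasa's actual implementation: it only verifies that the dense features are an iterable of `len(tokens)` entries, each of which is a single vector.

```python
import numpy as np

def dense_features_are_valid(dense_features, tokens):
    """Sketch of the documented check: dense features must be an
    iterable of len(tokens), where each entry is a vector."""
    if len(dense_features) != len(tokens):
        return False
    # Each entry should be a 1-D vector (one embedding per token).
    return all(np.asarray(entry).ndim == 1 for entry in dense_features)

tokens = ["book", "a", "table"]
embeddings = np.random.rand(3, 768)  # one 768-dim vector per token
print(dense_features_are_valid(embeddings, tokens))
```

According to the docs, Rasa would emit a warning when this kind of check fails rather than silently dropping the features, which is why the absence of a warning here is surprising.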

However, I get identical results when using different language models, or even no language model at all. I’m using Rasa NLU only, for a simple entity extraction task. This leads me to think that the pre-trained embeddings are not being passed to the CRFEntityExtractor, even though LanguageModelFeaturizer generates dense features and no warning is shown to indicate that they are missing.

For example, when training a CRFEntityExtractor with config 1, 2, or 3 on the same training data and evaluating on the same test set, I get identical precision/recall/F1 results.

Error (including full traceback):

Command or request that led to error:

rasa train nlu -c config1 (or 2 or 3)
rasa test nlu -c config1  (or 2 or 3)

Content of configuration file (config.yml) (if relevant):

Config 1

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Config 2

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Config 3

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
    model_name: "distilbert"
    model_weights: "distilbert-base-uncased"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Content of domain file (domain.yml) (if relevant):


Content of train data: I am just using a few utterances from the SNIPS dataset. Here’s a small sample of my training data.

version: '2.0'

nlu:
- intent: General
  examples: |
    - find a [restaurant](restaurant_type) for [marylou and i](party_size_description) [within walking distance](spatial_relation) of [my mum s hotel](poi)
    - book a table at a [bar](restaurant_type) in [cambodia](country) that serves [cheese fries](served_dish)
    - i m in [bowling green](poi) please book a [restaurant](restaurant_type) for [1](party_size_number) [close by](spatial_relation)
    - book a [restaurant](restaurant_type) at a [steakhouse](restaurant_type) [around](spatial_relation) [in town](poi) that serves [empanada](served_dish) for [me and my son](party_size_description)
    - book me a table for [me and my nephew](party_size_description) [near](spatial_relation) [my location](poi) at an [indoor](facility) [pub](restaurant_type)
    - book a table for [me and belinda](party_size_description) serving [minestra](served_dish) in a [bar](restaurant_type)
    - i need seating for [ten](party_size_number) people at a [bar](restaurant_type) that serves [czech](cuisine) cuisine
    - book a spot for [connie earline and rose](party_size_description) at an [oyster bar](restaurant_type) that serves [chicken fried bacon](served_dish) in [beauregard](city) [delaware](state)
    - reserve a table for [two](party_size_number) at a [restaurant](restaurant_type) which serves [creole](cuisine) [around](spatial_relation) here in [myanmar](country)
    - take me a [top-rated](sort) [restaurant](restaurant_type) for [nine](party_size_number) [close](spatial_relation) to [westfield](city) [delaware](state)
    - book a [joint restaurant](restaurant_type) for [four](party_size_number) with an [outdoor](facility) [area within the same area](spatial_relation) as [borough de denali](poi)
    - make reservations for [7](party_size_number) people at a [top-rated](sort) [brazilian](cuisine) [pub](restaurant_type) [around](spatial_relation) [rockaway park-beach 116th](poi)
    - need to book a table [downtown](poi) [within walking distance](spatial_relation) of me at [j g melon](restaurant_name)

Definition of done

  • Determine if this is only a documentation issue by looking through 1.4, 1.5, and 2.x + asking the research team
  • If so, then we should update the docs ~and add warnings~
  • ~Otherwise, create another issue for addressing this bug~
  • Reviewed by @koernerfelicia

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 15 (12 by maintainers)

Top GitHub Comments

1 reaction
dakshvar22 commented, Sep 6, 2021

@tttthomasssss I am sure it’s accidental that the documentation lacks information on how to use dense features. We should add it if it’s not already there.

1 reaction
tttthomasssss commented, Aug 31, 2021

I had another poke at the issue, and it is possible to make CRFEntityExtractor use dense embeddings: the 3 configs below all give different results. It looks like it’s more of a documentation issue now, as it’s not documented how to configure CRFEntityExtractor to use dense features (so the code in question from the above comment is indeed very much in use). I am also not sure whether this is an intended or an accidental feature (@TyDunn or @dakshvar22 might know more?).

Config 1:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: LexicalSyntacticFeaturizer
    "features": [
        # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor
    "features": [["text_dense_features"]]

Config 2:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: LexicalSyntacticFeaturizer
    "features": [
        # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Config 3:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LexicalSyntacticFeaturizer
    "features": [
        # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor
    "features": [["text_dense_features"]]
