How to score text with trained language model

I successfully trained a Transformer language model with fairseq. Now I would like to score text with this model.

This is what I am looking for:

echo "Input text to be scored by lm" | fairseq-score trained_model_path/checkpoint_best.pt
78.23 # example language model perplexity score for this sentence

Alternatively, something like

import torch
from fairseq.models.transformer_lm import TransformerLanguageModel

custom_lm = TransformerLanguageModel.from_pretrained('trained_model_path', 'checkpoint_best.pt')
custom_lm.score('Input text to be scored by lm')
# 78.23 # example language model perplexity score for this sentence

Looking here:

https://github.com/pytorch/fairseq/tree/master/examples/language_model

and here:

https://fairseq.readthedocs.io/en/latest/command_line_tools.html#fairseq-eval-lm

it seems that I have to binarize my test data with fairseq-preprocess, which I want to avoid.

What is the easiest way to score plain text with a trained fairseq LM?

Issue Analytics

State:
Created 4 years ago
Comments: 7 (6 by maintainers)

Top GitHub Comments

3 reactions

bricksdont commented, Dec 18, 2019

My solution, without having much insight into torch and fairseq:

import torch
import copy

from fairseq import hub_utils
from fairseq.models.fairseq_model import FairseqLanguageModel


class GeneratorHubInterfaceWithScoring(hub_utils.GeneratorHubInterface):

    def score(self,
              sentence: str,
              verbose: bool = False,
              **kwargs) -> float:

        tokens = sentence.split(" ")
        num_tokens = len(tokens)

        encoded_sentence = self.binarize(sentence)
        sample = self._build_sample(encoded_sentence)

        # build generator using current args as well as any kwargs
        gen_args = copy.copy(self.args)
        gen_args.beam = 1
        gen_args.max_len_b = num_tokens
        for k, v in kwargs.items():
            setattr(gen_args, k, v)
        generator = self.task.build_generator(gen_args)

        translations = self.task.inference_step(generator, self.models, sample)

        # first hypothesis for the first (and only) sentence in the batch
        hypo = translations[0][0]
        score = hypo['score']

        scored_tokens = hypo['tokens']
        scored_sentence = self.string(scored_tokens)

        # sanity check: the decoded hypothesis must match the input exactly
        assert sentence == scored_sentence, (
            "Input tokens and the ones that are actually scored "
            "do not seem identical:\n%s\n%s" % (sentence, scored_sentence)
        )

        if verbose:
            print("TOKENS:\t%s" % scored_tokens)

        return score


class FairseqLanguageModelWithScoring(FairseqLanguageModel):

    @classmethod
    def from_pretrained(cls, model_name_or_path, checkpoint_file='model.pt', data_name_or_path='.', **kwargs):

        x = hub_utils.from_pretrained(
            model_name_or_path,
            checkpoint_file,
            data_name_or_path,
            archive_map=cls.hub_models(),
            **kwargs,
        )
        
        return GeneratorHubInterfaceWithScoring(x['args'], x['task'], x['models'])

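Then load the checkpoint (args.model_dir is the directory that contains it) and score a sentence:

custom_lm = FairseqLanguageModelWithScoring.from_pretrained(args.model_dir, 'checkpoint_best.pt')
print(custom_lm.score('Input text to be scored by lm'))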

2 reactions

myleott commented, Dec 18, 2019

Added a .score function in 9d7725226da3fcd9c5d1ac02473289f53cd7dd78. It should be much faster than using generate.

Usage:

en_lm.score('Barack Obama is coming to Sydney and New Zealand')['positional_scores']
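
The question asks for a single perplexity value, while the usage above returns per-token scores. Here is a minimal sketch of the conversion, assuming positional_scores holds natural-log token probabilities. The helper perplexity_from_log_probs is hypothetical and not part of fairseq; it only illustrates the arithmetic on a plain list of floats.

```python
import math
from typing import Sequence


def perplexity_from_log_probs(log_probs: Sequence[float]) -> float:
    """Perplexity = exp(-mean(log p)) over the scored tokens."""
    if not log_probs:
        raise ValueError("need at least one token score")
    mean_log_prob = sum(log_probs) / len(log_probs)
    return math.exp(-mean_log_prob)


# Toy input: four tokens, each assigned probability 0.5 by the model.
toy_scores = [math.log(0.5)] * 4
toy_ppl = perplexity_from_log_probs(toy_scores)
print(round(toy_ppl, 6))  # 2.0
```

With fairseq installed, the input list would presumably come from something like custom_lm.score(text)['positional_scores'].tolist(). Applied directly to the tensor, the same value should be obtainable with .mean().neg().exp().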
