How to score text with trained language model

I successfully trained a Transformer language model with fairseq. Now I would like to score text with this model.

This is what I am looking for:

echo "Input text to be scored by lm" | fairseq-score trained_model_path/checkpoint_best.pt
78.23 # example language model perplexity score for this sentence

Alternatively, something like this:

import torch
from fairseq.models.transformer_lm import TransformerLanguageModel

custom_lm = TransformerLanguageModel.from_pretrained('trained_model_path', 'checkpoint_best.pt')
custom_lm.score('Input text to be scored by lm')
# 78.23 # example language model perplexity score for this sentence

Looking here:

https://github.com/pytorch/fairseq/tree/master/examples/language_model

and here:

https://fairseq.readthedocs.io/en/latest/command_line_tools.html#fairseq-eval-lm

it seems that I have to binarize my test data with fairseq-preprocess, which I want to avoid.

What is the easiest way to score plain text with a trained fairseq LM?

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

bricksdont commented, Dec 18, 2019 (3 reactions)

My solution, without having much insight into torch and fairseq:

import copy

from fairseq import hub_utils
from fairseq.models.fairseq_model import FairseqLanguageModel


class GeneratorHubInterfaceWithScoring(hub_utils.GeneratorHubInterface):

    def score(self,
              sentence: str,
              verbose: bool = False,
              **kwargs) -> float:
        """Return the generator's score for `sentence`."""

        # whitespace token count, used below as the max_len_b generation setting
        tokens = sentence.split(" ")
        num_tokens = len(tokens)

        # map the raw string to dictionary indices and build a sample from it
        encoded_sentence = self.binarize(sentence)
        sample = self._build_sample(encoded_sentence)

        # build generator using current args as well as any kwargs
        gen_args = copy.copy(self.args)
        gen_args.beam = 1
        gen_args.max_len_b = num_tokens
        for k, v in kwargs.items():
            setattr(gen_args, k, v)
        generator = self.task.build_generator(gen_args)

        translations = self.task.inference_step(generator, self.models, sample)

        # first hypothesis for the first (and only) sentence in the sample
        hypo = translations[0][0]
        score = hypo['score']

        scored_tokens = hypo['tokens']
        scored_sentence = self.string(scored_tokens)

        # guard against scoring something other than the input text
        assert sentence == scored_sentence, (
            "Input tokens and the ones that are actually scored do not seem identical:\n%s\n%s"
            % (sentence, scored_sentence)
        )

        if verbose:
            print("TOKENS:\t%s" % scored_tokens)

        return score


class FairseqLanguageModelWithScoring(FairseqLanguageModel):

    @classmethod
    def from_pretrained(cls, model_name_or_path, checkpoint_file='model.pt', data_name_or_path='.', **kwargs):

        x = hub_utils.from_pretrained(
            model_name_or_path,
            checkpoint_file,
            data_name_or_path,
            archive_map=cls.hub_models(),
            **kwargs,
        )
        
        return GeneratorHubInterfaceWithScoring(x['args'], x['task'], x['models'])

Then:

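# args.model_dir is assumed to hold the directory that contains the checkpoint
custom_lm = FairseqLanguageModelWithScoring.from_pretrained(args.model_dir, 'checkpoint_best.pt')
score = custom_lm.score('Input text to be scored by lm')

myleott commented, Dec 18, 2019 (2 reactions)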

Added a .score function in 9d7725226da3fcd9c5d1ac02473289f53cd7dd78. It should be much faster than using generate.

Usage:

en_lm.score('Barack Obama is coming to Sydney and New Zealand')['positional_scores']
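
To get a single perplexity number like the one asked for in the question, the per-token scores can be averaged and exponentiated. The following is a minimal sketch, assuming that positional_scores holds natural-log token probabilities; the helper perplexity_from_log_probs and the example values are hypothetical and not part of fairseq.

```python
import math
from typing import Sequence


def perplexity_from_log_probs(log_probs: Sequence[float]) -> float:
    """Per-token perplexity: exp of the negative mean log-probability.

    Assumes the scores are natural-log probabilities, one per token.
    """
    if not log_probs:
        raise ValueError("need at least one token score")
    mean_log_prob = sum(log_probs) / len(log_probs)
    return math.exp(-mean_log_prob)


# Hypothetical scores for a four-token sentence in which every token
# was assigned probability 0.5 by the model.
example_scores = [math.log(0.5)] * 4
print(perplexity_from_log_probs(example_scores))
```

With a loaded model, the tensor returned under 'positional_scores' could be converted with .tolist() and passed to this helper; the same value should also be obtainable directly on the tensor with .mean().neg().exp().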