Should PegasusTokenizer replace `\n` with `<n>`?
(1) just pegasus (2) It doesn't handle the situation: it leaves `<n>` in the output and trusts `calculate_rouge_score` (which calls `add_newline_to_end_of_each_sentence`) to temporarily remove `<n>` and then add `\n` between sentences, thereby computing `rougeLsum` correctly. This happens after results are saved, and therefore generations still have `<n>`. https://github.com/huggingface/transformers/blob/dc552b9b7025ea9c38717f30ad3d69c2a972049d/examples/seq2seq/sentence_splitter.py#L18

Your first line is composed of the characters `\` and `n` and not the actual character customarily represented by `\n`, which is `hex(0a)` (ASCII code = 10), no?
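
To make the distinction concrete, here is a minimal sketch in plain Python (no `transformers` import needed). It shows that a backslash followed by `n` is two characters while the real newline is a single character with code 10, and how a `<n>` marker could be swapped for a real newline in saved generations. The helper `restore_newlines` is hypothetical and written only for illustration; it is not part of the library.

```python
# A backslash followed by the letter n: two separate characters.
literal_backslash_n = r"\n"

# The actual newline (line feed) character: one character, code 10 (0x0a).
real_newline = "\n"

print(len(literal_backslash_n), [ord(c) for c in literal_backslash_n])
print(len(real_newline), ord(real_newline))


def restore_newlines(text: str, marker: str = "<n>") -> str:
    """Hypothetical helper: replace the marker with a real newline."""
    return text.replace(marker, "\n")


# Example generation that still contains the marker.
generation = "First sentence.<n>Second sentence."
restored = restore_newlines(generation)
print(restored.splitlines())
```

If the input text contains the two-character sequence rather than the real newline, nothing that looks for the character `0x0a` will match it, which may be why the two cases behave differently.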