[BART] issues on BPE preprocess (examples.roberta.multiprocessing_bpe_encoder)
See original GitHub issue

Hi, congratulations on the great work! I appreciate you all for making these resources publicly available.
I was following the README for fine-tuning BART on the CNN/DM task. While performing step 2) BPE preprocess, I ran into some problems.
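For reference, this is step 2) from the README as I ran it (encoder.json and vocab.bpe are the GPT-2 BPE files the README downloads; paths follow the README's layout and may differ in other setups):

```bash
# BPE-encode every split with the GPT-2 encoder, one output line per input line
for SPLIT in train val
do
  for LANG in source target
  do
    python -m examples.roberta.multiprocessing_bpe_encoder \
      --encoder-json encoder.json \
      --vocab-bpe vocab.bpe \
      --inputs "cnn_dm/$SPLIT.$LANG" \
      --outputs "cnn_dm/$SPLIT.bpe.$LANG" \
      --workers 60 \
      --keep-empty
  done
done
```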
Here are the details of my problems:
- I found that the line counts of train.bpe.source and train.bpe.target are not identical. Both should be 287227, but processing train.source produced 247 extra lines:
```
ubuntu@server:~/fairseq/cnn_dm$ wc -l *
   11490 test.source
   11490 test.target
  287474 train.bpe.source  <= not matching
  287227 train.bpe.target
  287227 train.source
  287227 train.target
   13368 val.bpe.source
   13368 val.bpe.target
   13368 val.source
   13368 val.target
  200000 vocab
 1425607 total
```
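To narrow down where the 247 extra lines come from, I tried scanning train.source for characters that could be treated as extra line breaks somewhere in the pipeline. This is only a guess at the cause, and the check below is a hypothetical sketch, not a confirmed diagnosis:

```python
# Hypothetical check for problem 1: look for characters in train.source that
# downstream code might interpret as line breaks, which could inflate the
# line count of train.bpe.source. A guess, not a confirmed root cause.
SUSPECTS = {'\r': 'CR', '\x0b': 'VT', '\x0c': 'FF', '\u2028': 'LS', '\u2029': 'PS'}

with open('train.source', encoding='utf-8', newline='\n') as f:
    for lineno, line in enumerate(f, 1):
        hits = [name for ch, name in SUSPECTS.items() if ch in line]
        if hits:
            print(f'line {lineno}: contains {hits}: {line[:60]!r}')
```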
- While trying to track down problem 1, I ran into another problem that seems closely related. When I check val.bpe.target, the first BPE-encoded sentence looks like this:

32 582 287 20154 6182 318 6301 6729 2691 284 4297 287 23254 2585 13 1114 720 4531 11 339 481 4074 718 8059 286 6729 287 281 47869 42378 305 6513 321 3091 13

Using bart.decode(), I can decode it, and the result is:

are pay As spellszi If km wages Women familybut Asolia Con for idea global85 in win free 51il temporarily For wages AsasAlternativelyStage W Fin 0 sites for

whereas it should be:

A man in suburban Boston is selling snow online to customers in warmer states. For $89, he will ship 6 pounds of snow in an insulated Styrofoam box.

The same problem applies to the other BPE-processed files.
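For completeness, this is roughly how I produced the garbled output above (a sketch; I loaded the released bart.large checkpoint through torch.hub, and the exact loading step may differ):

```python
import torch

# Rough reproduction of problem 2 (sketch; model loading may differ).
bart = torch.hub.load('pytorch/fairseq', 'bart.large')
bart.eval()

# Read the first line of BPE token IDs from the preprocessed target file.
with open('val.bpe.target') as f:
    first_line = f.readline().split()      # ['32', '582', '287', ...]

ids = torch.tensor([int(tok) for tok in first_line])
print(bart.decode(ids))                    # prints the garbled text above
```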
It seems I am missing something. I am checking this on:
- Python 3.6
- stanford-corenlp-3.7.0.jar (3.9 also checked)
- PyTorch 1.0
- CUDA 10.0
- Ubuntu 16.04
Would you share any thoughts on the matter? It would help me a lot. Once again, thank you very much! WonJin
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
After checking, I'm also facing problem 2. (However, I don't hit problem 1.)
@Colanim yes, that's correct. It's a two-stage encoding process: first BPE encoding, followed by encoding with the fairseq Dictionary.
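A sketch of the distinction, assuming the bart.large hub model (bart.bpe is the GPT-2 BPE wrapper the hub interface exposes):

```python
import torch

# Sketch of the two encoding stages (assumes the bart.large hub model).
bart = torch.hub.load('pytorch/fairseq', 'bart.large')

text = "A man in suburban Boston is selling snow online to customers in warmer states."

# Stage 1: GPT-2 BPE maps raw text to space-separated BPE token IDs.
# This is what multiprocessing_bpe_encoder writes into the *.bpe.* files.
bpe_ids = bart.bpe.encode(text)    # e.g. "32 582 287 ..."

# Stage 1 is inverted by the BPE decoder, not by bart.decode():
print(bart.bpe.decode(bpe_ids))    # recovers the original text

# Stage 2: fairseq-preprocess (with --srcdict/--tgtdict dict.txt) treats each
# BPE ID as a symbol and maps it to a fairseq Dictionary index. bart.decode()
# inverts both stages and expects Dictionary indices, so calling it directly
# on raw BPE IDs from a *.bpe.* file produces garbage.
```

So decoding the *.bpe.* files with bart.bpe.decode(), or binarizing them with fairseq-preprocess first and then calling bart.decode() on the Dictionary indices, should give back the original text.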
@wonjininfo, glad you got it working 😃