[BART] issues on BPE preprocess (examples.roberta.multiprocessing_bpe_encoder)
See original GitHub issue

Hi, congratulations on the great work! I appreciate you all for making these resources publicly available.
I was following the README for fine-tuning BART on the CNN/DM task. While performing step 2) BPE preprocess, I ran into some problems.
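For reference, this is step 2) from the README as I ran it (encoder.json and vocab.bpe are the GPT-2 BPE files the README downloads; paths follow the README's layout and may differ in other setups):

```bash
# BPE-encode every split with the GPT-2 encoder, one output line per input line
for SPLIT in train val
do
  for LANG in source target
  do
    python -m examples.roberta.multiprocessing_bpe_encoder \
      --encoder-json encoder.json \
      --vocab-bpe vocab.bpe \
      --inputs "cnn_dm/$SPLIT.$LANG" \
      --outputs "cnn_dm/$SPLIT.bpe.$LANG" \
      --workers 60 \
      --keep-empty
  done
done
```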
Here are the details of my problems:
- I found that the line counts of train.bpe.source and train.bpe.target are not identical. Both should be 287227, but processing train.source produced 247 extra lines:
```
ubuntu@server:~/fairseq/cnn_dm$ wc -l *
   11490 test.source
   11490 test.target
  287474 train.bpe.source  <= not matching
  287227 train.bpe.target
  287227 train.source
  287227 train.target
   13368 val.bpe.source
   13368 val.bpe.target
   13368 val.source
   13368 val.target
  200000 vocab
 1425607 total
```
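To narrow down where the 247 extra lines come from, I tried scanning train.source for characters that could be treated as extra line breaks somewhere in the pipeline. This is only a guess at the cause, and the check below is a hypothetical sketch, not a confirmed diagnosis:

```python
# Hypothetical check for problem 1: look for characters in train.source that
# downstream code might interpret as line breaks, which could inflate the
# line count of train.bpe.source. A guess, not a confirmed root cause.
SUSPECTS = {'\r': 'CR', '\x0b': 'VT', '\x0c': 'FF', '\u2028': 'LS', '\u2029': 'PS'}

with open('train.source', encoding='utf-8', newline='\n') as f:
    for lineno, line in enumerate(f, 1):
        hits = [name for ch, name in SUSPECTS.items() if ch in line]
        if hits:
            print(f'line {lineno}: contains {hits}: {line[:60]!r}')
```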
- While trying to track down problem 1, I ran into another problem that seems closely related. When I check val.bpe.target, the first BPE-encoded sentence looks like this:

32 582 287 20154 6182 318 6301 6729 2691 284 4297 287 23254 2585 13 1114 720 4531 11 339 481 4074 718 8059 286 6729 287 281 47869 42378 305 6513 321 3091 13

Using bart.decode(), I can decode it, and the result is:

are pay As spellszi If km wages Women familybut Asolia Con for idea global85 in win free 51il temporarily For wages AsasAlternativelyStage W Fin 0 sites for

whereas it should be:

A man in suburban Boston is selling snow online to customers in warmer states. For $89, he will ship 6 pounds of snow in an insulated Styrofoam box.

The same problem applies to the other BPE-processed files.
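For completeness, this is roughly how I produced the garbled output above (a sketch; I loaded the released bart.large checkpoint through torch.hub, and the exact loading step may differ):

```python
import torch

# Rough reproduction of problem 2 (sketch; model loading may differ).
bart = torch.hub.load('pytorch/fairseq', 'bart.large')
bart.eval()

# Read the first line of BPE token IDs from the preprocessed target file.
with open('val.bpe.target') as f:
    first_line = f.readline().split()      # ['32', '582', '287', ...]

ids = torch.tensor([int(tok) for tok in first_line])
print(bart.decode(ids))                    # prints the garbled text above
```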
It seems I am missing something. I am checking this on:
- Python 3.6
- stanford-corenlp-3.7.0.jar (3.9 also checked)
- PyTorch 1.0
- CUDA 10.0
- Ubuntu 16.04
Would you share any thoughts on the matter? It would help me a lot. Once again, thank you very much! WonJin
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
After checking, I'm also facing problem 2. (However, I don't hit problem 1.)
@Colanim yes, that's correct. It's a two-stage encoding process: first BPE encoding, followed by encoding with the fairseq Dictionary.
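A sketch of the distinction, assuming the bart.large hub model (bart.bpe is the GPT-2 BPE wrapper the hub interface exposes):

```python
import torch

# Sketch of the two encoding stages (assumes the bart.large hub model).
bart = torch.hub.load('pytorch/fairseq', 'bart.large')

text = "A man in suburban Boston is selling snow online to customers in warmer states."

# Stage 1: GPT-2 BPE maps raw text to space-separated BPE token IDs.
# This is what multiprocessing_bpe_encoder writes into the *.bpe.* files.
bpe_ids = bart.bpe.encode(text)    # e.g. "32 582 287 ..."

# Stage 1 is inverted by the BPE decoder, not by bart.decode():
print(bart.bpe.decode(bpe_ids))    # recovers the original text

# Stage 2: fairseq-preprocess (with --srcdict/--tgtdict dict.txt) treats each
# BPE ID as a symbol and maps it to a fairseq Dictionary index. bart.decode()
# inverts both stages and expects Dictionary indices, so calling it directly
# on raw BPE IDs from a *.bpe.* file produces garbage.
```

So decoding the *.bpe.* files with bart.bpe.decode(), or binarizing them with fairseq-preprocess first and then calling bart.decode() on the Dictionary indices, should give back the original text.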
@wonjininfo, glad you got it working 😃