Output of the decode seems perplexing
When running the decode, I expected the output to resemble the target in the input file. I created input.json as shown in the README.md in s2s-ft:
{"src": "Messages posted on social media claimed the user planned to `` kill as many people as possible ''", "tgt": "Threats to kill pupils in a shooting at a Blackpool school are being investigated by Lancashire police ."}
{"src": "Media playback is unsupported on your device", "tgt": "A slide running the entire length of one of the steepest city centre streets in Europe has been turned into a massive three-lane water adventure ."}
{"src": "Chris Erskine crossed low for Kris Doolan to tap home and give the Jags an early lead .", "tgt": "Partick Thistle will finish in the Scottish Premiership 's top six for the first time after beating Motherwell"}
I ran the following inside the unzipped pre-trained model folder so that the file names are as expected:
mv minilm-l12-h384-uncased-config.json config.json
mv minilm-l12-h384-uncased.bin pytorch_model.bin
I also passed map_location='cpu' to torch.load at modeling_decoding.py:784:
state_dict = torch.load(weights_path, map_location='cpu')
Then I ran the command below. Note that I removed --fp16 because I am running on CPU:
MODEL_PATH=MiniLM-L12-H384-uncased
VOCAB_PATH=MiniLM-L12-H384-uncased
SPLIT=validation
INPUT_JSON=input.json
export CUDA_VISIBLE_DEVICES=0
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
python s2s-ft/decode_seq2seq.py \
--model_type minilm --tokenizer_name minilm-l12-h384-uncased \
--input_file ${INPUT_JSON} --split dev --do_lower_case \
--model_path ${MODEL_PATH} --max_seq_length 512 --max_tgt_length 48 --batch_size 32 --beam_size 5 \
--length_penalty 0 --forbid_duplicate_ngrams --mode s2s --forbid_ignore_word "." --need_score_traces
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
05/02/2020 23:43:03 - INFO - transformers.tokenization_utils - loading file https://unilm.blob.core.windows.net/ckpt/minilm-l12-h384-uncased-vocab.txt from cache at /home/john/.cache/torch/transformers/c6a0d170b6fcc6d023a402d9c81e5526a82901ffed3eb6021fb0ec17cfd24711.0af242a3765cd96e2c6ad669a38c22d99d583824740a9a2b36fe3ed5a07d0503
05/02/2020 23:43:03 - INFO - main - Read decoding config from: MiniLM-L12-H384-uncased/config.json
MiniLM-L12-H384-uncased
05/02/2020 23:43:03 - INFO - main - ***** Recover model: MiniLM-L12-H384-uncased *****
05/02/2020 23:43:03 - INFO - s2s_ft.modeling_decoding - Model config {
"attention_probs_dropout_prob": 0.1,
"ffn_type": 0,
"fp32_embedding": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 384,
"initializer_range": 0.02,
"intermediate_size": 1536,
"label_smoothing": null,
"max_position_embeddings": 512,
"new_pos_ids": false,
"no_segment_embedding": false,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"num_qkv": 0,
"relax_projection": 0,
"seg_emb": false,
"source_type_id": 0,
"target_type_id": 1,
"task_idx": null,
"type_vocab_size": 2,
"vocab_size": 30522
}
05/02/2020 23:43:04 - INFO - s2s_ft.utils - Creating features from dataset file at input.json
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1357.53it/s]
0%| | 0/1 [00:00<?, ?it/s]
05/02/2020 23:43:04 - INFO - s2s_ft.s2s_loader - Input src = [CLS] messages posted on social media claimed the user planned to kill as many people as possible ’ ’ [SEP]
05/02/2020 23:43:04 - INFO - s2s_ft.s2s_loader - Input src = [CLS] chris erskine crossed low for kris doo ##lan to tap home and give the ja ##gs an early lead . [SEP]
05/02/2020 23:43:04 - INFO - s2s_ft.s2s_loader - Input src = [CLS] media playback is un ##su ##pp ##orted on your device [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
05/02/2020 23:43:12 - INFO - main - 0 = scroll scroll scroll hedge scroll logic logic inclined logic thoughts sides table table table punt table punt punt table dive punt dive punt punt punt self self self financial self self investor self selfiti hedge self self sentiment hedge relations hedge friends self metro metroulul
05/02/2020 23:43:12 - INFO - main - 2 = operationsace sides sides sidesaceaceace sides header header headerhab header headerchai header headersus headerchaiitiitiaceace hedgeves financial self selface analysts admitted admitted admitteditihelace hedge self self self himself himself self metro self metro
05/02/2020 23:43:12 - INFO - main - 1 = scroll scroll shouldn plymouth scroll scrollgamgam scrollgam scroll scroll scroll briefs flourish ground volleyball should shouldnow logic logic tacticsaceaceace thought mirdehel profile profile counterhelhel profile bar portfolio counter portfolio portfolio portfolio def portfolio portfolio hedge portfolio portfolio
The last few lines of the output are gibberish that looks nothing like a sentence. Is there a parameter that I missed or specified incorrectly? When I tried UniLM v1 some time ago, the input format was different, but the output was decent and showed some attributes of summarization.
Issue Analytics
- Created 3 years ago
- Comments:5 (3 by maintainers)
Top GitHub Comments
Hi @johnyoonh,
It seems that MODEL_PATH is not set correctly. Please change MODEL_PATH to first/ckpt-108000.
Thanks
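The reply above suggests that decoding should load a fine-tuned checkpoint directory rather than the pre-trained MiniLM folder. As a rough sanity check before decoding, the sketch below reports which expected files are absent from MODEL_PATH. The helper name is hypothetical, and the assumption that a fine-tuned checkpoint directory uses the same config.json and pytorch_model.bin names as in the question is not confirmed by the thread.

```python
from pathlib import Path

# File names taken from the rename step in the question; whether a
# fine-tuned checkpoint directory uses exactly these names is an assumption.
EXPECTED_FILES = ("config.json", "pytorch_model.bin")


def missing_checkpoint_files(model_path):
    """Return the expected file names that are absent from model_path.

    Hypothetical helper for illustration; it is not part of s2s-ft.
    """
    root = Path(model_path)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list only means the files exist. It cannot tell a fine-tuned checkpoint from the pre-trained weights, so the path itself still needs to point at the fine-tuning output directory.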
I fine-tuned on an AWS g4dn.12xlarge (4x T4) in a conda environment for 17 hours.
I did not end up using Docker because I was facing some issues. Instead, I installed the packages used by the pytorch/pytorch:1.2-cuda10.0-cudnn7-devel image:
conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch
This is the output I got: