Problem with handling encoding failure
I noticed that the method `_convert_word_to_char_ids` in bilm/data.py can't handle encoding errors under certain conditions. The problem is in the code chunk below:
```python
word_encoded = word.encode('utf-8', 'ignore')[:(self.max_word_length-2)]
code[0] = self.bow_char
for k, chr_id in enumerate(word_encoded, start=1):
    code[k] = chr_id
code[k + 1] = self.eow_char
```
As you can see, if a token consists of a single character that fails to encode, the `word_encoded` variable will be an empty byte string. The for-loop over `enumerate` then exits without ever initializing the `k` variable, so the last line fails with the following error:

```
UnboundLocalError: local variable 'k' referenced before assignment
```
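One possible fix is to initialize `k` before the loop so that an empty encoding still produces a valid begin-of-word/end-of-word pair. The sketch below is a standalone, simplified version of the loop (the `max_word_length` value and the `bow`/`eow`/`pad` ids are made-up placeholders, not the library's actual constants):

```python
# Simplified, hypothetical reconstruction of the loop in
# _convert_word_to_char_ids, with the UnboundLocalError fixed.
MAX_WORD_LENGTH = 10
BOW_CHAR, EOW_CHAR, PAD_CHAR = 258, 259, 260  # placeholder ids

def convert_word_to_char_ids(word):
    code = [PAD_CHAR] * MAX_WORD_LENGTH
    # errors='ignore' silently drops bytes that cannot be encoded,
    # so a single unencodable character yields b''.
    word_encoded = word.encode('utf-8', 'ignore')[:MAX_WORD_LENGTH - 2]
    code[0] = BOW_CHAR
    k = 0  # ensures k is defined even when word_encoded is empty
    for k, chr_id in enumerate(word_encoded, start=1):
        code[k] = chr_id
    code[k + 1] = EOW_CHAR  # for an empty word this lands right after BOW
    return code
```

With this guard, a token that encodes to nothing simply becomes a bare `[BOW, EOW, PAD, ...]` sequence instead of raising.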
This could be handled with an exception handler that flags the failed token and prints a warning. Since I haven't gone deep into the specifics of the library, I am not sure whether that is the proper solution, so I thought I might as well bring it to your attention.
EDIT:
Another thing I have noticed is that empty files in the training data folder cause training to fail once they are processed; training can run for days, only to crash on an empty file. To save users the trouble, it would be very kind of you to either warn them that empty files cause a problem, or add some logic to safely skip them.
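A cheap mitigation, independent of the library, is a pre-flight scan that filters out empty files before training starts. This is only a sketch; the function name and warning format are my own, not part of bilm-tf:

```python
import os

def non_empty_files(data_dir):
    """Return sorted paths of non-empty regular files in data_dir,
    warning about (and skipping) any empty ones."""
    paths = []
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        if not os.path.isfile(path):
            continue
        if os.path.getsize(path) == 0:
            print(f"Warning: skipping empty training file {path}")
            continue
        paths.append(path)
    return paths
```

Running this once before kicking off training would catch the failure mode up front instead of days in.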
Issue Analytics
- Created 5 years ago
- Comments: 6
Top GitHub Comments
@qujinqiang This blog post may be helpful for you: http://www.linzehui.me/2018/08/12/碎片知识点/如何将ELMo词向量用于中文/ (in Chinese; it covers how to use ELMo word vectors for Chinese text).
@FynnYoung thanks!