Unicode issues with tail plugin
I’m using irssi to write log files for channels, and the tail plugin to parse entries from those files. The log files are encoded as UTF-8. Until recently this worked perfectly, but now the plugin aborts with the error “Failed to decode file using utf-8. Check encoding” any time a Unicode character appears in the log.
Investigating further, I found that it is caused by an exception raised in native_str_to_text():
try:
    line = native_str_to_text(line, encoding=encoding)
except UnicodeError:
    raise plugin.PluginError('Failed to decode file using %s. Check encoding.' % encoding)
if PY2:
    def native_str_to_text(string, **kwargs):
        if 'encoding' not in kwargs:
            kwargs['encoding'] = 'ascii'
        return string.decode(**kwargs)
else:
    def native_str_to_text(string, **kwargs):
        return string
Since I’m running Python 2.7, native_str_to_text() essentially just calls string.decode(). And because the error message contains our preferred encoding, I think we can be fairly certain that we are passing the correct encoding string to the decode method.
I extended the code to print the full exception, and the result was the following output:
'ascii' codec can't encode character u'\u25e2' in position 21: ordinal not in range(128)
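The odd part is that the message talks about *encoding* with ascii even though decode() was called with utf-8. One common way this happens in Python 2 is that the value being decoded is already a unicode object: unicode.decode() first encodes the string with the default ascii codec before decoding it, and that implicit step is what fails. The underlying error is easy to reproduce (a hypothetical snippet, not FlexGet code):

```python
# Encoding a non-ASCII character with the ascii codec raises the same
# error seen in the log. In Python 2, u'...'.decode('utf-8') performs
# exactly this implicit ascii encode before decoding.
try:
    u'\u25e2'.encode('ascii')
except UnicodeEncodeError as exc:
    print(exc)
```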
The strange thing about all this is that when I cloned the git repository and ran the code from there instead of from the installed package, everything worked. So I think this is some locale/environment trouble. I’ve tried recreating the virtualenv, but it had no effect. I’ve also tried setting LC_ALL, PYTHONIOENCODING, and LANG to UTF-8, with no luck.
I was able to recreate the exception in some test code by explicitly setting PYTHONIOENCODING=ascii. So there is definitely some issue with how Python performs decoding in my environment.
However, what did fix the problem was changing tail.py so that it opens the file in binary mode instead, with a simple change:
- with open(filename, 'r') as file:
+ with open(filename, 'rb') as file:
I’m new to Python, so I’m not confident this is a good solution: looking at how native_str_to_text() is defined, it could cause problems for Python 3 users, as I’ve heard that Unicode string handling changed between v2 and v3. Another solution that came to mind was to pass an encoding to the open() call for the file.
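A rough sketch of the binary-mode approach (the helper name and error handling are illustrative, not FlexGet’s actual code): opening with 'rb' yields byte strings on both Python 2 and 3, so the explicit per-line decode behaves identically on each.

```python
def read_tail_lines(filename, encoding='utf-8'):
    # Binary mode gives bytes on Python 2 and 3 alike, so decoding
    # is always an explicit, predictable step with the chosen codec.
    with open(filename, 'rb') as handle:
        for raw_line in handle:
            try:
                yield raw_line.decode(encoding)
            except UnicodeDecodeError:
                raise ValueError(
                    'Failed to decode file using %s. Check encoding.' % encoding)
```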
Config:
taskX:
  tail:
    file: ~/.irssi/logfromchannel.log
    encoding: utf-8
    entry:
      title: nick:\s(.*?)\s:\shttp://.*
      url: nick:\s.*?\s:\s(http://.*)
    format:
      url: '%(url)s'
(... and other settings for the task, not related to tail)
Log:
2016-07-06 10:05 CRITICAL plugin taskX Failed to decode file using utf-8. Check encoding.
2016-07-06 10:05 WARNING task taskX Aborting task (plugin: tail)
Additional information:
- Flexget Version: 2.1.6
- Python Version: 2.7.10
- Installation method: Standard installation from released package using virtualenv and pip
- OS and version: openSUSE 13.1 Linux 3.12.57-44-default #1 SMP Wed Apr 6 09:18:15 UTC 2016 (9b4534f) x86_64 x86_64 x86_64 GNU/Linux
Issue Analytics
- Created 7 years ago
- Comments: 7 (5 by maintainers)
Top GitHub Comments
I think the proper fix is this: switch to io.open, which works the same across Python versions and is the default implementation of open on Python 3. We should make the default encoding utf-8 instead of ascii, which is almost always a saner choice. We shouldn’t have to deal with line-by-line decoding anymore, and I don’t think we ever needed the native_str_to_text utility, as the types here should already be consistent across Python versions.

Same happens to the exec plugin:
BUG: Unhandled error in plugin exec: 'ascii' codec can't encode character u'\xf1' in position 104: ordinal not in range(128)
It was working fine until recently, but now it crashes if the given path contains Unicode characters (e.g. “ñ”).
Running version 2.1.5 (unable to upgrade to the latest because the sqlalchemy update fails) on Windows 8.1 x64 + Python 2.7.10.
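Going back to the io.open fix suggested in the first comment, a minimal sketch (the helper name is illustrative, not FlexGet’s actual code) could look like this. io.open is the built-in open on Python 3 and is also available on Python 2; it decodes while reading, so every line comes back as text and no native_str_to_text() step is needed:

```python
import io

def read_text_lines(filename, encoding='utf-8'):
    # io.open returns a text-mode file object on both Python 2 and 3,
    # decoding with the given codec as it reads.
    with io.open(filename, 'r', encoding=encoding) as handle:
        return handle.readlines()
```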