Roundtripping error for unicode ndarrays with char.encode and char.decode using UTF-32LE and UTF-16LE encodings, but not their big endian versions
numpy.char.encode and numpy.char.decode should effectively be inverses of each other when given the same encoding. However, I have found this not to be the case. It seems that truncating nulls ('\x00') from the end of the individual elements of bytes ndarrays when accessing individual elements is causing the problem.
For reference, I am working with numpy 1.11.0 and Python 3.4.2 and 2.7.9 on a 64-bit little endian machine (Intel i5, with 64-bit Python environments as well) running Debian. I suspect the same result would occur on a big endian machine, but I lack a machine or virtual machine to test it (my attempts at getting such a virtual machine running have not worked).
If I take a unicode ndarray, run it through numpy.char.encode with UTF-32BE or UTF-16BE encoding, and then run the result through numpy.char.decode with the same encoding, I get the original array back. The following code demonstrates this in both Python 2.7 and Python 3.4, using UTF-16BE
import numpy as np
encoding = 'UTF-16BE'
a = np.array([[u'abc', u'012', u'ABC\U00010437'], [u'W', u'Q', u'Z']], dtype='U4')
b = np.char.encode(a, encoding)
np.char.decode(b, encoding)
It works, giving the following output
Python 3.4
array([['abc', '012', 'ABC𐐷'],
['W', 'Q', 'Z']],
dtype='<U4')
Python 2.7
array([[u'abc', u'012', u'ABC\U00010437'],
[u'W', u'Q', u'Z']],
dtype='<U4')
But if I change the encoding to little endian ('UTF-32LE' or 'UTF-16LE'), I get the following errors instead
Python 3.4
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-1-b92715410044> in <module>()
6 b = np.char.encode(a, codec)
7
----> 8 np.char.decode(b, codec)
/home/user/.local/lib/python3.4/site-packages/numpy/core/defchararray.py in decode(a, encoding, errors)
503 """
504 return _to_string_or_unicode_array(
--> 505 _vec_string(a, object_, 'decode', _clean_args(encoding, errors)))
506
507
/usr/lib/python3.4/encodings/utf_16_le.py in decode(input, errors)
14
15 def decode(input, errors='strict'):
---> 16 return codecs.utf_16_le_decode(input, errors, True)
17
18 class IncrementalEncoder(codecs.IncrementalEncoder):
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x63 in position 4: truncated data
Python 2.7
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-1-b92715410044> in <module>()
6 b = np.char.encode(a, codec)
7
----> 8 np.char.decode(b, codec)
/home/user/.local/lib/python2.7/site-packages/numpy/core/defchararray.pyc in decode(a, encoding, errors)
503 """
504 return _to_string_or_unicode_array(
--> 505 _vec_string(a, object_, 'decode', _clean_args(encoding, errors)))
506
507
/usr/lib/python2.7/encodings/utf_16_le.pyc in decode(input, errors)
14
15 def decode(input, errors='strict'):
---> 16 return codecs.utf_16_le_decode(input, errors, True)
17
18 class IncrementalEncoder(codecs.IncrementalEncoder):
UnicodeDecodeError: 'utf16' codec can't decode byte 0x63 in position 4: truncated data
If I use the little endian encoding ('UTF-32LE' or 'UTF-16LE') and look at a and b more closely, the reason for the error becomes apparent. The dtypes of a and b are '<U4' and 'S10' respectively. But if I look at the dtypes of element [1, 0], I get '<U1' and 'S1' respectively.
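For concreteness, a short sketch of that inspection (reusing a from the example above, but encoding with UTF-16LE; the comments show the dtypes I get):

b = np.char.encode(a, 'UTF-16LE')
a.dtype, b.dtype              # (dtype('<U4'), dtype('S10'))
a[1, 0].dtype, b[1, 0].dtype  # (dtype('<U1'), dtype('S1'))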
The trailing null characters ('\x00') are dropped when accessing the individual elements. When trying to decode from 'UTF-32LE' or 'UTF-16LE', numpy.char.decode is grabbing data with an odd number of bytes (the required trailing nulls having been dropped) and trying to decode it, which can't work because these encodings work with quads or pairs of bytes (thus a byte count divisible by 4 or 2).
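The failure can be reproduced on a single element (a sketch, reusing the UTF-16LE-encoded b from above):

b[0, 0]                     # b'a\x00b\x00c' -- the trailing '\x00' of the 6-byte
                            # UTF-16LE encoding of 'abc' is stripped, leaving 5 bytes
b[0, 0].decode('UTF-16LE')  # UnicodeDecodeError: ... byte 0x63 in position 4: truncated data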
This indicates that neither numpy.char.decode nor the decode function that numpy.char.decode dispatches to elementwise is padding the ends of the individual string elements with the required null bytes.
This only shows up in little endian encodings, since those are the ones that produce trailing nulls. Big endian encodings don't have that problem.
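The asymmetry is easy to see on a single character with plain Python (no numpy involved):

u'W'.encode('UTF-16BE')  # b'\x00W' -- the null byte leads, so nothing is lost when
                         # numpy strips trailing nulls from the element
u'W'.encode('UTF-16LE')  # b'W\x00' -- the null byte trails, which is exactly what
                         # numpy's 'S' dtype drops on element access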
Top GitHub Comments
I have developed a workaround. Essentially, to decode back after doing a little endian encoding, one obtains a uint16/uint32 view of the encoded array, byteswaps it, takes a view of that using the dtype of the encoded array, and then decodes it with the big endian version of the encoding.
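A sketch of that workaround for the UTF-16LE case (the variable names c and d are mine; for UTF-32LE one would view as np.uint32 and decode with 'UTF-32BE' instead):

b = np.char.encode(a, 'UTF-16LE')

# View the encoded bytes as uint16, swap the bytes within each code unit, and
# view the result with the original encoded dtype; the data is now UTF-16BE.
c = b.view(np.uint16).byteswap().view(b.dtype)

# Decoding with the big endian codec now roundtrips correctly.
d = np.char.decode(c, 'UTF-16BE')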
If you run a == d, one gets True for every element, confirming that the original array is recovered.

@frejanordsiek could you suggest a sentence or two? A pull request against the docs would be great, but even a draft of something here would be nice