
Roundtripping error for unicode ndarrays with char.encode and char.decode with UTF-32LE and UTF-16LE encodings but not their big endian versions


numpy.char.encode and numpy.char.decode should effectively be inverses of each other when given the same encoding. However, I have found this not to be the case. The problem appears to be that trailing null bytes ('\x00') are truncated from the elements of bytes ndarrays whenever individual elements are accessed.

For reference, I am working with numpy 1.11.0 and Python 3.4.2 and 2.7.9 on a 64-bit little endian machine (Intel i5, 64-bit Python environments) running Debian. I suspect the same result might occur on a big endian machine, but I lack a machine or virtual machine to test it (my attempts at getting such a virtual machine running have not worked).

If I take a unicode ndarray, run it through numpy.char.encode with the UTF-32BE or UTF-16BE encoding, and then run the result through numpy.char.decode with the same encoding, I get the original array back. The following code works in both Python 2.7 and Python 3.4; the example uses UTF-16BE.

import numpy as np

encoding = 'UTF-16BE'

# Unicode array that includes a character outside the BMP (U+10437)
a = np.array([[u'abc', u'012', u'ABC\U00010437'], [u'W', u'Q', u'Z']], dtype='U4')
b = np.char.encode(a, encoding)

# Decoding with the same encoding should return the original array
np.char.decode(b, encoding)

It works, giving the following output

Python 3.4

array([['abc', '012', 'ABC𐐷'],
       ['W', 'Q', 'Z']], 
      dtype='<U4')

Python 2.7

array([[u'abc', u'012', u'ABC\U00010437'],
       [u'W', u'Q', u'Z']], 
      dtype='<U4')

But, if I change the encoding to little endian ('UTF-32LE' or 'UTF-16LE'), I get the following errors instead

Python 3.4

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-1-b92715410044> in <module>()
      6 b = np.char.encode(a, codec)
      7 
----> 8 np.char.decode(b, codec)

/home/user/.local/lib/python3.4/site-packages/numpy/core/defchararray.py in decode(a, encoding, errors)
    503     """
    504     return _to_string_or_unicode_array(
--> 505         _vec_string(a, object_, 'decode', _clean_args(encoding, errors)))
    506 
    507 

/usr/lib/python3.4/encodings/utf_16_le.py in decode(input, errors)
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_16_le_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x63 in position 4: truncated data

Python 2.7

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-1-b92715410044> in <module>()
      6 b = np.char.encode(a, codec)
      7 
----> 8 np.char.decode(b, codec)

/home/user/.local/lib/python2.7/site-packages/numpy/core/defchararray.pyc in decode(a, encoding, errors)
    503     """
    504     return _to_string_or_unicode_array(
--> 505         _vec_string(a, object_, 'decode', _clean_args(encoding, errors)))
    506 
    507 

/usr/lib/python2.7/encodings/utf_16_le.pyc in decode(input, errors)
     14 
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_16_le_decode(input, errors, True)
     17 
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeDecodeError: 'utf16' codec can't decode byte 0x63 in position 4: truncated data

If I use a little endian encoding ('UTF-32LE' or 'UTF-16LE') and look at a and b more closely, the reason for the error becomes apparent. The dtypes of a and b are '<U4' and 'S10' respectively. But if I look at the dtypes of element [1, 0], I get '<U1' and 'S1' respectively.
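This can be checked directly; a quick inspection, reproducing the arrays from above:

```python
import numpy as np

a = np.array([[u'abc', u'012', u'ABC\U00010437'], [u'W', u'Q', u'Z']], dtype='U4')
b = np.char.encode(a, 'UTF-16LE')

print(a.dtype, b.dtype)              # <U4 |S10
print(a[1, 0].dtype, b[1, 0].dtype)  # <U1 |S1
print(b[1, 0])                       # b'W' -- the \x00 half of the code unit is gone
```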

The trailing null characters ('\x00') are dropped when individual elements are accessed. When decoding from 'UTF-32LE' or 'UTF-16LE', numpy.char.decode is therefore handed data with the required trailing nulls missing — for UTF-16, an odd number of bytes — which cannot work because these encodings operate on code units of four or two bytes (so the byte count must be divisible by 4 or 2).
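The failure can be reproduced in plain Python, without numpy, by stripping the trailing null from a UTF-16LE string before decoding:

```python
raw = u'abc'.encode('UTF-16LE')   # b'a\x00b\x00c\x00' -- 6 bytes, 2 per code unit
trimmed = raw.rstrip(b'\x00')     # b'a\x00b\x00c' -- 5 bytes, not whole code units
try:
    trimmed.decode('UTF-16LE')
    err = None
except UnicodeDecodeError as e:
    err = e
print(err)  # same "truncated data" failure as in the tracebacks above
```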

This indicates that neither numpy.char.decode nor the element-wise decode function it dispatches to is padding the individual string elements back out with the required null bytes.

This only shows up with little endian encodings, since those are the ones that produce trailing null bytes; in big endian encodings the null bytes of ASCII-range characters lead rather than trail, so nothing gets stripped.
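One way to confirm this explanation is to re-pad each encoded element to a whole number of code units before decoding it. A sketch — decode_padded is a hypothetical helper written for this illustration, not a numpy API:

```python
import numpy as np

def decode_padded(b, encoding, unit):
    # Pad each element back out to a multiple of the code-unit size
    # (2 bytes for UTF-16, 4 for UTF-32) before handing it to decode.
    out = [(el + b'\x00' * (-len(el) % unit)).decode(encoding)
           for el in b.ravel().tolist()]
    return np.array(out).reshape(b.shape)

a = np.array([[u'abc', u'012', u'ABC\U00010437'], [u'W', u'Q', u'Z']], dtype='U4')
b = np.char.encode(a, 'UTF-16LE')
d = decode_padded(b, 'UTF-16LE', unit=2)  # (a == d).all() is True
```

Note this relies on the assumption that no element legitimately ends in U+0000, since those nulls are indistinguishable from the stripped padding.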

Issue Analytics

  • State: open
  • Created: 7 years ago
  • Comments: 15 (11 by maintainers)

Top GitHub Comments

1 reaction

frejanordsiek commented, May 8, 2016

I have developed a workaround. Essentially, to decode back after a little endian encoding, one takes a uint16/uint32 view (matching the code-unit size), byteswaps it, views the result using the dtype of the encoded array, and then decodes it with the big endian version of the encoding.

import numpy as np

encoding = 'UTF-16LE'

a = np.array([[u'abc', u'012', u'ABC\U00010437'], [u'W', u'Q', u'Z']], dtype='U4')
b = np.char.encode(a, encoding)

# View as 16-bit code units, swap the bytes within each unit, and view the
# result as the original bytes dtype: the data is now big endian.
c = b.view(np.uint16).byteswap().view(b.dtype)

# Big endian data has leading rather than trailing nulls, so decoding works.
d = np.char.decode(c, 'UTF-16BE')

Running a == d then gives

array([[ True,  True,  True],
       [ True,  True,  True]], dtype=bool)
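The same trick should carry over to UTF-32LE with a 32-bit view, since UTF-32 code units are four bytes wide — a sketch under that assumption:

```python
import numpy as np

a = np.array([[u'abc', u'012', u'ABC\U00010437'], [u'W', u'Q', u'Z']], dtype='U4')
b = np.char.encode(a, 'UTF-32LE')

# Swap bytes within each 4-byte code unit, then decode as big endian.
c = b.view(np.uint32).byteswap().view(b.dtype)
d = np.char.decode(c, 'UTF-32BE')
```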
0 reactions

mattip commented, May 13, 2018

@frejanordsiek could you suggest a sentence or two? A pull request against the docs would be great, but even a draft of something here would be nice.
