Normalize characters with combining marks to precomposed characters
I ran into a slightly odd problem that I wanted to solve. Here it is:
I have a PDF file with German umlauts (üöäÜÖÄ), and if I copy and paste them into the TinyMCE editor in WordPress, I get the base vowel (uoaUOA) followed by a combining diaeresis (http://www.fileformat.info/info/unicode/char/0308/index.htm) instead of a single precomposed character.
This results in some problems:
Search for words with umlauts doesn't work
Proofreading fails
W3C validation fails with the warning "Text run is not in Unicode Normalization Form C." because precomposed characters are preferred (see: http://www.w3.org/International/docs/charmod-norm/#choice-of-normalization-form)
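To illustrate why search breaks, here is a minimal JavaScript sketch comparing the two forms of "ü". The variable names and the sample word are only for illustration and are not taken from the affected content:

```javascript
// Precomposed "ü" is a single code point (U+00FC).
const precomposedU = "\u00FC";
// The pasted form is "u" followed by the combining diaeresis (U+0308).
const decomposedU = "u\u0308";

// Both render the same, but they are different strings.
console.log(precomposedU === decomposedU);              // false
console.log(precomposedU.length, decomposedU.length);   // 1 2

// A search for the decomposed form misses text stored in precomposed form.
const storedWord = "M\u00FCnchen";
console.log(storedWord.includes("Mu\u0308n"));          // false
```

A plain string comparison therefore treats the pasted text and normally typed text as different words.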
Solution: I made a proof-of-concept with the “content_save_pre” filter in WordPress and it works. In this proof-of-concept I just replaced the two characters with the precomposed character:
$content = str_replace( "a\xCC\x88", "ä", $content );
$content = str_replace( "o\xCC\x88", "ö", $content );
$content = str_replace( "u\xCC\x88", "ü", $content );
$content = str_replace( "A\xCC\x88", "Ä", $content );
$content = str_replace( "O\xCC\x88", "Ö", $content );
$content = str_replace( "U\xCC\x88", "Ü", $content );
If we could rely on PHP 5.3+ (I know we can’t, because WP still supports PHP 5.2), we could use a built-in function for that: http://php.net/manual/de/normalizer.normalize.php
So the above code would be just one line and much more general: $content = normalizer_normalize($content, Normalizer::FORM_C );
Fun fact: I could only reproduce the problem on Mac OS X (Lion 10.7.5 and 10.9.5); on Ubuntu 14.04 and Windows 7 I couldn’t reproduce it.
Description of problem: Pasting text that contains non-precomposed (decomposed) characters
Steps to reproduce:
- Paste text from the attached PDF (attaching it didn’t work because the file extension is not allowed 😦)
Expected result: Precomposed characters
Actual result: vowels plus diaeresis (vowel plus combining character)
I submitted a bug report (including a patch) for WordPress, but maybe this can be solved better in TinyMCE.
Legacy information imported from TinyMCE bug tracker:
#T7243 posted by zodiac1978
Tags: [firefox msie safari chrome] Status: Open Resolution: None Attached URL: none
With ES6 we have a normalize function in JS: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize
And there is a polyfill for older browsers: https://github.com/walling/unorm
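A minimal sketch of how that could look, assuming the pasted text is available as a plain string. The helper name `toNFC` is hypothetical, and how it would be hooked into TinyMCE's paste handling is not shown here:

```javascript
// Hypothetical helper: convert text to Unicode Normalization Form C (NFC).
function toNFC(text) {
  // Use the native ES6 method when the runtime provides it.
  if (typeof String.prototype.normalize === "function") {
    return text.normalize("NFC");
  }
  // Assumption: without native support a polyfill would have to be loaded;
  // this sketch simply returns the input unchanged in that case.
  return text;
}

const pastedText = "Mu\u0308nchen";      // "u" followed by U+0308
const cleanedText = toNFC(pastedText);

console.log(cleanedText === "M\u00FCnchen"); // true
console.log(cleanedText.length);             // 7
```

Unlike the character-by-character replacement, this covers every combining sequence that has a precomposed equivalent, not just the six German umlauts.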
Still valid issue. Please re-open.