Improve bidi support
Hello,
We’re considering using CodeMirror as an XML editor in Knora, but the main obstacle for us would be the incomplete bidi support. We’d be interested in helping to improve it, at least by clarifying what doesn’t work and by testing fixes. (I’m an Arabic-speaking developer.)
The first thing I noticed when trying the XML autocomplete demo is that CodeMirror has the classic problem of jumbling text and punctuation in bidirectional text with markup. As the W3C says, in Problems with bidirectional source text in markup:
Unless your editor recognizes markup in source text as not being normal text, the strongly typed letters and punctuation in the markup will appear in places you wouldn’t expect, and sometimes interfere with the order of the content itself…
If you are dealing with content that is predominantly in a right-to-left script, the ideal solution would be a source editor that recognizes markup as a special construct, and protects it to produce a sensible order for the characters in the source text.
Here’s an example. Suppose I take this XML document:
<top>This is a test.<animal name="duck." type="bird">apple</animal></top>
Now I translate the text into Arabic:
- This is a test: هذا اختبار
- duck: بطة
- bird: طائر
- apple: تفاحة
The result looks like this:
<top>هذا اختبار.<animal name="بطة." type="طائر">تفاحة</animal></top>
CodeMirror displays it the same way:
The problems are:
- The positions of the words طائر and تفاحة are switched, the quotation mark (") after تفاحة is in the wrong place, and the right angle bracket (>) at the end of the animal tag is displayed as a left angle bracket.
- Each of the full stops (.) after هذا اختبار and بطة should be to the left of the preceding text.
This happens because the Unicode bidi algorithm has incorrectly identified a sequence of characters containing punctuation as a run of RTL characters, or as a run of LTR characters.
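To see why punctuation drifts, it can help to look at the bidi class that Unicode assigns to each character. The following is a small illustrative sketch in Python (not CodeMirror code, which is JavaScript) using the standard library's unicodedata module; the sample string and the helper name bidi_classes are made up for this example.

```python
import unicodedata

# Bidi classes: "L" is strong left-to-right, "R" and "AL" are strong
# right-to-left. Classes such as "ON", "CS" and "WS" are neutral or weak,
# so the bidi algorithm resolves their direction from neighbouring text.
sample = 'name="بطة." type'  # hypothetical fragment of the example document


def bidi_classes(text):
    """Return (character, Unicode bidi class) pairs for each character."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


for ch, cls in bidi_classes(sample):
    print(repr(ch), cls)
```

On this sample, the Latin letters report L, the Arabic letters report AL, the quotation marks and the equals sign report ON, and the full stop reports CS. Because the markup punctuation has no direction of its own, it gets attached to whichever strong run surrounds it.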
To solve this problem, it isn’t enough to add the attribute dir="rtl" to the html tag. If you do that, you get:
This replaces the problems above with other problems:
- The second tag looks like a <name> tag with an animal attribute, rather than an <animal> tag with a name attribute.
- The slash inside the closing </animal> tag has moved inside the closing </top> tag.
- The opening tags go from right to left, but the closing tags go from left to right.
Again, these are symptoms of the limitations of the Unicode bidi algorithm.
Fortunately this problem doesn’t seem to be difficult to solve in HTML. This can be done by adding <span> tags with appropriate dir attributes, as suggested in the W3C document Inline markup and bidirectional text in HTML. The following approach seems to work with Chrome version 50 and Firefox version 48:
- Put a <span dir="ltr">...</span> around each XML tag.
- Put a <span dir="rtl">...</span> around any RTL content enclosed by an element or in the value of an attribute.
- For Firefox only: add a zero-width space (&#8203;) between two <span> elements if nothing else is separating them. (This is necessary only if the overall direction of the HTML document is RTL.)
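The steps above can be sketched as follows. This is an illustrative Python sketch under simplifying assumptions, not CodeMirror code: the names xml_to_bidi_html, wrap_tag, rtl_span and has_rtl are hypothetical, tags are found with a naive regular expression (no comments, CDATA, single-quoted attributes, or > inside attribute values), and the zero-width space is always inserted between adjacent spans rather than only for Firefox.

```python
import html
import re
import unicodedata

ZWSP = "&#8203;"  # zero-width space placed between directly adjacent spans

TOKEN = re.compile(r"<[^>]*>|[^<]+")   # an XML tag, or a run of text (naive)
ATTR_VALUE = re.compile(r'"([^"]*)"')  # a double-quoted attribute value


def has_rtl(text):
    """True if any character has a strong right-to-left bidi class."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def rtl_span(text):
    """Wrap escaped text in a span with dir="rtl"."""
    return '<span dir="rtl">' + html.escape(text, quote=False) + "</span>"


def wrap_tag(tag):
    """Wrap one XML tag in an LTR span, isolating RTL attribute values."""
    parts, pos = [], 0
    for m in ATTR_VALUE.finditer(tag):
        # Markup up to and including the opening quote.
        parts.append(html.escape(tag[pos:m.start(1)], quote=False))
        value = m.group(1)
        parts.append(rtl_span(value) if has_rtl(value)
                     else html.escape(value, quote=False))
        pos = m.end(1)
    parts.append(html.escape(tag[pos:], quote=False))
    return '<span dir="ltr">' + "".join(parts) + "</span>"


def xml_to_bidi_html(source):
    """Render XML source as HTML whose spans carry explicit directions."""
    pieces = []
    for token in TOKEN.findall(source):
        if token.startswith("<"):
            piece = wrap_tag(token)
        elif has_rtl(token):
            piece = rtl_span(token)
        else:
            piece = html.escape(token, quote=False)
        # Separate two spans with a zero-width space if nothing else does.
        if pieces and pieces[-1].endswith("</span>") and piece.startswith("<span"):
            pieces.append(ZWSP)
        pieces.append(piece)
    return "".join(pieces)


print(xml_to_bidi_html('<top>هذا اختبار.<animal name="بطة." type="طائر">تفاحة</animal></top>'))
```

In an editor, the same decisions would more likely be driven by the syntax highlighter's token types than by a regular expression, since the highlighter already knows which characters are markup.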
The example XML document can be rendered correctly using the following HTML:
<span dir="ltr">&lt;top&gt;</span>&#8203;<span dir="rtl">هذا اختبار.</span>&#8203;<span dir="ltr">&lt;animal name="<span dir="rtl">بطة.</span>" type="<span dir="rtl">طائر</span>"&gt;</span>&#8203;<span dir="rtl">تفاحة</span>&#8203;<span dir="ltr">&lt;/animal&gt;</span>&#8203;<span dir="ltr">&lt;/top&gt;</span>
This works regardless of whether the overall direction of the HTML document is LTR or RTL. In an LTR context, the tags are ordered from left to right, the angle brackets are correct, and all the RTL content is displayed correctly and in the right places:
In an RTL context, the tags are ordered from right to left, and everything else is still correct:
Here’s a plain HTML page illustrating the problem and the proposed solution.
Does it seem feasible to implement this solution in CodeMirror? If so, how can we help?
Top GitHub Comments
I want to emphasize the point that @ahangarha is trying to make. Situations in which both RTL and LTR paragraphs exist in the same text container are pretty common. A simple example would be a Jupyter notebook markdown cell, which contains text paragraphs in an RTL language, as well as a LaTeX equation block which should be displayed in LTR.
HTML5 addressed this problem with the dir=auto attribute, which detects the direction of each paragraph based on its first strongly directional character. Applying this attribute to our cm-line objects seems to do the trick just fine, but if I understand correctly, @marijnh mentions that we need a way to get some feedback and know which direction the browser has assigned to each cm-line. Unfortunately I couldn’t find a way to do that.
But what if we detected the direction ourselves, instead of using dir=auto? We just have to check the first strongly directional character in each paragraph, and a gist (here) already shows that it’s not that hard.
I’m excited to work on it as I’m frustrated with the way bidi text is handled in Jupyter notebook.
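A minimal sketch of that kind of first-strong detection, written in Python for illustration only (an editor would do this in JavaScript). The function name line_direction and its default parameter are hypothetical, and the sketch ignores details that browsers handle for dir=auto, such as skipping text inside bidi isolates.

```python
import unicodedata


def line_direction(line, default="ltr"):
    """Guess a line's base direction from its first strongly directional character."""
    for ch in line:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "ltr"
        if cls in ("R", "AL"):
            return "rtl"
    # No strong character: empty line, or only digits, spaces and punctuation.
    return default


# Hypothetical notebook cell: an Arabic paragraph, a LaTeX block, an Arabic list item.
text = "هذا اختبار.\n$$E = mc^2$$\n- بطة"
print([line_direction(line) for line in text.split("\n")])
# prints ['rtl', 'ltr', 'rtl']
```

Because the direction is computed in the editor's own code, the result is known without asking the browser, which is the feedback that dir=auto alone does not provide.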
I have another issue regarding bidi. CodeMirror has a page related to bidi which, as I understand it, is not really about bidi. Bidi support means dealing with text that can be either RTL or LTR, and giving the browser what it needs to display that text in the right direction.
In my experience, adding dir="auto" to all elements that can contain RTL or LTR text solves the problem quite effectively.
This is the result of a tweak I made on your site:
By the way, I have made a Firefox add-on called Add Bidi Support that applies this idea to web pages. There are some screenshots of its effect on pages. Take a look.
Should I keep this suggestion here, or should I open another issue with almost the same title?