consider changing `Text` nodes' JSON serialization structure
This is something I’m unsure of… would love feedback.
There are a few different ways a text node could be serialized, and they have different tradeoffs. I’m opening this issue because I’m not sure the current structure we’ve chosen makes the right tradeoffs, and if it doesn’t I want to fix it sooner rather than later.
For example, given the text:
A line of rich text.
There are a handful of ways to represent it…
Split-text Ranges
This is the current structure. If a text node has marks in it, the `ranges` array contains ranges that split up the text, according to the overlapping marks in each section. You’d end up with JSON of:
{
  kind: 'text',
  ranges: [
    {
      text: 'A ',
      marks: [],
    },
    {
      text: 'line',
      marks: [{ kind: 'mark', type: 'bold', data: {} }],
    },
    {
      text: ' of ',
      marks: [],
    },
    {
      text: 'rich',
      marks: [{ kind: 'mark', type: 'bold', data: {} }],
    },
    {
      text: ' text.',
      marks: [],
    },
  ]
}
function toPlaintext(node) {
  // Concatenate the text of every range, ignoring marks.
  return node.ranges.map(r => r.text).join('')
}
function toXml(node) {
  return node.ranges.map((range) => {
    // Wrap the range's text in one element per mark.
    return range.marks.reduce((xml, mark) => {
      return `<${mark.type}>${xml}</${mark.type}>`
    }, range.text)
  }).join('')
}
This is similar to the approach ProseMirror uses, although its version has a text node for each range, instead of a text node comprising a list of ranges.
Pros
- Very easy to construct nested serialized forms from it like XML (eg. `<bold>rich</bold>`) because you can simply iterate through the `ranges` array and build them.
- Still somewhat easy to construct the entire string of text, because `ranges.map(r => r.text).join('')` gives it to you.
Cons
- The full string of text is not readable; it’s very hard to look at a definition for a text node with marks on it and recognize what the text is.
- In the common case of a paragraph without marks, the definition still contains a `ranges` array that is populated with a single range of text, which is slightly more complex.
- Can be less efficient size-wise than some other forms in cases where the same mark is used multiple times in a single text node, since the mark is repeated for each use. (I’m not sure if this really matters when you factor in GZIP though?)
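One way to get a feel for the GZIP question is simply to measure it. The sketch below assumes Node.js and uses a made-up `measure` helper; it reports the raw and gzipped byte counts of the split-text example above.

```javascript
const zlib = require('node:zlib')

// The split-text form of "A line of rich text." from the example above.
const bold = { kind: 'mark', type: 'bold', data: {} }
const splitText = {
  kind: 'text',
  ranges: [
    { text: 'A ', marks: [] },
    { text: 'line', marks: [bold] },
    { text: ' of ', marks: [] },
    { text: 'rich', marks: [bold] },
    { text: ' text.', marks: [] },
  ],
}

// Hypothetical helper: raw and gzipped byte counts of a node's JSON.
function measure(node) {
  const json = JSON.stringify(node)
  return {
    raw: Buffer.byteLength(json),
    gzipped: zlib.gzipSync(json).length,
  }
}

console.log(measure(splitText))
```

For a node this small the fixed gzip overhead dominates, so a meaningful comparison between structures would need to run the same helper over a realistic document serialized each way.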
Index-based Ranges
Another approach would be to keep the text as a single string, and have the marks accompanied by offsets in the string, like so:
{
  kind: 'text',
  text: 'A line of rich text.',
  ranges: [
    {
      start: 2,
      end: 6,
      marks: [{ kind: 'mark', type: 'bold', data: {} }],
    },
    {
      start: 10,
      end: 14,
      marks: [{ kind: 'mark', type: 'bold', data: {} }],
    },
  ]
}
function toPlaintext(node) {
  return node.text
}
function toXml(node) {
  return ????
}
This is the approach Draft.js uses, although instead of `start`/`end` it uses `offset`/`length`, which would match our operations more, so that might be preferred. (They’re probably pretty equivalent since either can be derived easily from the other.)
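To make the equivalence concrete, here is a minimal sketch of the conversion in both directions. The function names are made up for illustration, and `end` is assumed to be exclusive, as in the example above where `start: 2, end: 6` covers "line".

```javascript
// Hypothetical helpers; `end` is treated as exclusive.
function toOffsetLength({ start, end, ...rest }) {
  return { offset: start, length: end - start, ...rest }
}

function toStartEnd({ offset, length, ...rest }) {
  return { start: offset, end: offset + length, ...rest }
}

const range = { start: 2, end: 6, marks: [] }
console.log(toOffsetLength(range))
console.log(toStartEnd(toOffsetLength(range)))
```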
Pros
- Very easy to read the entire string of text by itself. And easy to get a sense for which marks are applied to the string.
Cons
- Although it’s easy to see which marks are somewhere in the string, it’s not easy to see exactly where they are applied, since you have to do the offset math in your head.
- Harder to reason about what the logic would be to build up a nested serialized form like XML (eg. `<bold>rich</bold>`) because you can’t just loop the ranges. (Unsure how hard this actually is?)
- Can be less efficient size-wise than some other forms in cases where the same mark is used multiple times in a single text node, since the mark is repeated for each use. (I’m not sure if this really matters when you factor in GZIP though?)
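As a rough gauge of how hard the XML case is, the sketch below shows one possible way to do it: cut the string at every range boundary, collect the marks covering each piece, and then wrap each piece as in the split-text form. `toSplitRanges` is a made-up helper, `end` is assumed to be exclusive, and marks shared by overlapping ranges are not de-duplicated.

```javascript
// Hypothetical helper: turn index-based ranges into split-text ranges.
// Assumes `end` is exclusive and all offsets are valid for `node.text`.
function toSplitRanges(node) {
  const { text, ranges } = node

  // Every range boundary plus both ends of the string, sorted and unique.
  const cuts = new Set([0, text.length])
  for (const { start, end } of ranges) {
    cuts.add(start)
    cuts.add(end)
  }
  const points = [...cuts].sort((a, b) => a - b)

  const result = []
  for (let i = 0; i < points.length - 1; i++) {
    const from = points[i]
    const to = points[i + 1]
    // Marks from every range that fully covers this segment.
    const marks = ranges
      .filter(r => r.start <= from && to <= r.end)
      .flatMap(r => r.marks)
    result.push({ text: text.slice(from, to), marks })
  }
  return result
}

function toXml(node) {
  return toSplitRanges(node)
    .map(range =>
      range.marks.reduce(
        (xml, mark) => `<${mark.type}>${xml}</${mark.type}>`,
        range.text
      )
    )
    .join('')
}

const node = {
  kind: 'text',
  text: 'A line of rich text.',
  ranges: [
    { start: 2, end: 6, marks: [{ kind: 'mark', type: 'bold', data: {} }] },
    { start: 10, end: 14, marks: [{ kind: 'mark', type: 'bold', data: {} }] },
  ],
}

console.log(toXml(node))
// A <bold>line</bold> of <bold>rich</bold> text.
```

It is not a lot of code, but it is the kind of function every consumer of the index-based form would need to write or import before rendering anything.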
Mark-based Ranges
Another approach would be to treat the marks themselves as the primary grouping factor, resulting in the least possible duplication of the mark value, which is where the most size can be wasted.
{
  kind: 'text',
  text: 'A line of rich text.',
  ranges: [
    {
      mark: { kind: 'mark', type: 'bold', data: {} },
      indexes: [
        { start: 2, end: 6 },
        { start: 10, end: 14 },
      ]
    },
  ]
}
function toPlaintext(node) {
  return node.text
}
function toXml(node) {
  return ????
}
Pros
- Still very easy to read the entire string of text.
- Probably the absolute most space-efficient in terms of least unnecessary repetition of marks. (Although I’m not sure if this really matters when GZIP is considered.)
Cons
- Potentially even harder to build up the nested serialized form like XML (eg. `<bold>rich</bold>`) because the indexes are further nested/complicated?
- Very hard to reason about which marks are exactly where in the text.
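One possible way to tame the extra nesting is to flatten the mark-based form into index-based ranges first. The sketch below uses a made-up `toIndexRanges` helper and only covers that flattening step.

```javascript
// Hypothetical helper: flatten mark-based ranges into index-based ranges,
// one entry per (mark, index pair), ordered by position in the text.
function toIndexRanges(node) {
  return node.ranges
    .flatMap(({ mark, indexes }) =>
      indexes.map(({ start, end }) => ({ start, end, marks: [mark] }))
    )
    .sort((a, b) => a.start - b.start || a.end - b.end)
}

const node = {
  kind: 'text',
  text: 'A line of rich text.',
  ranges: [
    {
      mark: { kind: 'mark', type: 'bold', data: {} },
      indexes: [
        { start: 2, end: 6 },
        { start: 10, end: 14 },
      ],
    },
  ],
}

console.log(JSON.stringify(toIndexRanges(node), null, 2))
```

Once the ranges are in index-based form, any splitting logic written for that form applies unchanged, so the extra cost of the mark-based structure is this one additional pass.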
External Mark Dictionary
There’s another approach that would have the marks defined outside of the text nodes themselves, at the top level of the document. This is actually the most efficient. However, I’m not going to consider this one because I think having nodes be self-contained is much more important here. Otherwise you’d need to carry that dictionary down the tree for each node you render, which is not fun.
This is something that Draft.js used to use for the “entities”, but they’ve since migrated away I think, for the reasons discussed.
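To illustrate the cost of non-self-contained nodes, here is a sketch of what such a structure might look like. The document shape, the `marks` dictionary, and the `b` key are all invented for this example and do not reflect any real Slate or Draft.js format.

```javascript
// Hypothetical shape: marks are stored once at the top of the document,
// and each range refers to them by key.
const doc = {
  marks: { b: { kind: 'mark', type: 'bold', data: {} } },
  nodes: [
    {
      kind: 'text',
      ranges: [
        { text: 'A ', marks: [] },
        { text: 'line', marks: ['b'] },
        { text: ' of rich text.', marks: [] },
      ],
    },
  ],
}

// The renderer needs the dictionary as an extra argument,
// because a text node no longer fully describes itself.
function toXml(node, dictionary) {
  return node.ranges
    .map(range =>
      range.marks.reduce((xml, key) => {
        const mark = dictionary[key]
        return `<${mark.type}>${xml}</${mark.type}>`
      }, range.text)
    )
    .join('')
}

console.log(doc.nodes.map(n => toXml(n, doc.marks)).join(''))
// A <bold>line</bold> of rich text.
```

Every function that walks the tree would need the same extra parameter threaded through it.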
If anyone has thoughts (or even alternate structures I haven’t considered) I’d love to hear them! Or if you’ve had experience working with multiple structures and have preferences/ideas.
Thanks!
Issue Analytics
- Created 6 years ago
- Comments: 7 (5 by maintainers)
Top GitHub Comments
@tpreusse an AST explorer for Slate would be great!
@tuanmng that’s essentially what we have now, except it turns text nodes into arrays instead of objects. Which definitely makes them more terse, although I think it’s slightly more confusing for a record to be serialized into an array, especially when multiple records are concerned.
The issue with inlines vs. marks I think extends to outside of editing. (Although, it’s a very nuanced distinction, which I often question myself haha.) But basically… inlines are nodes that have some semantic value as a distinct unit, for example a `link`.
The thing with marks is that they are order-independent: they’re stored as a `Set`. Which is good, because for formatting this is how you want to think of them, either some text is `bold` or not, but it doesn’t matter whether it’s `bold` then `italic`, or `italic` then `bold`.
Since they are order-independent, you can render them as `<bold><italic>text</italic></bold>` or `<italic><bold>text</bold></italic>` and that should be equivalent. And since, unlike inlines, they are not a distinct unit, you can render overlapping ranges of marks in any way you please, as long as each character ends up receiving the marks it needs.
With marks, once two of the same mark become adjacent, the entire span of text has the mark.
However, with inlines those properties are different. To break an inline into two parts is to change its meaning, or to have 2 inlines. If you model things that are expected to be inlines as marks, you can end up with unwanted behavior. Consider a `bold` and `link` mark interaction:
Here you actually end up with three links, each to the same place, because you could not guarantee the render ordering of the marks. Sometimes you’ll get 3 links, sometimes 1. And if you style them with underlines for instance, that breakage will be apparent to end users.
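As a rough illustration of the kind of interaction described, the sketch below renders each range independently, in the same way as the `toXml` function from the issue. The node, the `link` mark, and its `href` are made up for this example.

```javascript
// Hypothetical marks and node: a link modeled as a mark, partly bolded.
const link = { kind: 'mark', type: 'link', data: { href: 'https://example.com' } }
const bold = { kind: 'mark', type: 'bold', data: {} }

const node = {
  kind: 'text',
  ranges: [
    { text: 'a ', marks: [link] },
    { text: 'bold', marks: [link, bold] },
    { text: ' link', marks: [link] },
  ],
}

// Same per-range wrapping as the split-text toXml in the issue.
function toXml(node) {
  return node.ranges
    .map(range =>
      range.marks.reduce(
        (xml, mark) => `<${mark.type}>${xml}</${mark.type}>`,
        range.text
      )
    )
    .join('')
}

console.log(toXml(node))
// <link>a </link><bold><link>bold</link></bold><link> link</link>
```

The one logical link comes out as three separate `<link>` elements, because each range is wrapped on its own and nothing forces `link` to be the outermost wrapper.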
In certain cases, if you know the schema of the content, you can use that knowledge to enforce your own rendering order to the marks, so that you could use `link` marks without this problem happening. But since Slate doesn’t inherently know the schema, it doesn’t do that.
Thank you all!
After writing this up, reading the comments, and thinking it through some more, I’m happy with the current Slate structure. I think it prioritizes being able to use the structure easily (for rendering, serializing, etc.) and it makes using it the “correct” way simple, which fits nicely with Slate’s goal to prevent leaking unnecessary complexity into your codebases. Otherwise it seems like everyone is going to be re-inventing the same, more complex function to parse range indexes into a usable format to render things with.
It does that at a slight tradeoff in terms of efficiency and readability, but since efficiency is largely mitigated by GZIP, and the readability is only in the JSON form which people aren’t reading
Haha thanks @CameronAckermanSEL! Don’t worry, we will not go the entity map route.
I’m even thinking that the current way might be the best, for a similar reason. The reason entity map was so horrible was because it makes the objects themselves not self-contained, so you have to keep weird state from elsewhere around as you recurse through the tree. Since Slate is tree-based, where Draft is not, I feel like this might be even more reason to keep the current structure in which ranges are completely self-contained.