
Consider changing `Text` nodes' JSON serialization structure


This is something I’m unsure of… would love feedback.

There are a few different ways a text node could be serialized, and they have different tradeoffs. I’m opening this issue because I’m not sure the current structure we’ve chosen makes the right tradeoffs, and if it doesn’t, I want to fix it sooner rather than later.

For example, given the text:

A line of rich text. (with “line” and “rich” in bold)

There are a handful of ways to represent it…

Split-text Ranges

This is the current structure. If a text node has marks in it, the ranges array contains ranges that split up the text, according to the overlapping marks in each section. You’d end up with JSON of:

{
  kind: 'text',
  ranges: [
    {
      text: 'A ',
      marks: [],
    },
    {
      text: 'line',
      marks: [{ kind: 'mark', type: 'bold', data: {}}],
    },
    {
      text: ' of ',
      marks: [],
    },
    {
      text: 'rich',
      marks: [{ kind: 'mark', type: 'bold', data: {}}],
    },
    {
      text: ' text.',
      marks: [],
    }
  ]
}

function toPlaintext(node) {
  return node.ranges.map(r => r.text).join('')
}

function toXml(node) {
  return node.ranges.map((range) => {
    return range.marks.reduce((xml, mark) => {
      return `<${mark.type}>${xml}</${mark.type}>`
    }, range.text)
  }).join('')
}

This is similar to the approach ProseMirror uses, although its version has a text node for each range, instead of a text node comprising a list of ranges.

Pros

Very easy to construct nested serialized forms from it like XML (e.g. <bold>rich</bold>) because you can simply iterate through the ranges array and build them.
Still somewhat easy to construct the entire string of text, because ranges.map(r => r.text).join('') gives it to you.

Cons

The full string of text is not readable; it’s very hard to look at a definition for a text node with marks on it and recognize what the text is.
In the common case of a paragraph without marks, the definition still contains a ranges array that is populated with a single range of text, which is slightly more complex.
Can be less efficient size-wise than some other forms in cases where the same mark is used multiple times in a single text node, since the mark is repeated for each use. (I’m not sure if this matters really when you factor in GZIP though?)

Index-based Ranges

Another approach would be to keep the text as a single string, and have the marks accompanied by offsets in the string, like so:

{
  kind: 'text',
  text: 'A line of rich text.',
  ranges: [
    {
      start: 2,
      end: 6,
      marks: [{ kind: 'mark', type: 'bold', data: {}}],
    },
    {
      start: 10,
      end: 14,
      marks: [{ kind: 'mark', type: 'bold', data: {}}],
    },
  ]
}

function toPlaintext(node) {
  return node.text
}

function toXml(node) {
  return ????
}

This is the approach Draft.js uses, although it uses offset/length instead of start/end. That would match our operations more closely, so it might be preferable. (They’re probably pretty equivalent since either can be derived easily from the other.)
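To show how either form can be derived from the other, here is a minimal sketch. The helper names toOffsetLength and toStartEnd are made up for illustration and are not part of Slate or Draft.js.

```javascript
// Hypothetical helpers, not part of Slate or Draft.js.

// { start, end } -> { offset, length }, keeping any other fields (e.g. marks).
function toOffsetLength({ start, end, ...rest }) {
  return { offset: start, length: end - start, ...rest }
}

// { offset, length } -> { start, end }, keeping any other fields.
function toStartEnd({ offset, length, ...rest }) {
  return { start: offset, end: offset + length, ...rest }
}
```

Since the conversion is lossless in both directions, the choice between the two is mostly about which one reads better next to the operations.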

Pros

Very easy to read the entire string of text by itself. And easy to get a sense for which marks are applied to the string.

Cons

Although it’s easy to see which marks are somewhere in the string, it’s not easy to see exactly where they are applied, since you have to do the offset math in your head.
Harder to reason about what the logic would be to build up a nested serialized form like XML (e.g. <bold>rich</bold>) because you can’t just loop over the ranges. (Unsure how hard this actually is?)
Can be less efficient size-wise than some other forms in cases where the same mark is used multiple times in a single text node, since the mark is repeated for each use. (I’m not sure if this matters really when you factor in GZIP though?)
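As a rough answer to the "how hard is this" question, one possible sketch of toXml for index-based ranges is below. It first cuts the text at every start/end offset, which effectively rebuilds split-text ranges, and then wraps each slice. The helper name toSplitRanges is hypothetical, and the sketch assumes all offsets fall within the text.

```javascript
// Hypothetical helper: rebuild split-text ranges from index-based ranges.
// Assumes every start/end is within [0, node.text.length].
function toSplitRanges(node) {
  // Every start/end offset is a point where the set of active marks can change.
  const points = new Set([0, node.text.length])
  for (const range of node.ranges) {
    points.add(range.start)
    points.add(range.end)
  }
  const sorted = [...points].sort((a, b) => a - b)

  const split = []
  for (let i = 0; i < sorted.length - 1; i++) {
    const start = sorted[i]
    const end = sorted[i + 1]
    // Collect the marks of every range that fully covers this slice.
    const marks = node.ranges
      .filter(r => r.start <= start && r.end >= end)
      .flatMap(r => r.marks)
    split.push({ text: node.text.slice(start, end), marks })
  }
  return split
}

function toXml(node) {
  return toSplitRanges(node)
    .map(range =>
      range.marks.reduce(
        (xml, mark) => `<${mark.type}>${xml}</${mark.type}>`,
        range.text
      )
    )
    .join('')
}
```

So it is doable, but every consumer of the JSON would need something like toSplitRanges before it can render anything.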

Mark-based Ranges

Another approach would be to treat the marks themselves as the primary grouping factor, resulting in the least possible duplication of the mark value, which is where the most space can be wasted.

{
  kind: 'text',
  text: 'A line of rich text.',
  ranges: [
    {
      mark: { kind: 'mark', type: 'bold', data: {}},
      indexes: [
        { start: 2, end: 6 },
        { start: 10, end: 14 },
      ]
    },
  ]
}

function toPlaintext(node) {
  return node.text
}

function toXml(node) {
  return ????
}

Pros

Still very easy to read the entire string of text.
Probably the absolute most space-efficient in terms of least unnecessary repetition of marks. (Although I’m not sure if this really matters when GZIP is considered.)

Cons

Potentially even harder to build up the nested serialized form like XML (e.g. <bold>rich</bold>) because the indexes are further nested/complicated?
Very hard to reason about which marks are exactly where in the text.
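For comparison, a minimal sketch of what toXml might look like for the mark-based form. It flattens each { mark, indexes } entry into individual spans, cuts the text at every boundary, and wraps each slice in the marks that cover it. This is only one possible approach, and it assumes all offsets fall within the text.

```javascript
// A sketch only; assumes every start/end is within [0, node.text.length].
function toXml(node) {
  // Flatten { mark, indexes } entries into one { start, end, mark } per span.
  const spans = node.ranges.flatMap(({ mark, indexes }) =>
    indexes.map(({ start, end }) => ({ start, end, mark }))
  )

  // Sorted, de-duplicated offsets where the set of active marks can change.
  const offsets = [0, node.text.length, ...spans.flatMap(s => [s.start, s.end])]
  const points = [...new Set(offsets)].sort((a, b) => a - b)

  let xml = ''
  for (let i = 0; i < points.length - 1; i++) {
    const start = points[i]
    const end = points[i + 1]
    // Wrap the slice in every mark whose span fully covers it.
    xml += spans
      .filter(s => s.start <= start && s.end >= end)
      .reduce(
        (inner, s) => `<${s.mark.type}>${inner}</${s.mark.type}>`,
        node.text.slice(start, end)
      )
  }
  return xml
}
```

The extra flattening step is the only real difference from the index-based case, so the added difficulty seems modest, though it is still more work than looping over split-text ranges.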

External Mark Dictionary

There’s another approach that would have the marks defined outside of the text nodes themselves, at the top-level of the document. This is actually the most efficient. However, I’m not going to consider this one because I think having nodes be self-contained is much more important here. Otherwise you’d need to carry that dictionary down the tree for each node you render, which is not fun.

This is something that Draft.js used to use for its “entities”, but they’ve since migrated away I think, for the reasons discussed.

If anyone has thoughts (or even alternate structures I haven’t considered) I’d love to hear them! Or if you’ve had experience working with multiple structures and have preferences/ideas.

Thanks!

Issue Analytics

State:
Created 6 years ago
Comments:7 (5 by maintainers)

Top GitHub Comments


ianstormtaylor commented, Dec 22, 2017

@tpreusse an AST explorer for Slate would be great!

@tuanmng that’s essentially what we have now, except it turns text nodes into arrays instead of objects. Which definitely makes them more terse, although I think it’s slightly more confusing for a record to be serialized into an array, especially when multiple records are concerned.

The issue with inlines vs. marks I think extends to outside of editing. (Although, it’s a very nuanced distinction, which I often question myself haha.) But basically… inlines are nodes that have some semantic value as a distinct unit—for example a link.

The thing with marks is that they are order-independent—they’re stored as a Set. Which is good, because for formatting this is how you want to think of them, either some text is bold or not, but it doesn’t matter whether it’s bold then italic, or italic then bold.

Since they are order-independent, you can render them as <bold><italic>text</italic></bold> text or <italic><bold>text</bold></italic> and that should be equivalent. And since, unlike inlines, they are not a distinct unit, you can render overlapping ranges of marks in any way you please, as long as each character ends up receiving the marks it needs.

With marks, once two of the same mark become adjacent, the entire span of text has the mark.
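To illustrate that adjacency rule on the split-text structure, here is a minimal sketch that merges neighboring ranges whose marks are the same set, regardless of order. It is not Slate's actual normalization code. The names markKey, sameMarks and mergeAdjacentRanges are hypothetical, and the key assumes each mark's data is plain JSON with a stable key order.

```javascript
// Hypothetical key for a mark; assumes `data` is plain JSON with stable key order.
const markKey = mark => `${mark.type}:${JSON.stringify(mark.data)}`

// Order-independent comparison of two mark lists, treating them as sets.
function sameMarks(a, b) {
  const keysA = new Set(a.map(markKey))
  const keysB = new Set(b.map(markKey))
  return keysA.size === keysB.size && [...keysA].every(k => keysB.has(k))
}

// Merge neighboring ranges that carry the same set of marks.
function mergeAdjacentRanges(ranges) {
  const merged = []
  for (const range of ranges) {
    const last = merged[merged.length - 1]
    if (last && sameMarks(last.marks, range.marks)) {
      // Same formatting on both sides, so the two spans become one.
      last.text += range.text
    } else {
      // Copy the range so the input array is left untouched.
      merged.push({ text: range.text, marks: range.marks })
    }
  }
  return merged
}
```

Doing the same for inlines would be wrong, since merging or splitting them changes what they mean.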

However, with inlines those properties are different. To break an inline into two parts is to change its meaning, or to have 2 inlines. If you model things that are expected to be inlines as marks, you can end up with unwanted behavior. Consider a bold and link mark interaction:

A line of text with <a href="https://google.com">an </a><strong><a href="https://google.com">important</a></strong><a href="https://google.com"> link</a> in it.

A line of text with an important link in it.

Here you actually end up with three links, each to the same place, because you could not guarantee the render ordering of the marks. Sometimes you’ll get 3 links, sometimes 1. And if you style them with underlines for instance, that breakage will be apparent to end users.

In certain cases, if you know the schema of the content, you can use that knowledge to enforce your own rendering order on the marks, so that you could use link marks without this problem happening. But since Slate doesn’t inherently know the schema, it doesn’t do that.

Thank you all!

After writing this up, reading the comments, and thinking it through some more, I’m happy with the current Slate structure. I think it prioritizes being able to use the structure easily (for rendering, serializing, etc.) and it makes using it the “correct” way simple, which fits nicely with Slate’s goal to prevent leaking unnecessary complexity into your codebases. Otherwise it seems like everyone is going to be re-inventing the same, more complex function to parse range indexes into a usable format to render things with.

It does that at a slight tradeoff in terms of efficiency and readability, but since efficiency is largely mitigated by GZIP, and the readability is only in the JSON form which people aren’t reading


ianstormtaylor commented, Dec 21, 2017

Haha thanks @CameronAckermanSEL! Don’t worry, we will not go the entity map route.

I’m even thinking that the current way might be the best, for a similar reason. The reason entity map was so horrible was because it makes the objects themselves not self-contained, so you have to keep weird state from elsewhere around as you recurse through the tree. Since Slate is tree-based, where Draft is not, I feel like this might be even more reason to keep the current structure in which ranges are completely self-contained.