RFC: TSDoc-flavored-Markdown (TSFM) instead of CommonMark

Based on the problems encountered in the issue #12 thread, we have concluded that TSDoc cannot reasonably be based directly on the CommonMark spec. The goals conflict:

CommonMark goal: (“common” = union) Provide a standardized algorithm for parsing every familiar markup notation. It’s okay if the resulting syntax rules are impossible for humans to memorize, because mistakes can be easily corrected using the editor’s interactive preview. If a syntax is occasionally misinterpreted, the consequence is incorrect formatting on the web site, which is a relatively minor issue.
TSFM goal: (“common” = intersection) Provide a familiar syntax that is very easy for humans to memorize, so that a layperson can predict exactly how their markup will be rendered (by every possible downstream doc pipeline). Computer source code is handled by many different viewers which may not support interactive preview. If a syntax is occasionally misinterpreted, the consequence is that a tag such as @beta or @internal may be ignored by the parser, which could potentially cause a serious issue (e.g. an escalation from an enterprise customer whose service was interrupted because of a broken API contract).

Hypothesis: For every TSFM construct, there exists a normalized form that will be parsed identically by CommonMark and TSDoc. In “strict mode” the TSDoc library can issue warnings for expressions that are not in normalized form. Assuming the author eliminates all such warnings, a documentation pipeline can pass unmodified TSDoc content through to a backend CommonMark engine and have confidence that the rendered output will be correct.

Below are some proposed TSFM restrictions:

Whitespace generally doesn’t matter

This principle is very easy for people to remember, and eliminates a ton of edge cases.

Example 1:

/**
 * TSFM considers this to be an HTML element, whereas CommonMark does not:
 * <element attribute="@tag"
 *
 * />
 */

Example 1 converted to normalized form (so CommonMark interprets it the same as TSDoc):

/**
 * TSFM considers this to be an HTML element, whereas CommonMark does not:
 * <element attribute="@tag"
 * />
 */

Example 2:

/**
 * CommonMark interprets this indentation to make a code block, TSFM sees rich markup:
 * 
 *     **bold** @tag
 */

Example 2 converted to normalized form (so CommonMark interprets it the same as TSDoc):

/**
 * CommonMark interprets this indentation to make a code block, TSFM sees rich markup:
 * 
 * **bold** @tag
 */
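
A strict-mode check for Example 2 could look roughly like the sketch below. The function name `findIndentedLines` and the simplified rule (a line indented by four or more spaces that follows a blank line, the start of the comment, or another flagged line) are assumptions made for illustration; they are not part of the TSDoc library, and the real CommonMark rules for indented code blocks have more cases.

```typescript
// Hypothetical strict-mode check (not a TSDoc API): report the lines that
// CommonMark would likely treat as an indented code block, while TSFM
// ignores the indentation. `lines` is the comment content with the
// leading " * " already removed.
function findIndentedLines(lines: string[]): number[] {
  const flagged: number[] = [];
  // The start of the comment behaves like a preceding blank line.
  let previousAllowsCodeBlock = true;

  lines.forEach((line, index) => {
    const isBlank = line.trim().length === 0;
    const isIndented = !isBlank && /^ {4,}/.test(line);

    if (isIndented && previousAllowsCodeBlock) {
      flagged.push(index);
      // A following indented line continues the same code block.
      previousAllowsCodeBlock = true;
    } else {
      // An indented code block cannot interrupt a paragraph,
      // so only a blank line re-enables the check.
      previousAllowsCodeBlock = isBlank;
    }
  });

  return flagged;
}

const example2 = [
  "CommonMark interprets this indentation to make a code block, TSFM sees rich markup:",
  "",
  "    **bold** @tag",
];

// Index 2 (the indented line) is reported as needing normalization.
console.log(findIndentedLines(example2));
```

A warning on each reported line would prompt the author to remove the indentation, which yields the normalized form shown above.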

Stars cannot be nested arbitrarily

TSDoc will support stars for bold/italics, based on 6 types of tokens that can be recognized by the lexical analyzer with minimal lookahead:

Opening italics single-star, e.g. *text is interpreted as <i>text
Closing italics single-star, e.g. text* is interpreted as text</i>
Opening bold double-star, e.g. **text is interpreted as <b>text
Closing bold double-star, e.g. text** is interpreted as text</b>
Opening bold+italics triple-star, e.g. ***text is interpreted as if <b+i>text
Closing bold+italics triple-star, e.g. text*** is interpreted as if text</b+i>

Other patterns are NOT interpreted as star tokens, e.g. text * text * contains literal asterisks, as does ****a****. A letter in the middle of a word can never be styled using stars, e.g. Toys*R*Us contains literal asterisk characters. A single-star followed by a double-star can be closed by a triple-star (e.g. *italics **bold+italics*** is seen as <i>italics <b>bold+italics</b+i>). Star markup is prohibited from spanning multiple lines.

Other characters (e.g. underscore) are NOT supported by TSDoc as synonyms for bold/italics.
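
The six token types can be recognized by looking only at the characters immediately before and after a run of stars. The sketch below is one possible reading of the rules above, not the TSDoc lexer: the names `StarKind`, `StarToken` and `classifyStars` are hypothetical, and the treatment of punctuation (a run may open after punctuation and close before it) and of backslash escapes is an assumption that the proposal does not spell out.

```typescript
type StarKind =
  | "open-italics" | "close-italics"
  | "open-bold" | "close-bold"
  | "open-bold-italics" | "close-bold-italics"
  | "literal";

interface StarToken {
  kind: StarKind;
  index: number;  // offset of the first star within the line
  length: number; // number of consecutive stars in the run
}

function isWordChar(ch: string): boolean {
  return /^[A-Za-z0-9]$/.test(ch);
}

function isSpaceOrEdge(ch: string): boolean {
  return ch === "" || /^\s$/.test(ch);
}

// Classifies every run of stars in a single line (star markup cannot span lines).
function classifyStars(line: string): StarToken[] {
  const tokens: StarToken[] = [];
  let i = 0;
  while (i < line.length) {
    const ch = line.charAt(i);
    if (ch === "\\") { i += 2; continue; } // assumption: a backslash escapes the next character
    if (ch !== "*") { i++; continue; }

    let end = i;
    while (end < line.length && line.charAt(end) === "*") end++;
    const length = end - i;
    const before = i > 0 ? line.charAt(i - 1) : "";
    const after = line.charAt(end); // "" at the end of the line

    // Opening: text follows directly, and we are not in the middle of a word.
    const canOpen = !isSpaceOrEdge(after) && !isWordChar(before);
    // Closing: text precedes directly, and no word continues afterwards.
    const canClose = !isSpaceOrEdge(before) && !isWordChar(after);

    let kind: StarKind = "literal";
    if (length <= 3 && canOpen !== canClose) {
      const style = length === 1 ? "italics" : length === 2 ? "bold" : "bold-italics";
      kind = `${canOpen ? "open" : "close"}-${style}` as StarKind;
    }
    tokens.push({ kind, index: i, length });
    i = end;
  }
  return tokens;
}

for (const sample of ["*italics **bold+italics***", "Toys*R*Us", "text * text *", "****a****"]) {
  console.log(sample, "->", classifyStars(sample).map((t) => t.kind).join(", "));
}
```

Runs of four or more stars, stars surrounded by whitespace, and stars in the middle of a word all come out as `literal`, which matches the cases listed above. Pairing opening and closing tokens is left to a later parsing step.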

Example 3:

/**
 * *CommonMark sees italics, but TSDoc does not because
 * its stars cannot span lines.*
 *
 * CommonMark sees italics here: __proto__
 *
 * Common**M**ark sees a boldfaced M, but TSDoc sees literal stars.
 */

Example 3 normalized form:

/**
 * \*CommonMark sees italics, but TSDoc does not because
 * its stars cannot span lines.\*
 *
 * CommonMark sees italics here: \_\_proto\_\_ (or better to use `__proto__`)
 *
 * Common\*\*M\*\*ark sees a boldfaced M, but TSDoc sees literal stars.
 *
 * If you really need to boldface a letter, use HTML elements: Common<b>M</b>ark.
 */

Example 4:

/**
 * For **A **B** C** the B is double-boldfaced according to CommonMark.
 * The TSDoc tokenizer sees `<b>A <b>B</b> C</b>` which the parser then flattens
 * to `<b>A **B</b> C**` because it doesn't allow nesting.
 *
 * Improper balancing also gets ignored, e.g. for **A *B** C* the TSDoc tokenizer
 * will see `<b>A <i>B</b> C</i>` which the parser flattens to `<b>A *B</b> C*`
 * Whereas CommonMark would counterintuitively see `<i><i>A<i>B</i></i>C</i>`.
 */

Example 4 normalized form:

/**
 * For **A \*\*B** C\*\* the B is double-boldfaced according to CommonMark.
 * The TSDoc tokenizer sees `<b>A <b>B</b> C</b>` which the parser then flattens
 * to `<b>A **B</b> C**` because it doesn't allow nesting.
 *
 * Improper balancing also gets ignored, e.g. for **A \*B** C\* the TSDoc tokenizer
 * will see `<b>A <i>B</b> C</i>` which the parser flattens to `<b>A *B</b> C*`
 * Whereas CommonMark would counterintuitively see `<i><i>A<i>B</i></i>C</i>`.
 */

Code spans are simplified

For TSFM, an unescaped backtick always starts a code span, and the span ends at the next backtick. Whitespace doesn’t matter.

Example 5:

/**
 * `Both TSDoc and CommonMark
 * agree this is code.`
 *
 * before `CommonMark disagrees
 *
 * if a line is skipped, though.` after
 *
 * `But this is not code because the backtick is unterminated
 */

Example 5 normalized form:

/**
 * `Both TSDoc and CommonMark
 * agree this is code.`
 *
 * before `CommonMark disagrees
 * if a line is skipped, though.` after
 *
 * \`But this is not code because the backtick is unterminated
 */
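
The code span rule is simple enough to express as a single scan over the comment text. The following is a minimal sketch under stated assumptions: `CodeSpan` and `findCodeSpans` are hypothetical names, a backslash is assumed to escape only an opening backtick (the closing backtick is simply the next one found), and an unterminated backtick is treated as literal text, as in Example 5.

```typescript
interface CodeSpan {
  start: number; // offset of the opening backtick
  end: number;   // offset just past the closing backtick
  text: string;  // content between the two backticks
}

// Scans the whole comment content (lines joined with "\n"), because
// under the proposed rule a code span may cross line breaks.
function findCodeSpans(content: string): CodeSpan[] {
  const spans: CodeSpan[] = [];
  let i = 0;
  while (i < content.length) {
    const ch = content.charAt(i);
    if (ch === "\\") { i += 2; continue; } // skip an escaped character
    if (ch !== "`") { i++; continue; }

    const close = content.indexOf("`", i + 1);
    if (close === -1) break; // unterminated backtick: the rest is plain text

    spans.push({ start: i, end: close + 1, text: content.slice(i + 1, close) });
    i = close + 1;
  }
  return spans;
}

const example5 = "before `CommonMark disagrees\n\nif a line is skipped, though.` after";
console.log(findCodeSpans(example5).map((span) => JSON.stringify(span.text)));
```

A strict-mode warning could then be raised for any span whose text contains a blank line, since that is the case where CommonMark disagrees.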

Blocks don’t nest

I want to say that “>” blockquotes should not be supported at all, since the whitespace handling for these constructs is highly counterintuitive. Instead we would recommend <blockquote> HTML tags for this scenario.

Lists are a very useful and common scenario. However, CommonMark lists also have a lot of counterintuitive rules regarding handling of whitespace.

A simplification would be to say that TSFM interprets any line that starts with “-” as being a list item, and the list ends with the first blank line. No other character (e.g. “*” or “+”) can be used to create lists. If complicated nesting is required, then HTML tags such as <ul> and <li> should be used to avoid any confusion.

Example 6:

/**
 * A list with 3 things
 * - item 1
 *              - item 2
 * spans several
 *      lines
 * - item 3
 *
 * Two lists separated by a newline
 * -  list 1 with one item
 *
 * - list 2 with one item
 *
 * + not a list item
 * + not a list item
 *
 * CommonMark surprisingly considers this to be a list whose first item is another list,
 * whereas TSDoc sees a minus character as the first item:
 * - - foo
 */

Example 6 normalized form:

/**
 * A list with 3 things
 * - item 1
 * - item 2
 *   spans several
 *   lines
 * - item 3
 *
 * Two lists separated by a newline
 * -  list 1 with one item
 * <!-- CommonMark requires an HTML comment to separate two lists -->
 * - list 2 with one item
 *
 * \+ not a list item
 * \+ not a list item
 * 
 * CommonMark surprisingly considers this to be a list whose first item is another list,
 * whereas TSDoc sees a minus character as the first item:
 * - \- foo
 */
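
The simplified list rule can be sketched as a small line-by-line parser. This is only an illustration: `parseLists` is a hypothetical name, leading whitespace is ignored in keeping with the "whitespace generally doesn't matter" principle, and joining continuation lines to the previous item with a single space is an assumption about how "spans several lines" should be handled.

```typescript
// Returns one string[] per list; each entry is the text of one list item.
// `lines` is the comment content with the leading " * " already removed.
function parseLists(lines: string[]): string[][] {
  const lists: string[][] = [];
  let current: string[] | undefined = undefined;

  for (const rawLine of lines) {
    const line = rawLine.trim(); // indentation carries no meaning

    if (line.length === 0) {
      current = undefined; // the first blank line ends the list
      continue;
    }

    if (line.startsWith("-")) {
      if (current === undefined) {
        current = [];
        lists.push(current);
      }
      // Only the first "-" is a list marker; "- - foo" yields the item "- foo".
      current.push(line.slice(1).trim());
    } else if (current !== undefined && current.length > 0) {
      // A non-item line inside a list continues the previous item.
      const last = current.length - 1;
      current[last] = `${current[last]} ${line}`;
    }
    // Lines starting with "+" or "*" fall through here and are ordinary text.
  }
  return lists;
}

const example6 = [
  "A list with 3 things",
  "- item 1",
  "             - item 2",
  "spans several",
  "     lines",
  "- item 3",
  "",
  "+ not a list item",
  "",
  "- - foo",
];
console.log(JSON.stringify(parseLists(example6)));
```

Because nesting is never inferred from indentation, anything more elaborate has to be written with <ul> and <li>, as recommended above.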

Top GitHub Comments

sharwell commented, Dec 3, 2018

@pgonzal The current examples show an input that would be treated differently by the two implementations, but only show one normalized form. Each example should be presented with two normalized forms:

The normalized form that causes both parsers to interpret the input in the manner that TSFM interprets the original input
The normalized form that causes both parsers to interpret the input in the manner that CommonMark interprets the original input (this is the one that’s missing)

dend commented, Jul 9, 2018

Something worth calling out here is how this can interact with docs.microsoft.com/DocFX. Now, I know that we are working on a standard here, but fragmentation and a bunch of custom stuff is a bit of a concern. We do have support for Markdown Extensions, so likely that should be a place where we can plug in.

The format you are talking about here is parser-specific. On docs.microsoft.com, we’ve recently switched to MarkDig, which handles CommonMark parsing much better. It would be preferable not to invent our own standard, since the rest of the documentation stack does not use it (and we have no plans to), and guiding people to one set of conventions for TS documentation contributions and another for the rest of docs seems problematic. Besides, this also raises the issue of our own parser interpreting the proposed conventions incorrectly.