Auto-detect language per-line is guaranteed to produce poor results


Hey, current maintainer of Highlight.js here. This just came to my attention via #391.

You’re looping over lines and then calling highlightAuto on every line (when you don’t have a known language). This is not recommended and is guaranteed to produce poor results. Auto-detect is not intended to be useful with so little data, and the noise will often (as reported in #391) be much higher than the signal: you’re as likely to get a random language as anything useful. There will be color, but it will often be all wrong.

If you do wish to use auto-detect you should pass us the ENTIRE document (or at the very least all the available lines from the document/diff), then look at the language we determine it to be, then use that language for every single line.
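A minimal sketch of that "detect once, reuse everywhere" idea follows. The helper name `detectLanguage`, the `fallback` parameter, and passing `hljs` in as an argument are illustrative assumptions, not part of highlight.js; the only library call relied on is `highlightAuto`, whose result carries the detected `language`.

```javascript
// Sketch: run auto-detection once over ALL available lines, then reuse
// the detected language for every line. `hljs` is a highlight.js
// instance, e.g. require('highlight.js'); it is passed in as a parameter
// (an assumption of this sketch) so the helper is easy to exercise.
function detectLanguage(hljs, lines, fallback = 'plaintext') {
  // Join the lines so auto-detect sees as much content as possible.
  const result = hljs.highlightAuto(lines.join('\n'));
  // Fall back to a caller-chosen default (hypothetical parameter)
  // if no language is reported.
  return result.language || fallback;
}
```

The returned name can then be passed as the `language` option for every later highlight call on the same document, instead of calling `highlightAuto` per line.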

You’ll have to take this approach with version 11 anyway, since you’ll have to do the highlighting in a single pass (rather than per-line). That means calling highlightAuto upfront for all available lines, letting it use the greater amount of content for its auto-detection, and then splitting that result back out into the individual lines you need, already highlighted for you.

You’ll have to do it twice, of course: once each for the before and after streams.
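As a rough illustration of the two streams, the sketch below separates diff lines into a "before" and an "after" string. It assumes unified-diff style input where each line begins with a space (context), `-` (removed), or `+` (added); the function name `toStreams` and that input shape are assumptions made for this example.

```javascript
// Sketch: build the "before" and "after" streams from diff lines.
// Assumes each line starts with ' ', '-', or '+' (unified diff markers).
function toStreams(diffLines) {
  const before = [];
  const after = [];
  for (const line of diffLines) {
    const marker = line[0];
    const text = line.slice(1); // drop the one-character marker
    if (marker !== '+') before.push(text); // context and removed lines
    if (marker !== '-') after.push(text);  // context and added lines
  }
  return { before: before.join('\n'), after: after.join('\n') };
}
```

Each of the two strings would then be highlighted on its own, so that auto-detection and multi-line scopes see a coherent version of the code rather than an interleaving of old and new lines.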

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
joshgoebel commented, Jun 5, 2021

If I invoke highlight.js for each line individually like this hljs.highlight(codeString, { language, ignoreIllegals: true }) is it going to be a problem in v11?

That’s not really what you want to do as it will break on any scopes that persist past the end of a line boundary. What you’d really want to do:

  • Collect all sequential diff lines
  • Do this for both the “original” and the “changed” versions
  • So you’ll have say 10 lines of code in two strings now, “before” and “after”
  • Highlight both blocks of code.
  • Split the lines (this will require some small amount of parsing and fixing up tags to end at newlines and re-open on the following line). That can be done in 5-10 lines though, as our HTML output is VERY clean and easy to parse.
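The last step above could be sketched roughly as follows. This is not code from highlight.js; `splitHighlightedLines` and `highlightLines` are hypothetical helpers, and the approach assumes the highlighted HTML contains only `<span ...>` and `</span>` tags with all other `<` characters escaped. The `hljs.highlight(code, { language, ignoreIllegals: true })` call is the one quoted in the question above.

```javascript
// Sketch: split highlighted HTML into per-line fragments, closing any
// open <span> tags at each newline and re-opening them on the next line.
function splitHighlightedLines(html) {
  const open = [];  // stack of currently open <span ...> tags
  const lines = [];
  let current = '';
  // Tokens: span tags, newlines, runs of other text, or a stray '<'.
  const tokens = html.match(/<\/?span[^>]*>|\n|[^<\n]+|</g) || [];
  for (const tok of tokens) {
    if (tok === '\n') {
      lines.push(current + '</span>'.repeat(open.length));
      current = open.join(''); // re-open the same scopes on the next line
    } else {
      if (tok.startsWith('</span')) open.pop();
      else if (tok.startsWith('<span')) open.push(tok);
      current += tok;
    }
  }
  lines.push(current + '</span>'.repeat(open.length));
  return lines;
}

// Sketch: highlight a whole block in one pass, then split it per line.
// `hljs` is passed in as a parameter (an assumption of this sketch).
function highlightLines(hljs, code, language) {
  const html = hljs.highlight(code, { language, ignoreIllegals: true }).value;
  return splitHighlightedLines(html);
}
```

For example, a comment span that covers two lines comes back as two fragments, each wrapped in its own complete span, so every line can be rendered independently in a diff table.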

You’d really need to do this for each section of a diff (if they are non-sequential). So if a diff included 3 discrete changes, ~10 lines each then you’d be grouping each of those 3 changes into blocks and then highlighting all 3 blocks. Then splitting them apart again to get at the individual highlighted lines.
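One possible way to form those blocks is sketched below. It assumes each diff line is represented as an object with a `lineNumber` and `text`, which is a hypothetical shape chosen for this example; `groupSequential` is likewise an illustrative name.

```javascript
// Sketch: group diff lines into blocks of consecutive line numbers, so
// each block can be highlighted as one unit and split apart afterwards.
// The { lineNumber, text } shape is an assumption of this sketch.
function groupSequential(lines) {
  const blocks = [];
  for (const line of lines) {
    const last = blocks[blocks.length - 1];
    const prev = last && last[last.length - 1];
    if (prev && line.lineNumber === prev.lineNumber + 1) {
      last.push(line);      // continues the current block
    } else {
      blocks.push([line]);  // a gap in line numbers starts a new block
    }
  }
  return blocks;
}
```

Joining the `text` of each block with newlines gives the string to highlight for that block; the same grouping would be applied separately to the "before" and "after" versions.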

1 reaction
rtfpessoa commented, Jun 4, 2021

(@rtfpessoa This whole thread might be useful reading: https://meta.stackexchange.com/q/355852/188348)

@iHiD thanks for the reference. Will definitely read it.
