Spike: Split content into chunks after Highlight.js has highlighted content
Currently, when viewing large source files, we split the content into chunks and then highlight each chunk with Highlight.js as the user scrolls the page. This keeps the amount of processing spent on highlighting during page load to a minimum, which in turn has a positive impact on our performance metrics.
However, splitting content into chunks and then highlighting in chunks does have some technical downsides (noted by the maintainer of Highlight.js).
Context
This has been noted by the maintainer of Highlight.js, who was kind enough to reach out to us with a concern about highlighting in chunks:
I just wanted to say this approach has some technical downsides... several grammars match patterns ACROSS line boundaries and this is something we allow (sometimes it's the simplest solution). In our opinion splitting content like this and parsing chunks is "not a safe approach" when one needs 100% accuracy.
Asking us to highlight content in 70 line chunks is eventually going to break when certain content crosses over one of those chunk boundaries... Usually we deal well with truncated content (which we'll just assume ended early)... the problem is when we resume highlighting on the next 70 line chunk we'll be totally confused.
You can imagine the worst case of this would be coming into the middle of a string and thinking it's program code. It's possible that entire 70 line chunk could be rendered poorly since we'd likely have backwards the idea of what is string and what is code.
If the performance problem wasn't the highlight time itself (the textual analysis) but mostly the render time I'd perhaps consider instead using our private (but long-lived) `Emitter` API to gain access to the raw parse tree itself... and then either handle adaptive chunks of that OR do your 70 line split AFTER the highlighting has taken place...
The real problem here is that when patterns can match across chunk boundaries there is no "safe" way to "jump to page 50" without first parsing all 49 prior pages so that you understand the context going into page 50...
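To make that failure mode concrete, here is a minimal sketch (ours, not the maintainer's) that highlights a multi-line template literal as a whole versus in two chunks, using the public `hljs.highlight` API; the sample snippet and the 2-line "chunks" are illustrative stand-ins for our 70-line chunks:

```typescript
import hljs from 'highlight.js';

const source = [
  'const query = `',        // a template literal that spans several lines
  '  SELECT * FROM users',
  '  WHERE id = 1',
  '`;',
].join('\n');

// Highlighting the whole file: all four lines are tokenized as one string.
const whole = hljs.highlight(source, { language: 'javascript' }).value;

// Highlighting per chunk: the parser restarts at the chunk boundary with no
// idea it is inside a string, so "WHERE id = 1" is parsed as ordinary code.
const lines = source.split('\n');
const perChunk = [lines.slice(0, 2), lines.slice(2)]
  .map((chunk) => hljs.highlight(chunk.join('\n'), { language: 'javascript' }).value)
  .join('\n');

console.log(whole !== perChunk); // true — the chunked output is mis-highlighted
```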
Suggestions:
- Use the private (but long-lived) `Emitter` API to gain access to the raw parse tree itself and handle adaptive chunks of that, or
- split content into chunks AFTER the highlighting has taken place (see the sketch below).
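A rough sketch of what the second option could look like in our viewer, assuming a hypothetical `highlightThenChunk` helper and our current 70-line chunk size; balancing `<span>` tags that cross chunk boundaries is only noted in a comment, not implemented:

```typescript
import hljs from 'highlight.js';

const CHUNK_SIZE = 70; // lines per rendered chunk (illustrative)

function highlightThenChunk(raw: string, language: string): string[] {
  // 1. Let highlight.js parse the whole file so multi-line constructs
  //    (strings, comments, heredocs) are tokenized with full context.
  const highlighted = hljs.highlight(raw, { language }).value;

  // 2. Only then split the highlighted markup into chunks for rendering.
  //    Note that a <span> opened by highlight.js can cross line boundaries,
  //    so a real implementation would need to close and re-open such spans
  //    at each chunk boundary before rendering chunks independently.
  const lines = highlighted.split('\n');
  const chunks: string[] = [];
  for (let i = 0; i < lines.length; i += CHUNK_SIZE) {
    chunks.push(lines.slice(i, i + CHUNK_SIZE).join('\n'));
  }
  return chunks;
}
```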
Since we'll be highlighting the entire file, we might want to consider offloading the highlighting task to a Web Worker as part of this issue.
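A minimal sketch of what that could look like, where the worker file name, the message shape, and the `rawFileContent` / `detectedLanguage` values are assumptions for illustration:

```typescript
// highlight.worker.ts — runs inside the Web Worker
import hljs from 'highlight.js';

self.addEventListener('message', (event: MessageEvent<{ content: string; language: string }>) => {
  // The expensive textual analysis happens off the main thread.
  const { value } = hljs.highlight(event.data.content, { language: event.data.language });
  postMessage(value);
});
```

```typescript
// Main thread — hand the raw file content to the worker, then chunk and
// render the highlighted result as the user scrolls.
const worker = new Worker(new URL('./highlight.worker.ts', import.meta.url));

worker.addEventListener('message', ({ data: highlightedHtml }: MessageEvent<string>) => {
  // Split `highlightedHtml` into chunks and render them lazily, as above.
});

// rawFileContent and detectedLanguage stand in for values the viewer already has.
worker.postMessage({ content: rawFileContent, language: detectedLanguage });
```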