Gim/TREE_SITTER_HIGHLIGHTING_PLAN.md
2026-04-06 22:31:40 -07:00

Tree-sitter Highlighting Implementation Plan

This document is the working plan for replacing Chroma-based highlighting with a Tree-sitter-first syntax system in Gim.

The current renderer in internal/editor/view.go is tightly coupled to Chroma and computes syntax styles during rendering. That is the opposite of the architecture Tree-sitter wants. Tree-sitter works best when parsing and highlighting are maintained as buffer state and rendering only consumes cached results.

This plan assumes:

  • Chroma will be removed entirely.
  • The renderer can be rebuilt to better fit the new syntax model.
  • We are willing to do a full-buffer parse and full rehighlight first, then optimize incrementally.
  • Correct architecture matters more than preserving the current render pipeline.

Project Goal

Build a syntax system where:

  • each buffer owns syntax state
  • Tree-sitter parsing is maintained across edits
  • highlights are cached outside the renderer
  • the renderer consumes precomputed style data
  • byte-oriented parser results are converted into rune-oriented render data

Success Criteria

  • internal/editor/view.go does not directly call Chroma or Tree-sitter
  • Chroma is fully removed from the codebase
  • syntax state exists independently from rendering
  • each buffer can be parsed and highlighted through Tree-sitter
  • the renderer reads cached highlight data for visible lines
  • edits invalidate and recompute syntax state
  • the system handles UTF-8 text correctly
  • multi-line captures work correctly
  • incremental parsing exists for normal text edits
  • syntax-related behavior has focused tests

Architectural Direction

The target data flow is:

Buffer -> Syntax Engine -> Highlight Cache -> Renderer

Not:

Renderer -> Parse -> Highlight -> Draw

Core separation of concerns:

  • internal/core: holds text and buffer mutation behavior.
  • internal/syntax: owns parser state, queries, highlight cache, invalidation, and update logic.
  • internal/style: owns theme mapping from capture names to lipgloss.Style.
  • internal/editor: owns rendering, cursor, selection overlay, gutters, statusline, and viewport logic.

Key Constraints And Risks

Byte vs Rune Indexing

Tree-sitter reports positions in bytes.

The editor currently renders by runes.

This means the syntax engine must own conversion from byte-based capture ranges to rune-based render ranges. This conversion should never be spread across the renderer.

  • define one internal representation for parser/query positions
  • define one internal representation for render positions
  • keep conversion logic isolated inside internal/syntax
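
As a minimal sketch of that isolated conversion, the helper below turns a byte range within one line into a rune range. The names ByteRange, RuneRange, and byteToRuneRange are hypothetical and not part of Gim today, and the sketch assumes offsets fall on UTF-8 rune boundaries, which is what a parser working on valid UTF-8 should report.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// ByteRange is a half-open [Start, End) range in bytes within one line,
// the unit Tree-sitter reports. Hypothetical type for illustration.
type ByteRange struct{ Start, End int }

// RuneRange is the same half-open range measured in runes, the unit
// the renderer draws in. Hypothetical type for illustration.
type RuneRange struct{ Start, End int }

// byteToRuneRange converts a byte range within line into a rune range.
// Offsets outside the line are clamped to its bounds.
func byteToRuneRange(line string, br ByteRange) RuneRange {
	clamp := func(n int) int {
		if n < 0 {
			return 0
		}
		if n > len(line) {
			return len(line)
		}
		return n
	}
	start, end := clamp(br.Start), clamp(br.End)
	if end < start {
		end = start
	}
	startRune := utf8.RuneCountInString(line[:start])
	return RuneRange{
		Start: startRune,
		End:   startRune + utf8.RuneCountInString(line[start:end]),
	}
}

func main() {
	line := `s := "héllo"`
	// The string literal occupies bytes [5, 13) because "é" is two bytes,
	// but it covers runes [5, 12).
	fmt.Println(byteToRuneRange(line, ByteRange{Start: 5, End: 13})) // prints {5 12}
}
```

Keeping the two range types distinct makes it harder to pass a byte offset to code that expects a rune offset.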

Multi-line Captures

Strings, comments, and some language constructs can span lines.

  • highlight cache supports ranges spanning multiple lines
  • renderer can consume per-line results from multi-line captures
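
One way to give the renderer per-line results is to cut each multi-line capture into per-line pieces when the cache is built. The sketch below assumes captures arrive as (row, byte column) points, as Tree-sitter reports them; Point, LineRange, and splitCapture are hypothetical names.

```go
package main

import "fmt"

// Point is a (row, byte column) position, mirroring how Tree-sitter
// reports node positions.
type Point struct{ Row, Col int }

// LineRange is the part of a capture that falls on a single line,
// as a half-open byte range [Start, End) within that line.
type LineRange struct{ Row, Start, End int }

// splitCapture cuts one capture into per-line pieces. Lines the capture
// crosses without covering any bytes (for example blank lines) yield
// no piece, since there is nothing to style on them.
func splitCapture(lines []string, start, end Point) []LineRange {
	var out []LineRange
	for row := start.Row; row <= end.Row && row < len(lines); row++ {
		if row < 0 {
			continue
		}
		from, to := 0, len(lines[row])
		if row == start.Row {
			from = start.Col
		}
		if row == end.Row && end.Col < to {
			to = end.Col
		}
		if from < to {
			out = append(out, LineRange{Row: row, Start: from, End: to})
		}
	}
	return out
}

func main() {
	lines := []string{"x := 1 /* start", "middle", "end */ y := 2"}
	// A block comment running from row 0, column 7 to row 2, column 6.
	for _, r := range splitCapture(lines, Point{0, 7}, Point{2, 6}) {
		fmt.Printf("%+v\n", r)
	}
}
```

Each piece can then go through the byte-to-rune conversion for its own line.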

Query Precedence

Tree-sitter queries can produce overlapping captures.

  • define deterministic precedence rules
  • document how broad captures and specific captures are resolved
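
As an illustration of one possible precedence rule (not a decision this plan has made), the sketch below paints wider captures first so that narrower captures override them, and breaks ties between equally wide captures in favor of the later query pattern. Capture and resolve are hypothetical names, and capture names stand in for styles.

```go
package main

import (
	"fmt"
	"sort"
)

// Capture is one query capture on a single line, already converted to a
// half-open rune range. Pattern is the index of the query pattern that
// produced it.
type Capture struct {
	Name       string
	Start, End int
	Pattern    int
}

// resolve returns the winning capture name for each rune of the line.
// Runes no capture covers get the empty string.
func resolve(lineLen int, caps []Capture) []string {
	sorted := append([]Capture(nil), caps...)
	sort.SliceStable(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if wa, wb := a.End-a.Start, b.End-b.Start; wa != wb {
			return wa > wb // wider first, so narrower paints over it
		}
		return a.Pattern < b.Pattern // later pattern paints last and wins
	})
	out := make([]string, lineLen)
	for _, c := range sorted {
		start := c.Start
		if start < 0 {
			start = 0
		}
		for i := start; i < c.End && i < lineLen; i++ {
			out[i] = c.Name
		}
	}
	return out
}

func main() {
	// The five runes of the literal "a\n": a string containing an escape.
	caps := []Capture{
		{Name: "string.escape", Start: 2, End: 4, Pattern: 1},
		{Name: "string", Start: 0, End: 5, Pattern: 0},
	}
	fmt.Println(resolve(5, caps))
}
```

The result does not depend on the order in which captures arrive, which is the property the checklist asks for.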

Full Parse First, Incremental Later

The initial version does not need to be optimal.

  • initial version can parse and rehighlight the full buffer
  • follow-up version uses tree.Edit, old trees, and changed ranges

Target Package Layout

Planned package layout:

  • internal/syntax/types.go
  • internal/syntax/engine.go
  • internal/syntax/state.go
  • internal/syntax/registry.go
  • internal/syntax/treesitter.go
  • internal/syntax/query.go
  • internal/syntax/cache.go
  • internal/style/theme.go or equivalent capture-to-style mapping helpers

Likely existing files to update:

  • internal/editor/model.go
  • internal/editor/model_builder.go
  • internal/editor/view.go
  • internal/core/buffer.go
  • internal/command/handlers.go
  • go.mod

Likely files to remove or heavily reduce:

  • Chroma-specific logic in internal/style/style.go
  • direct Chroma setup in editor model builders and command handlers

Data Model Plan

1. Syntax Engine

The syntax engine should be editor-facing and buffer-aware.

Responsibilities:

  • attach syntax state to buffers
  • initialize parser and query data from filetype
  • reparse after edits
  • maintain dirty regions or dirty lines
  • build cached line highlight results
  • expose line results to the renderer

Checklist:

  • define Engine interface in internal/syntax/engine.go
  • decide whether syntax state is owned directly by the engine or attached to buffers
  • add a field on editor.Model for the syntax engine
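
To make the boundary concrete, here is a rough sketch of what the interface could look like, paired with a placeholder implementation that returns only default styles. Every name here (Engine, BufferID, StyleID, plainEngine, and the method set) is an assumption for illustration; the real interface depends on the ownership decision in the checklist above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// BufferID and StyleID are hypothetical identifiers. StyleID 0 stands
// for the default, unstyled text style.
type BufferID int
type StyleID uint16

// Engine is the surface the editor and renderer would talk to.
type Engine interface {
	// Attach registers a buffer and its filetype and builds initial state.
	Attach(id BufferID, filetype string, lines []string)
	// Update replaces the engine's view of the buffer text after an edit.
	Update(id BufferID, lines []string)
	// LineStyles returns one StyleID per rune of the line, or nil when
	// the buffer or line is unknown.
	LineStyles(id BufferID, line int) []StyleID
	// Detach drops all state for a closed buffer.
	Detach(id BufferID)
}

// plainEngine is a placeholder that styles nothing. It lets the renderer
// be rewired before any parser exists.
type plainEngine struct {
	buffers map[BufferID][]string
}

func newPlainEngine() *plainEngine {
	return &plainEngine{buffers: make(map[BufferID][]string)}
}

func (e *plainEngine) Attach(id BufferID, _ string, lines []string) { e.Update(id, lines) }

func (e *plainEngine) Update(id BufferID, lines []string) {
	e.buffers[id] = append([]string(nil), lines...)
}

func (e *plainEngine) Detach(id BufferID) { delete(e.buffers, id) }

func (e *plainEngine) LineStyles(id BufferID, line int) []StyleID {
	lines, ok := e.buffers[id]
	if !ok || line < 0 || line >= len(lines) {
		return nil
	}
	return make([]StyleID, utf8.RuneCountInString(lines[line]))
}

func main() {
	var eng Engine = newPlainEngine()
	eng.Attach(1, "go", []string{"package main", "// héllo"})
	fmt.Println(len(eng.LineStyles(1, 1)), eng.LineStyles(1, 5) == nil)
}
```

A Tree-sitter-backed engine could later satisfy the same interface without the renderer changing.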

2. Per-buffer Syntax State

Each buffer needs syntax state. The important point is that syntax is buffer-level, not window-level.

Suggested fields:

  • parser
  • language
  • query or compiled query set
  • current parse tree
  • source snapshot or source builder access
  • dirty line or dirty range tracking
  • cached line highlight results
  • version counter for cache invalidation
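
The version counter is the least obvious of these fields, so here is a small sketch of how it could invalidate cached lines without touching each one. State, cachedLine, and the method names are hypothetical, and the parser, language, query, and tree fields are left out because their types depend on the Tree-sitter binding that gets chosen.

```go
package main

import "fmt"

// StyleID is a hypothetical style identifier; 0 means default.
type StyleID uint16

// cachedLine remembers which buffer version its styles were built for.
type cachedLine struct {
	version uint64
	styles  []StyleID
}

// State is a cut-down per-buffer syntax state holding only the cache
// and the version counter.
type State struct {
	version uint64
	lines   map[int]cachedLine
}

func NewState() *State { return &State{lines: make(map[int]cachedLine)} }

// Invalidate marks every cached line stale in O(1) by bumping the version.
func (s *State) Invalidate() { s.version++ }

// Put stores styles for a line at the current version.
func (s *State) Put(line int, styles []StyleID) {
	s.lines[line] = cachedLine{version: s.version, styles: styles}
}

// Get returns cached styles only if they match the current version.
func (s *State) Get(line int) ([]StyleID, bool) {
	c, ok := s.lines[line]
	if !ok || c.version != s.version {
		return nil, false
	}
	return c.styles, true
}

func main() {
	st := NewState()
	st.Put(0, []StyleID{1, 1, 1})
	_, fresh := st.Get(0)
	st.Invalidate() // an edit happened somewhere in the buffer
	_, stale := st.Get(0)
	fmt.Println(fresh, stale) // prints true false
}
```

This whole-buffer invalidation fits the full-rehighlight first version; dirty line tracking would refine it later.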

3. Highlight Cache Representation

Start with the representation that makes integration easiest.

Recommended first version:

  • cached per-line []lipgloss.Style, with one entry per rune

Recommended longer-term representation:

  • cached per-line spans like []Span{StartRune, EndRune, StyleID}

Implementation choice:

  • the first version (Phases 4 and 5) uses per-rune style slices for easiest renderer integration
  • Phase 8 evaluates switching the internal cache to spans
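
The two representations carry the same information, so moving between them is mechanical. As a sketch, the function below run-length encodes a per-rune style slice into spans, using a numeric StyleID in place of lipgloss.Style so the example stays standard-library only. Treating StyleID 0 as "default, no span needed" is an assumption of this sketch.

```go
package main

import "fmt"

// StyleID is a hypothetical style identifier; 0 means default.
type StyleID uint16

// Span is a half-open rune range [StartRune, EndRune) with one style.
type Span struct {
	StartRune, EndRune int
	StyleID            StyleID
}

// toSpans run-length encodes per-rune styles. Runes with the default
// style produce no span.
func toSpans(styles []StyleID) []Span {
	var spans []Span
	for i, id := range styles {
		if id == 0 {
			continue
		}
		// Extend the previous span when it is adjacent and has the same style.
		if n := len(spans); n > 0 && spans[n-1].StyleID == id && spans[n-1].EndRune == i {
			spans[n-1].EndRune = i + 1
			continue
		}
		spans = append(spans, Span{StartRune: i, EndRune: i + 1, StyleID: id})
	}
	return spans
}

func main() {
	fmt.Println(toSpans([]StyleID{1, 1, 0, 2, 2, 2})) // prints [{0 2 1} {3 6 2}]
}
```

Six per-rune entries collapse to two spans here, which is the kind of saving worth measuring before committing to either format.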

4. Theme Mapping

Theme logic should map Tree-sitter captures such as keyword, function, string, comment, and type.builtin to lipgloss.Style.

Checklist:

  • create capture-name to style mapping layer
  • support fallback from specific captures to broader categories
  • keep theme logic independent from parser/query logic
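
Fallback from specific to broad captures can be done by trimming dotted suffixes until a theme entry matches. In the sketch below the theme maps capture names to plain color strings rather than lipgloss.Style so it runs with only the standard library; lookupStyle and the theme contents are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// lookupStyle resolves a capture name against the theme, falling back
// from "type.builtin" to "type" and so on. It reports false when no
// prefix of the capture name has a theme entry.
func lookupStyle(theme map[string]string, capture string) (string, bool) {
	name := capture
	for {
		if style, ok := theme[name]; ok {
			return style, true
		}
		i := strings.LastIndexByte(name, '.')
		if i < 0 {
			return "", false
		}
		name = name[:i] // drop the most specific segment and retry
	}
}

func main() {
	theme := map[string]string{
		"keyword": "magenta",
		"type":    "yellow",
		"string":  "green",
	}
	fmt.Println(lookupStyle(theme, "type.builtin"))           // prints yellow true
	fmt.Println(lookupStyle(theme, "keyword.control.return")) // prints magenta true
}
```

Because the lookup takes only a capture name and a map, it stays independent of parser and query code.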

Phased Implementation Plan

Phase 0: Cleanly Commit To Tree-sitter

Purpose:

Remove architectural assumptions that only make sense for Chroma.

Tasks:

  • decide the initial supported filetypes for Tree-sitter
  • decide where query files live and how they are loaded
  • decide whether main.go demo code should be removed or moved to a more explicit demo location
  • audit Chroma references in the repo
  • list all codepaths that currently construct or depend on style.ChromaStyle

Done when:

  • there is a clear inventory of Chroma-coupled code
  • there is a clear inventory of Tree-sitter assets to load per language

Phase 1: Introduce Syntax As A Real Subsystem

Purpose:

Create the new architecture boundary before changing rendering behavior.

Tasks:

  • create internal/syntax
  • define the engine interface
  • add a syntax engine field to editor.Model
  • initialize the syntax engine in model construction
  • remove direct highlighting calls from view.go
  • route visible line highlighting through the syntax engine

Done when:

  • view.go asks the syntax subsystem for line highlight data
  • syntax work no longer begins inside the render loop itself

Phase 2: Define Buffer Text Access And Edit Notifications

Purpose:

Make buffer mutations visible to the syntax system in a structured way.

Tasks:

  • decide whether edits are emitted from core.Buffer or from editor actions
  • define an internal edit event type
  • include enough data for Tree-sitter incremental edits later
  • wire SetLine, InsertLine, and DeleteLine changes into syntax invalidation
  • decide whether first version uses whole-buffer invalidation

Suggested edit event fields:

  • start byte
  • old end byte
  • new end byte
  • start point
  • old end point
  • new end point
  • affected line range
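
These fields mirror what Tree-sitter's input edit expects (byte offsets plus row and byte-column points). As a sketch, the code below builds such an event for a whole-line replacement like SetLine, assuming the buffer is a slice of lines joined by a single "\n" when handed to the parser. Edit, Point, and setLineEdit are hypothetical names.

```go
package main

import "fmt"

// Point is a (row, byte column) position.
type Point struct{ Row, Col int }

// Edit carries the suggested event fields. FirstLine and LastLine give
// the affected line range, inclusive.
type Edit struct {
	StartByte, OldEndByte, NewEndByte    int
	StartPoint, OldEndPoint, NewEndPoint Point
	FirstLine, LastLine                  int
}

// lineStartByte returns the byte offset of the start of row, counting
// one byte for the newline after each earlier line.
func lineStartByte(lines []string, row int) int {
	off := 0
	for i := 0; i < row; i++ {
		off += len(lines[i]) + 1
	}
	return off
}

// setLineEdit describes replacing the full text of one line. It must be
// called with the lines as they were before the change. Reporting the
// whole line as changed is coarser than necessary but still correct.
func setLineEdit(lines []string, row int, newText string) Edit {
	start := lineStartByte(lines, row)
	return Edit{
		StartByte:   start,
		OldEndByte:  start + len(lines[row]),
		NewEndByte:  start + len(newText),
		StartPoint:  Point{Row: row, Col: 0},
		OldEndPoint: Point{Row: row, Col: len(lines[row])},
		NewEndPoint: Point{Row: row, Col: len(newText)},
		FirstLine:   row,
		LastLine:    row,
	}
}

func main() {
	lines := []string{"package main", "", "var x = 1"}
	e := setLineEdit(lines, 2, "var x = 10")
	fmt.Printf("%+v\n", e)
}
```

InsertLine and DeleteLine would need their own constructors, since they change the row of the end point as well as the byte offsets.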

Done when:

  • syntax invalidation happens when text changes
  • invalidation does not depend on the render loop noticing text changed

Phase 3: Build Minimal Tree-sitter Registry And Loader

Purpose:

Provide one place that maps filetypes to languages and queries.

Tasks:

  • create a registry for language metadata
  • map filetype strings to Tree-sitter language bindings
  • map filetypes to highlight query file paths
  • load and compile queries once per language where practical
  • define behavior for unsupported filetypes

Done when:

  • opening a supported buffer can resolve a language and query set
  • unsupported buffers degrade cleanly without crashing the renderer
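
A registry along these lines could be as small as a map with an explicit "not supported" result. In the sketch below, Language holds only a name and a query path; a real entry would also hold the Tree-sitter language binding and compiled query, which are omitted because their types depend on the binding. The query path shown is a placeholder, not a decision about where queries live.

```go
package main

import (
	"fmt"
	"strings"
)

// Language is the metadata the registry returns for a filetype.
type Language struct {
	Name      string
	QueryPath string // hypothetical location of the highlight query
}

// Registry maps filetype strings to language metadata.
type Registry struct {
	byFiletype map[string]Language
}

func NewRegistry() *Registry {
	return &Registry{byFiletype: make(map[string]Language)}
}

// Register associates one language with one or more filetype strings.
func (r *Registry) Register(lang Language, filetypes ...string) {
	for _, ft := range filetypes {
		r.byFiletype[strings.ToLower(ft)] = lang
	}
}

// Lookup reports false for unsupported filetypes so callers can fall
// back to plain rendering instead of failing.
func (r *Registry) Lookup(filetype string) (Language, bool) {
	lang, ok := r.byFiletype[strings.ToLower(filetype)]
	return lang, ok
}

func main() {
	reg := NewRegistry()
	reg.Register(Language{Name: "go", QueryPath: "queries/go/highlights.scm"}, "go", "golang")

	if lang, ok := reg.Lookup("Go"); ok {
		fmt.Println("highlighting with", lang.QueryPath)
	}
	if _, ok := reg.Lookup("cobol"); !ok {
		fmt.Println("unsupported filetype, rendering as plain text")
	}
}
```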

Phase 4: Implement Full-buffer Parsing And Full-buffer Highlighting

Purpose:

Get correct Tree-sitter highlighting working before optimizing.

Tasks:

  • create per-buffer syntax state
  • build full source text from buffer contents
  • parse full source text into a tree
  • run highlight query across the full tree
  • collect captures in deterministic order
  • resolve overlapping captures consistently
  • convert capture byte ranges into per-line rune-based style maps
  • cache line results for renderer consumption

Done when:

  • a supported filetype can be fully highlighted without Chroma
  • renderer uses cached line results from Tree-sitter

Phase 5: Rebuild Renderer Integration Around Cached Syntax Data

Purpose:

Simplify the renderer so it consumes syntax cache rather than doing syntax work.

Tasks:

  • redesign line render input around line text plus syntax cache
  • ensure gutter rendering stays independent from syntax rendering
  • ensure cursor overlay works on top of syntax styling
  • ensure visual selection overlay works on top of syntax styling
  • verify blank lines and end-of-line cursor rendering still behave correctly
  • verify window width padding still uses background style consistently

Done when:

  • line drawing is purely a render operation
  • no parser or query logic exists in view.go

Phase 6: Remove Chroma Completely

Purpose:

Delete the old highlighting path and simplify styling around capture-based theming.

Tasks:

  • remove Chroma dependencies from go.mod
  • remove GetLexer
  • remove MakeStyleMap
  • remove Styles.ChromaStyle if no longer needed
  • replace Chroma-derived theme extraction with explicit Gim theme definitions
  • update commands that currently switch Chroma styles

Done when:

  • the build no longer depends on Chroma packages
  • no codepath references Chroma tokens, lexers, or styles

Phase 7: Add Incremental Parsing

Purpose:

Move from correct-but-simple to correct-and-efficient.

Tasks:

  • preserve old trees per buffer
  • call tree.Edit before reparsing
  • parse new content using the old tree
  • compute changed ranges
  • decide whether rehighlighting happens by changed byte range, changed point range, or affected line range
  • update only changed cache regions
  • verify cache invalidation around inserted and deleted lines

Done when:

  • small edits do not require full-buffer reparsing and rehighlighting
  • highlighting updates correctly after insertions, deletions, joins, and splits
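
Inserted and deleted lines shift every cached line below them, which is an easy place to get stale highlights. The sketch below shows one way to keep a per-line cache aligned: splice the cache the same way the buffer was spliced and leave the new entries nil so they are recomputed. spliceLines is a hypothetical helper, and using nil to mean "dirty" is an assumption of this sketch.

```go
package main

import "fmt"

// spliceLines mirrors a buffer edit in the line cache: removed lines
// starting at first are dropped and added dirty (nil) entries take
// their place. Lines after the edit keep their cached styles but move
// to their new index. Out-of-range arguments are clamped.
func spliceLines(cache [][]int, first, removed, added int) [][]int {
	if first < 0 {
		first = 0
	}
	if first > len(cache) {
		first = len(cache)
	}
	if removed < 0 {
		removed = 0
	}
	if first+removed > len(cache) {
		removed = len(cache) - first
	}
	if added < 0 {
		added = 0
	}
	out := make([][]int, 0, len(cache)-removed+added)
	out = append(out, cache[:first]...)
	out = append(out, make([][]int, added)...) // nil entries are dirty
	out = append(out, cache[first+removed:]...)
	return out
}

func main() {
	cache := [][]int{{1}, {2}, {3}, {4}}

	// Splitting line 1 in two: one old line removed, two dirty lines added.
	cache = spliceLines(cache, 1, 1, 2)
	for i, styles := range cache {
		fmt.Println(i, styles, styles == nil)
	}
}
```

Lines that keep their cached styles may still need rehighlighting if the parser's changed ranges touch them, for example when an edit opens a multi-line comment.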

Phase 8: Improve Cache Representation If Needed

Purpose:

Reduce memory churn and simplify overlay logic if per-rune style maps become too heavy.

Tasks:

  • measure cost of per-line []lipgloss.Style
  • consider switching internal storage to spans
  • keep renderer-facing API stable if possible
  • optimize only after correctness and incremental behavior exist

Done when:

  • cache format is deliberate rather than inherited from the old renderer

Phase 9: Expand Language Support

Purpose:

Generalize the system after the first language works well.

Tasks:

  • ship one language first, likely Go
  • add additional language bindings and queries one by one
  • verify filetype detection and registry behavior for each language
  • define how language-specific capture tweaks are handled

Done when:

  • the system can scale beyond a single demo language without architectural changes

Phase 10: Testing And Verification

Purpose:

Make syntax behavior trustworthy as the engine evolves.

Tasks:

  • add unit tests for registry lookup
  • add unit tests for byte-to-rune range conversion
  • add unit tests for overlapping capture resolution
  • add unit tests for multi-line highlight extraction
  • add integration tests for visible rendering of highlighted lines
  • add edit tests for incremental updates after insert, delete, split, and join operations
  • add tests covering UTF-8 characters and mixed-width content

Done when:

  • syntax bugs can be reproduced and locked down with tests

Suggested Order Of Attack

If working on this piece by piece, this is the recommended order once the Phase 0 inventory is complete:

  • Phase 1 first
  • Phase 2 second
  • Phase 3 third
  • Phase 4 fourth
  • Phase 5 fifth
  • Phase 6 sixth
  • Phase 7 seventh
  • Phase 10 continuously during all phases
  • Phase 8 only if profiling says it matters
  • Phase 9 after one language is solid

Concrete First Milestone

The first milestone should be intentionally small but architectural.

Milestone goal:

  • create internal/syntax
  • add syntax engine field to editor.Model
  • make view.go consume syntax results instead of computing syntax itself
  • use placeholder or basic full-buffer syntax data, even if the first output is minimal

This milestone matters because it breaks the most important bad dependency: rendering owning syntax.


Concrete Second Milestone

Milestone goal:

  • support one language with Tree-sitter full-buffer parse and full-buffer highlighting
  • cache per-line style results
  • render highlighted output without Chroma

Concrete Third Milestone

Milestone goal:

  • wire edit invalidation into buffer mutation paths
  • update Tree-sitter state after edits
  • keep highlights correct after normal editing commands

Concrete Fourth Milestone

Milestone goal:

  • add true incremental parse updates
  • rehighlight only changed regions
  • validate performance on larger files

Open Design Questions

  • Should syntax state live inside core.Buffer or stay in the syntax engine keyed by buffer ID?
  • Should the renderer consume per-rune styles or span-based styles?
  • Should the syntax engine rebuild full source text on demand, or should buffers expose a stable full-text API?
  • How should unsupported filetypes render: as plain text, or with basic fallback token classes that need no query?
  • Should theme capture fallback be static or configurable?
  • Should parser/query assets be embedded or read from disk at runtime?

Notes For Implementation

Guidelines while building this:

  • keep parsing and rendering separate from the first commit
  • optimize only after correctness is established
  • prefer one supported language done correctly over several partial languages
  • keep UTF-8 correctness in mind from the first Tree-sitter integration
  • avoid letting temporary renderer hacks become permanent API boundaries
  • test line split, line join, backspace-at-start, delete-at-end, and multi-line comments early

Definition Of Done

This project is done when all of the following are true:

  • Chroma is gone
  • Tree-sitter is the only syntax engine
  • syntax state is maintained outside rendering
  • edits update syntax state correctly
  • renderer consumes cached syntax data cleanly
  • highlight output is correct for supported languages
  • UTF-8 behavior is correct
  • incremental parsing is working
  • tests cover the risky pieces