Gim/TREE_SITTER_HIGHLIGHTING_PLAN.md
2026-04-06 22:31:40 -07:00


# Tree-sitter Highlighting Implementation Plan
This document is the working plan for replacing Chroma-based highlighting with a Tree-sitter-first syntax system in Gim.
The current renderer in `internal/editor/view.go` is tightly coupled to Chroma and computes syntax styles during rendering.
That is the opposite of the architecture Tree-sitter wants. Tree-sitter works best when parsing and highlighting are
maintained as buffer state and rendering only consumes cached results.
This plan assumes:
- Chroma will be removed entirely.
- The renderer can be rebuilt to better fit the new syntax model.
- We are willing to do a full-buffer parse and full rehighlight first, then optimize incrementally.
- Correct architecture matters more than preserving the current render pipeline.
---
## Project Goal
Build a syntax system where:
- each buffer owns syntax state
- Tree-sitter parsing is maintained across edits
- highlights are cached outside the renderer
- the renderer consumes precomputed style data
- byte-oriented parser results are converted into rune-oriented render data
---
## Success Criteria
- [ ] `internal/editor/view.go` does not directly call Chroma or Tree-sitter
- [ ] Chroma is fully removed from the codebase
- [ ] syntax state exists independently from rendering
- [ ] each buffer can be parsed and highlighted through Tree-sitter
- [ ] the renderer reads cached highlight data for visible lines
- [ ] edits invalidate and recompute syntax state
- [ ] the system handles UTF-8 text correctly
- [ ] multi-line captures work correctly
- [ ] incremental parsing exists for normal text edits
- [ ] syntax-related behavior has focused tests
---
## Architectural Direction
The target data flow is:
`Buffer -> Syntax Engine -> Highlight Cache -> Renderer`
Not:
`Renderer -> Parse -> Highlight -> Draw`
Core separation of concerns:
- `internal/core`
  Holds text and buffer mutation behavior.
- `internal/syntax`
  Owns parser state, queries, highlight cache, invalidation, and update logic.
- `internal/style`
  Owns theme mapping from capture names to `lipgloss.Style`.
- `internal/editor`
  Owns rendering, cursor, selection overlay, gutters, statusline, and viewport logic.
---
## Key Constraints And Risks
### Byte vs Rune Indexing
Tree-sitter reports positions in bytes.
The editor currently renders by runes.
This means the syntax engine must own the conversion from byte-based capture ranges to rune-based render ranges. That conversion should live in one place and never leak into the renderer.
- [ ] define one internal representation for parser/query positions
- [ ] define one internal representation for render positions
- [ ] keep conversion logic isolated inside `internal/syntax`
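A minimal sketch of the byte-to-rune conversion, assuming the syntax package sees each line as a Go `string` and receives byte columns relative to the start of that line. The name `ByteColToRuneCol` is hypothetical and not part of the existing codebase.

```go
package main

import "unicode/utf8"

// ByteColToRuneCol converts a byte offset within a single line into a rune
// index. Offsets past the end of the line are clamped, and an offset that
// lands inside a multi-byte character is rounded down to the rune that
// contains it.
func ByteColToRuneCol(line string, byteCol int) int {
	if byteCol <= 0 {
		return 0
	}
	if byteCol >= len(line) {
		return utf8.RuneCountInString(line)
	}
	// Walk back to the first byte of the rune that contains byteCol.
	for byteCol > 0 && !utf8.RuneStart(line[byteCol]) {
		byteCol--
	}
	return utf8.RuneCountInString(line[:byteCol])
}
```

Keeping this in a single helper inside `internal/syntax` means the renderer only ever sees rune columns.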
### Multi-line Captures
Strings, comments, and some language constructs can span lines.
- [ ] highlight cache supports ranges spanning multiple lines
- [ ] renderer can consume per-line results from multi-line captures
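One way to feed multi-line captures into a per-line cache is to split each capture into single-line segments. The sketch below assumes capture positions arrive as a zero-based row plus a byte column within that row, and that the buffer is available as a slice of lines without trailing newlines. `Point`, `LineSegment`, and `SplitCapture` are hypothetical names.

```go
package main

import "unicode/utf8"

// Point is a zero-based row and a byte column within that row.
type Point struct{ Row, Col int }

// LineSegment is the part of one capture that falls on a single line,
// expressed in rune columns with an exclusive end.
type LineSegment struct{ Row, StartRune, EndRune int }

func clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// SplitCapture cuts the range [start, end) into one segment per line.
// Lines on which the capture covers no bytes (for example empty lines
// inside a block comment) produce no segment.
func SplitCapture(lines []string, start, end Point) []LineSegment {
	var out []LineSegment
	first := clamp(start.Row, 0, len(lines))
	for row := first; row <= end.Row && row < len(lines); row++ {
		line := lines[row]
		from, to := 0, len(line)
		if row == start.Row {
			from = clamp(start.Col, 0, len(line))
		}
		if row == end.Row {
			to = clamp(end.Col, 0, len(line))
		}
		if to <= from {
			continue
		}
		out = append(out, LineSegment{
			Row:       row,
			StartRune: utf8.RuneCountInString(line[:from]),
			EndRune:   utf8.RuneCountInString(line[:to]),
		})
	}
	return out
}
```

With segments in this shape, the renderer never needs to know that a capture started on an earlier line.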
### Query Precedence
Tree-sitter queries can produce overlapping captures.
- [ ] define deterministic precedence rules
- [ ] document how broad captures and specific captures are resolved
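As an illustration of what a deterministic rule could look like, the sketch below paints broader captures first and narrower ones on top, breaking ties by query pattern index so that a later pattern wins. This particular rule is an assumption chosen for the example, not a decision the plan has made, and `Capture` and `Resolve` are hypothetical names.

```go
package main

import "sort"

// Capture is one query capture on a single line, in rune columns.
type Capture struct {
	Start, End int    // End is exclusive
	Pattern    int    // index of the query pattern that produced the capture
	Name       string // capture name such as "keyword"
}

// Resolve returns the winning capture name for every rune column of a line
// that is width runes long. Columns without any capture stay empty.
func Resolve(width int, caps []Capture) []string {
	ordered := append([]Capture(nil), caps...)
	sort.SliceStable(ordered, func(i, j int) bool {
		a, b := ordered[i], ordered[j]
		la, lb := a.End-a.Start, b.End-b.Start
		if la != lb {
			return la > lb // broader captures are painted first
		}
		return a.Pattern < b.Pattern // later patterns are painted last
	})
	out := make([]string, width)
	for _, c := range ordered {
		start := c.Start
		if start < 0 {
			start = 0
		}
		for i := start; i < c.End && i < width; i++ {
			out[i] = c.Name
		}
	}
	return out
}
```

Whatever rule is chosen, encoding it as a sort order keeps the outcome independent of the order in which the query cursor happens to yield captures.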
### Full Parse First, Incremental Later
The initial version does not need to be optimal.
- [ ] initial version can parse and rehighlight the full buffer
- [ ] follow-up version uses `tree.Edit`, old trees, and changed ranges
---
## Target Package Layout
Planned package layout:
- [ ] `internal/syntax/types.go`
- [ ] `internal/syntax/engine.go`
- [ ] `internal/syntax/state.go`
- [ ] `internal/syntax/registry.go`
- [ ] `internal/syntax/treesitter.go`
- [ ] `internal/syntax/query.go`
- [ ] `internal/syntax/cache.go`
- [ ] `internal/style/theme.go` or equivalent capture-to-style mapping helpers
Likely existing files to update:
- [ ] `internal/editor/model.go`
- [ ] `internal/editor/model_builder.go`
- [ ] `internal/editor/view.go`
- [ ] `internal/core/buffer.go`
- [ ] `internal/command/handlers.go`
- [ ] `go.mod`
Likely files to remove or heavily reduce:
- [ ] Chroma-specific logic in `internal/style/style.go`
- [ ] direct Chroma setup in editor model builders and command handlers
---
## Data Model Plan
### 1. Syntax Engine
The syntax engine should be editor-facing and buffer-aware.
Responsibilities:
- attach syntax state to buffers
- initialize parser and query data from filetype
- reparse after edits
- maintain dirty regions or dirty lines
- build cached line highlight results
- expose line results to the renderer
Checklist:
- [ ] define `Engine` interface in `internal/syntax/engine.go`
- [ ] decide whether syntax state is owned directly by the engine or attached to buffers
- [ ] add a field on `editor.Model` for the syntax engine
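The responsibilities above could translate into an interface along these lines. Everything here is a hypothetical sketch: the method set, the `BufferID` and `StyleID` types, and the line-based `Edit` are assumptions, and a real version would likely return `lipgloss.Style` values or spans instead of integer IDs.

```go
package main

// BufferID identifies a buffer; StyleID stands in for a resolved style.
type BufferID int
type StyleID int

// Edit describes a changed line range: lines [StartLine, OldEndLine) were
// replaced by lines [StartLine, NewEndLine).
type Edit struct{ StartLine, OldEndLine, NewEndLine int }

// Engine is the only syntax surface the editor and renderer talk to.
type Engine interface {
	// Attach creates syntax state for a buffer based on its filetype.
	Attach(id BufferID, filetype string, lines []string)
	// ApplyEdit tells the engine that buffer text changed.
	ApplyEdit(id BufferID, e Edit, lines []string)
	// LineStyles returns cached per-rune styles, or nil for plain text.
	LineStyles(id BufferID, line int) []StyleID
	// Detach drops all state for a closed buffer.
	Detach(id BufferID)
}

// PlainEngine is a no-op implementation for unsupported filetypes and for
// wiring up the renderer before real parsing exists.
type PlainEngine struct{}

func (PlainEngine) Attach(BufferID, string, []string)  {}
func (PlainEngine) ApplyEdit(BufferID, Edit, []string) {}
func (PlainEngine) LineStyles(BufferID, int) []StyleID { return nil }
func (PlainEngine) Detach(BufferID)                    {}

// Compile-time check that PlainEngine satisfies Engine.
var _ Engine = PlainEngine{}
```

Starting with a no-op implementation lets `view.go` switch to the new boundary before any Tree-sitter code lands.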
### 2. Per-buffer Syntax State
Each buffer needs syntax state. The important point is that syntax is buffer-level, not window-level.
Suggested fields:
- [ ] parser
- [ ] language
- [ ] query or compiled query set
- [ ] current parse tree
- [ ] source snapshot or source builder access
- [ ] dirty line or dirty range tracking
- [ ] cached line highlight results
- [ ] version counter for cache invalidation
### 3. Highlight Cache Representation
Start with the representation that makes integration easiest.
Recommended first version:
- cached per-line `[]lipgloss.Style`, with one entry per rune
Recommended longer-term representation:
- cached per-line spans like `[]Span{StartRune, EndRune, StyleID}`
Implementation choice:
- [ ] phase 1 uses per-rune style maps for easiest renderer integration
- [ ] phase 2 evaluates switching internal cache to spans
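The two representations carry the same information, so converting between them is mechanical. The sketch below run-length encodes a per-rune slice into spans and expands it back. `StyleID` is a stand-in for whatever the cache stores, and `ToSpans` and `Expand` are hypothetical names.

```go
package main

// StyleID stands in for a resolved style; 0 means "no highlight".
type StyleID int

// Span is a run of consecutive runes sharing one style. EndRune is exclusive.
type Span struct {
	StartRune, EndRune int
	Style              StyleID
}

// ToSpans run-length encodes a per-rune style slice.
func ToSpans(styles []StyleID) []Span {
	var spans []Span
	for i, s := range styles {
		if n := len(spans); n > 0 && spans[n-1].Style == s {
			spans[n-1].EndRune = i + 1 // extend the current run
			continue
		}
		spans = append(spans, Span{StartRune: i, EndRune: i + 1, Style: s})
	}
	return spans
}

// Expand turns spans back into a per-rune slice of the given width.
// Columns not covered by any span keep the zero StyleID.
func Expand(spans []Span, width int) []StyleID {
	out := make([]StyleID, width)
	for _, sp := range spans {
		for i := sp.StartRune; i < sp.EndRune && i < width; i++ {
			if i >= 0 {
				out[i] = sp.Style
			}
		}
	}
	return out
}
```

Because the round trip is lossless, the renderer-facing API can stay per-rune while the internal cache moves to spans later.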
### 4. Theme Mapping
Theme logic should map Tree-sitter captures such as `keyword`, `function`, `string`, `comment`, and `type.builtin` to `lipgloss.Style`.
Checklist:
- [ ] create capture-name to style mapping layer
- [ ] support fallback from specific captures to broader categories
- [ ] keep theme logic independent from parser/query logic
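Fallback from specific captures to broader categories can be done by trimming dotted suffixes until a match is found. In this sketch the theme maps capture names to plain strings to stay dependency-free; the real mapping would hold `lipgloss.Style` values. `Theme` and `Lookup` are hypothetical names.

```go
package main

import "strings"

// Theme maps capture names to a style. A string stands in for
// lipgloss.Style here so the example has no external dependencies.
type Theme map[string]string

// Lookup resolves a capture name, falling back from specific to broad:
// "type.builtin" tries "type.builtin" first and then "type".
func (t Theme) Lookup(capture string) (string, bool) {
	name := capture
	for {
		if style, ok := t[name]; ok {
			return style, true
		}
		i := strings.LastIndexByte(name, '.')
		if i < 0 {
			return "", false // no broader category left to try
		}
		name = name[:i]
	}
}
```

A theme then only needs entries for broad categories, with specific overrides added where a language benefits from them.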
---
## Phased Implementation Plan
## Phase 0: Cleanly Commit To Tree-sitter
Purpose:
Remove architectural assumptions that only make sense for Chroma.
Tasks:
- [ ] decide the initial supported filetypes for Tree-sitter
- [ ] decide where query files live and how they are loaded
- [ ] decide whether `main.go` demo code should be removed or moved to a more explicit demo location
- [ ] audit Chroma references in the repo
- [ ] list all codepaths that currently construct or depend on `style.ChromaStyle`
Done when:
- [ ] there is a clear inventory of Chroma-coupled code
- [ ] there is a clear inventory of Tree-sitter assets to load per language
## Phase 1: Introduce Syntax As A Real Subsystem
Purpose:
Create the new architecture boundary before changing rendering behavior.
Tasks:
- [ ] create `internal/syntax`
- [ ] define the engine interface
- [ ] add a syntax engine field to `editor.Model`
- [ ] initialize the syntax engine in model construction
- [ ] remove direct highlighting calls from `view.go`
- [ ] route visible line highlighting through the syntax engine
Done when:
- [ ] `view.go` asks the syntax subsystem for line highlight data
- [ ] syntax work no longer begins inside the render loop itself
## Phase 2: Define Buffer Text Access And Edit Notifications
Purpose:
Make buffer mutations visible to the syntax system in a structured way.
Tasks:
- [ ] decide whether edits are emitted from `core.Buffer` or from editor actions
- [ ] define an internal edit event type
- [ ] include enough data for Tree-sitter incremental edits later
- [ ] wire `SetLine`, `InsertLine`, and `DeleteLine` changes into syntax invalidation
- [ ] decide whether first version uses whole-buffer invalidation
Suggested edit event fields:
- [ ] start byte
- [ ] old end byte
- [ ] new end byte
- [ ] start point
- [ ] old end point
- [ ] new end point
- [ ] affected line range
Done when:
- [ ] syntax invalidation happens when text changes
- [ ] invalidation does not depend on the render loop noticing text changed
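To make the suggested fields concrete, here is a sketch of the event a `SetLine` call could emit. It assumes the parser sees the buffer as its lines joined with `"\n"`, and it describes the whole line as replaced instead of computing a minimal diff. `Edit`, `Point`, and `SetLineEdit` are hypothetical names.

```go
package main

// Point is a zero-based row and a byte column within that row.
type Point struct{ Row, Col int }

// Edit carries the data an incremental parser update needs.
type Edit struct {
	StartByte, OldEndByte, NewEndByte    int
	StartPoint, OldEndPoint, NewEndPoint Point
	FirstLine, LastLine                  int // affected line range, inclusive
}

// SetLineEdit describes replacing lines[row] with newText. lines must be
// the buffer contents before the change, and row must be a valid index.
func SetLineEdit(lines []string, row int, newText string) Edit {
	start := 0
	for _, l := range lines[:row] {
		start += len(l) + 1 // +1 for the "\n" that joins lines
	}
	old := lines[row]
	return Edit{
		StartByte:   start,
		OldEndByte:  start + len(old),
		NewEndByte:  start + len(newText),
		StartPoint:  Point{Row: row, Col: 0},
		OldEndPoint: Point{Row: row, Col: len(old)},
		NewEndPoint: Point{Row: row, Col: len(newText)},
		FirstLine:   row,
		LastLine:    row,
	}
}
```

Summing line lengths on every edit is linear in the number of preceding lines, which is acceptable for a first version and can later be replaced by cached line offsets.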
## Phase 3: Build Minimal Tree-sitter Registry And Loader
Purpose:
Provide one place that maps filetypes to languages and queries.
Tasks:
- [ ] create a registry for language metadata
- [ ] map filetype strings to Tree-sitter language bindings
- [ ] map filetypes to highlight query file paths
- [ ] load and compile queries once per language where practical
- [ ] define behavior for unsupported filetypes
Done when:
- [ ] opening a supported buffer can resolve a language and query set
- [ ] unsupported buffers degrade cleanly without crashing the renderer
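A registry of this kind can start as a small map with an explicit "not supported" result. In the sketch below, `Language` holds only a name and a query path; a real entry would also carry the Tree-sitter language binding and the compiled query. All names and the example path are hypothetical.

```go
package main

import "strings"

// Language is the metadata needed to highlight one filetype.
type Language struct {
	Name      string
	QueryPath string // location of the highlight query (hypothetical layout)
}

// Registry maps filetype strings to language metadata.
type Registry struct {
	byFiletype map[string]Language
}

func NewRegistry() *Registry {
	return &Registry{byFiletype: map[string]Language{}}
}

// Register associates a filetype with a language.
func (r *Registry) Register(filetype string, lang Language) {
	r.byFiletype[normalize(filetype)] = lang
}

// Lookup reports ok == false for unsupported filetypes so the caller can
// fall back to plain text instead of failing.
func (r *Registry) Lookup(filetype string) (lang Language, ok bool) {
	lang, ok = r.byFiletype[normalize(filetype)]
	return lang, ok
}

func normalize(filetype string) string {
	return strings.ToLower(strings.TrimSpace(filetype))
}
```

Returning a boolean instead of an error keeps the unsupported case an ordinary branch in the caller, which matches the goal of degrading cleanly.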
## Phase 4: Implement Full-buffer Parsing And Full-buffer Highlighting
Purpose:
Get correct Tree-sitter highlighting working before optimizing.
Tasks:
- [ ] create per-buffer syntax state
- [ ] build full source text from buffer contents
- [ ] parse full source text into a tree
- [ ] run highlight query across the full tree
- [ ] collect captures in deterministic order
- [ ] resolve overlapping captures consistently
- [ ] convert capture byte ranges into per-line rune-based style maps
- [ ] cache line results for renderer consumption
Done when:
- [ ] a supported filetype can be fully highlighted without Chroma
- [ ] renderer uses cached line results from Tree-sitter
## Phase 5: Rebuild Renderer Integration Around Cached Syntax Data
Purpose:
Simplify the renderer so it consumes syntax cache rather than doing syntax work.
Tasks:
- [ ] redesign line render input around line text plus syntax cache
- [ ] ensure gutter rendering stays independent from syntax rendering
- [ ] ensure cursor overlay works on top of syntax styling
- [ ] ensure visual selection overlay works on top of syntax styling
- [ ] verify blank lines and end-of-line cursor rendering still behave correctly
- [ ] verify window width padding still uses background style consistently
Done when:
- [ ] line drawing is purely a render operation
- [ ] no parser or query logic exists in `view.go`
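One way to keep overlays independent from syntax is to compose them in a separate pass over the cached styles. The sketch below is an assumption about how that layering could look: it pads the line so a cursor on a blank line or past the last rune still has a cell to draw in. `Cell`, `StyleID`, and `Compose` are hypothetical names.

```go
package main

// StyleID stands in for a cached syntax style.
type StyleID int

// Cell is what the renderer needs to draw one column of a line.
type Cell struct {
	Syntax   StyleID
	Selected bool
	Cursor   bool
}

// Compose layers selection and cursor flags over cached syntax styles.
// The selection covers rune columns [selStart, selEnd). Pass cursor = -1
// when the cursor is not on this line.
func Compose(syntax []StyleID, selStart, selEnd, cursor int) []Cell {
	n := len(syntax)
	if cursor >= n {
		n = cursor + 1 // make room for a cursor past the end of the line
	}
	cells := make([]Cell, n)
	for i := range cells {
		if i < len(syntax) {
			cells[i].Syntax = syntax[i]
		}
		cells[i].Selected = i >= selStart && i < selEnd
		cells[i].Cursor = i == cursor
	}
	return cells
}
```

The final style for a cell can then be chosen by a fixed priority, for example cursor over selection over syntax, without the syntax cache knowing about either overlay.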
## Phase 6: Remove Chroma Completely
Purpose:
Delete the old highlighting path and simplify styling around capture-based theming.
Tasks:
- [ ] remove Chroma dependencies from `go.mod`
- [ ] remove `GetLexer`
- [ ] remove `MakeStyleMap`
- [ ] remove `Styles.ChromaStyle` if no longer needed
- [ ] replace Chroma-derived theme extraction with explicit Gim theme definitions
- [ ] update commands that currently switch Chroma styles
Done when:
- [ ] the build no longer depends on Chroma packages
- [ ] no codepath references Chroma tokens, lexers, or styles
## Phase 7: Add Incremental Parsing
Purpose:
Move from correct-but-simple to correct-and-efficient.
Tasks:
- [ ] preserve old trees per buffer
- [ ] call `tree.Edit` before reparsing
- [ ] parse new content using the old tree
- [ ] compute changed ranges
- [ ] decide whether rehighlighting happens by changed byte range, changed point range, or affected line range
- [ ] update only changed cache regions
- [ ] verify cache invalidation around inserted and deleted lines
Done when:
- [ ] small edits do not require full-buffer reparsing and rehighlighting
- [ ] highlighting updates correctly after insertions, deletions, joins, and splits
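Cache invalidation around inserted and deleted lines is largely index bookkeeping. The sketch below shifts cached per-line entries when lines are added or removed and marks the new slots dirty with `nil`. It covers only the line shift; lines whose highlighting changes because of the edit (for example after opening a block comment) would still need to be marked dirty from the parser's changed ranges. `Splice` and `StyleID` are hypothetical names.

```go
package main

// StyleID stands in for a cached syntax style.
type StyleID int

// Splice replaces `removed` cache entries starting at line `at` with
// `inserted` dirty (nil) entries. Entries after the edit keep their cached
// styles and move to their new line index.
func Splice(cache [][]StyleID, at, removed, inserted int) [][]StyleID {
	if at < 0 {
		at = 0
	}
	if at > len(cache) {
		at = len(cache)
	}
	if removed < 0 {
		removed = 0
	}
	if at+removed > len(cache) {
		removed = len(cache) - at
	}
	if inserted < 0 {
		inserted = 0
	}
	out := make([][]StyleID, 0, len(cache)-removed+inserted)
	out = append(out, cache[:at]...)
	out = append(out, make([][]StyleID, inserted)...) // nil marks a dirty line
	out = append(out, cache[at+removed:]...)
	return out
}
```

For example, splitting line 1 into two lines is `Splice(cache, 1, 1, 2)`, and joining lines 1 and 2 is `Splice(cache, 1, 2, 1)`.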
## Phase 8: Improve Cache Representation If Needed
Purpose:
Reduce memory churn and simplify overlay logic if per-rune style maps become too heavy.
Tasks:
- [ ] measure cost of per-line `[]lipgloss.Style`
- [ ] consider switching internal storage to spans
- [ ] keep renderer-facing API stable if possible
- [ ] optimize only after correctness and incremental behavior exist
Done when:
- [ ] cache format is deliberate rather than inherited from the old renderer
## Phase 9: Expand Language Support
Purpose:
Generalize the system after the first language works well.
Tasks:
- [ ] ship one language first, likely Go
- [ ] add additional language bindings and queries one by one
- [ ] verify filetype detection and registry behavior for each language
- [ ] define how language-specific capture tweaks are handled
Done when:
- [ ] the system can scale beyond a single demo language without architectural changes
## Phase 10: Testing And Verification
Purpose:
Make syntax behavior trustworthy as the engine evolves.
Tasks:
- [ ] add unit tests for registry lookup
- [ ] add unit tests for byte-to-rune range conversion
- [ ] add unit tests for overlapping capture resolution
- [ ] add unit tests for multi-line highlight extraction
- [ ] add integration tests for visible rendering of highlighted lines
- [ ] add edit tests for incremental updates after insert, delete, split, and join operations
- [ ] add tests covering UTF-8 characters and mixed-width content
Done when:
- [ ] syntax bugs can be reproduced and locked down with tests
---
## Suggested Order Of Attack
If working on this piece by piece, this is the recommended order:
- [ ] Phase 1 first
- [ ] Phase 2 second
- [ ] Phase 3 third
- [ ] Phase 4 fourth
- [ ] Phase 5 fifth
- [ ] Phase 6 sixth
- [ ] Phase 7 seventh
- [ ] Phase 10 continuously during all phases
- [ ] Phase 8 only if profiling says it matters
- [ ] Phase 9 after one language is solid
---
## Concrete First Milestone
The first milestone should be intentionally small but architectural.
Milestone goal:
- [ ] create `internal/syntax`
- [ ] add syntax engine field to `editor.Model`
- [ ] make `view.go` consume syntax results instead of computing syntax itself
- [ ] use placeholder or basic full-buffer syntax data, even if the first output is minimal
This milestone matters because it breaks the most important bad dependency: rendering owning syntax.
---
## Concrete Second Milestone
Milestone goal:
- [ ] support one language with Tree-sitter full-buffer parse and full-buffer highlighting
- [ ] cache per-line style results
- [ ] render highlighted output without Chroma
---
## Concrete Third Milestone
Milestone goal:
- [ ] wire edit invalidation into buffer mutation paths
- [ ] update Tree-sitter state after edits
- [ ] keep highlights correct after normal editing commands
---
## Concrete Fourth Milestone
Milestone goal:
- [ ] add true incremental parse updates
- [ ] rehighlight only changed regions
- [ ] validate performance on larger files
---
## Open Design Questions
- [ ] Should syntax state live inside `core.Buffer` or stay in the syntax engine keyed by buffer ID?
- [ ] Should the renderer consume per-rune styles or span-based styles?
- [ ] Should the syntax engine rebuild full source text on demand, or should buffers expose a stable full-text API?
- [ ] How should unsupported filetypes render: as plain text, or through a query-free fallback that assigns basic token classes?
- [ ] Should theme capture fallback be static or configurable?
- [ ] Should parser/query assets be embedded or read from disk at runtime?
---
## Notes For Implementation
Guidelines while building this:
- [ ] keep parsing and rendering separate from the first commit
- [ ] optimize only after correctness is established
- [ ] prefer one supported language done correctly over several partial languages
- [ ] keep UTF-8 correctness in mind from the first Tree-sitter integration
- [ ] avoid letting temporary renderer hacks become permanent API boundaries
- [ ] test line split, line join, backspace-at-start, delete-at-end, and multi-line comments early
---
## Definition Of Done
This project is done when all of the following are true:
- [ ] Chroma is gone
- [ ] Tree-sitter is the only syntax engine
- [ ] syntax state is maintained outside rendering
- [ ] edits update syntax state correctly
- [ ] renderer consumes cached syntax data cleanly
- [ ] highlight output is correct for supported languages
- [ ] UTF-8 behavior is correct
- [ ] incremental parsing is working
- [ ] tests cover the risky pieces