This document covers how PHPantom discovers, parses, and caches class definitions across the workspace. The goal is to remain fast and lightweight by default while offering progressively richer modes for users who want exhaustive workspace intelligence.
Items are ordered by impact (descending), then effort (ascending) within the same impact tier.
| Label | Scale |
|---|---|
| Impact | Critical, High, Medium-High, Medium, Low-Medium, Low |
| Effort | Low (≤ 1 day), Medium (2-5 days), Medium-High (1-2 weeks), High (2-4 weeks), Very High (> 1 month) |
PHPantom has three byte-level scanners (no AST) for early-stage file discovery:
- composer-classmap — parses Composer's
autoload_classmap.phpinto an in-memoryHashMap<String, PathBuf>. - PSR-4 scanner (
find_classes) — walks PSR-4 directories fromcomposer.jsonand extracts class FQNs with namespace compliance filtering. - full-scan (
find_symbols) — walks files and extracts classes, standalone functions,define()constants, and top-levelconstdeclarations in a single pass.
These scanners serve three scenarios at startup:
| Scenario | Class discovery | Function & constant discovery |
|---|---|---|
| Composer project (classmap complete) | composer-classmap | autoload_files.php byte-level scan + lazy parse |
| Composer project (classmap missing/incomplete) | PSR-4 scanner + vendor packages | autoload_files.php byte-level scan + lazy parse |
No composer.json |
full-scan on all workspace files | full-scan on all workspace files |
Monorepo (no root composer.json, subprojects found) |
Per-subproject: composer-classmap or PSR-4 + vendor packages. Loose files: full-scan with skip set | Per-subproject: autoload_files.php byte-level scan + lazy parse. Loose files: full-scan |
The "no composer.json" path is fully lightweight: find_symbols
populates classmap, autoload_function_index, and
autoload_constant_index in one pass, and lazy update_ast on first
access provides complete FunctionInfo/DefineInfo. All directory
walkers (full-scan, PSR-4 scanner, vendor package scanner, and
go-to-implementation file collector) use the ignore crate for
gitignore-aware traversal instead of hardcoded directory name
filtering. Hidden directories are skipped automatically.
The monorepo path activates when there is no root composer.json but
discover_subproject_roots finds subdirectories with their own
composer.json files. Each subproject is processed through the full
Composer pipeline (PSR-4, classmap, vendor packages, autoload files)
and results are merged into the shared backend state. Loose PHP files
outside subproject trees are picked up by the full-scan walker with a
skip set that prevents double-scanning subproject directories. See
the ARCHITECTURE.md Composer Integration section for full details.
Find References parses files in parallel via std::thread::scope.
Go-to-Implementation walks classmap files sequentially.
Four indexing strategies, selectable via .phpantom.toml:
[indexing]
# "composer" (default) - merged classmap + self-scan
# "self" - always self-scan, ignore composer classmap
# "full" - background-parse all project files for rich intelligence
# "none" - no proactive scanning
strategy = "composer"Merged classmap + self-scan. Load Composer's classmap (if it exists) as a skip set, then self-scan all PSR-4 and vendor directories for anything the classmap missed. Whatever the classmap already covers is a free performance win; whatever it's missing, we find ourselves. No completeness heuristic needed. This is the zero-config experience.
Always build the classmap ourselves. Ignores autoload_classmap.php
entirely. Equivalent to the merged approach with an empty skip set.
For users who prefer PHPantom's own scanner or who are actively
editing composer.json dependencies.
Background-parse every PHP file in the project. Uses Composer data to guide file discovery when available, falls back to scanning all PHP files in the workspace when it is not. Populates the ast_map, symbol_maps, and all derived indices. Enables workspace symbols, fast find-references without on-demand scanning, and rich hover on completion items. Memory usage grows proportionally to project size.
No proactive file scanning. Still uses Composer's classmap if present,
still resolves classes on demand when the user triggers completion or
hover, still has embedded stubs. The only difference from "composer"
is that it never self-scans to fill gaps.
Goal: Keep the classmap fresh without user intervention.
- On
workspace/didChangeWatchedFiles: ifcomposer.jsonorcomposer.lockchanged, schedule a rescan of vendor directories (the user likely rancomposer installorcomposer update). - On
did_saveof a PHP file: if the file is in a PSR-4 directory, do a targeted single-file rescan (read the file, extract class names, update the classmap entry). This is cheap enough to do synchronously.
For single-file changes, re-scan only that file and update/remove its classmap entries. No need to rescan the entire workspace.
For dependency changes (vendor rescan), this is the expensive case but happens rarely (a few times per day at most).
Goal: Speed up workspace-wide operations (find references, go-to-implementation, self-scan, diagnostics) by processing files in parallel with priority awareness.
All prerequisites (RwLock, Arc<String>, Arc<SymbolMap>) are
complete.
ensure_workspace_indexed (used by find references) now parses files
in parallel via two helpers in references/mod.rs:
parse_files_parallel— takes(uri, Option<content>)pairs, loads content viaget_file_contentwhen not provided, splits work into chunks, and parses each chunk in a separate OS thread.parse_paths_parallel— takes(uri, PathBuf)pairs, reads files from disk and parses them in parallel.
Both use std::thread::scope for structured concurrency (all threads
join before the function returns). The thread count is capped at
std::thread::available_parallelism() (typically the number of CPU
cores). Batches of 2 or fewer files skip threading overhead.
Transient entry eviction after GTI and find references has been
removed. Parsed files stay cached in ast_map, symbol_maps,
use_map, and namespace_map so that subsequent operations benefit
from the work already done. This trades a small amount of memory for
faster repeat queries and simpler code.
Self-scan classmap building (scan_psr4_directories,
scan_directories, scan_vendor_packages,
scan_workspace_fallback_full) now uses a two-phase approach:
directory walks collect file paths first (single-threaded), then files
are read and scanned in parallel batches via std::thread::scope.
Three parallel helpers in classmap_scanner.rs cover the three scan
modes: scan_files_parallel_classes (plain classmap),
scan_files_parallel_psr4 (PSR-4 with FQN filtering), and
scan_files_parallel_full (classes + functions + constants). Small
batches (≤ 4 files) skip threading overhead.
The byte-level PHP scanner (find_classes, find_symbols) uses
memchr SIMD acceleration to skip line comments, block comments,
single-quoted strings, double-quoted strings, and heredocs/nowdocs
instead of scanning byte-by-byte. This reduces per-file scanning time
for files with large docblocks or string literals.
The following are deferred to a later sprint:
- Priority-aware scheduling. Interactive requests (completion, hover, go-to-definition) should preempt batch work. Currently all threads run at equal priority.
- Parallel classmap scanning in
find_implementors. Phase 3 offind_implementorsreads and parses many classmap files sequentially. Parallelizing this requires care because it interleaves reads and writes throughclass_loadercallbacks. memmap2for file reads. Avoids copying file contents into userspace when the OS page cache already has them.- Parallel autoload file scanning. The
scan_autoload_fileswork queue is inherently sequential due torequire_oncechain following, but the initial batch of files could be processed in parallel before following chains.
rayon is the obvious choice for "process N files in parallel" and
Libretto uses it successfully. But it runs its own thread pool
separate from tokio's runtime. When rayon saturates all cores on a
batch scan, tokio's async tasks (completion, hover, signature help)
get starved for CPU time. There is no clean way to pause a rayon
batch when a high-priority LSP request arrives.
The classmap is a convenience for O(1) class lookup and class name completion. But most resolution already works on demand via PSR-4 (derive path from namespace, check if file exists). Class name completion is a minor subset of what users actually trigger. This means classmap generation can run at normal priority without blocking the user. They can start writing code immediately while the classmap builds in the background.
Goal: Show type signatures, docblock descriptions, and deprecation info in completion item hover without parsing every possible class up front.
When completion shows SomeClass::doThing(), hovering over that item
in the completion menu shows nothing because we haven't parsed
SomeClass's file yet. Parsing it on demand would be fine for one
item, but the editor may request resolve for dozens of items as the
user scrolls.
Use completionItem/resolve to populate detail and
documentation fields. If the class is already in the ast_map (parsed
during a prior resolution), return the full signature and docblock.
If not, return just the item label with no extra detail.
In "full" mode, everything is already parsed, so every completion
item gets rich hover for free. In "composer" / "self" mode, items
that happen to have been resolved earlier in the session get rich
detail; others don't. This is a graceful degradation that never blocks
the completion response.
When a completion list is generated, queue the unresolved classes for background parsing at low priority. If the user lingers on the completion menu, resolved items will progressively gain detail. This is a nice-to-have, not a requirement.
Goal: Parse every PHP file in the project in the background, enabling workspace symbols, fast find-references without on-demand scanning, and complete completion item detail.
Prerequisites (from performance.md):
Arc<ClassInfo>(P1). Full indexing stores aClassInfofor every class in the project. WithoutArc, every resolution clones the entire struct out of the map. WithArc, retrieval is a reference-count increment. This is the difference between full indexing using ~200 MB vs. ~500 MB for a large project.
When strategy = "full" is set in .phpantom.toml.
Full mode is not a separate discovery system. It works exactly like
"self" mode (self-scan) and then schedules a second pass:
- First pass (same as self): Build the classmap via byte-level scanning. This completes in about a second and gives us class name completion and O(1) file lookup.
- Second pass: Iterate every file path in the now-populated
in-memory classmap and call
update_aston each one at Low priority. This populates ast_map, symbol_maps, class_index, global_functions, and global_defines.
No new file discovery logic is needed. The classmap from the first pass already contains every relevant file path. The second pass just enriches it.
When composer.json does not exist (e.g. the user opened a monorepo
root or a non-Composer project), the first pass falls back to walking
all PHP files under the workspace root, so the second pass still has
a complete file list to work from.
The user experiences three stages:
- Immediate: LSP requests are up and running. Completion, hover, and go-to-definition work via on-demand resolution and stubs.
- Seconds: Classmap is ready. Class name completion covers the full project. Cross-file resolution is O(1).
- Under a minute: Full AST parse complete. Workspace symbols, fast find-references (no on-demand scanning), rich hover on completion items.
Each stage improves on the last without blocking the previous one.
- Respect the priority system from X2: pause the second pass when higher-priority work arrives.
- Process user code first, then vendor.
- Report progress via
$/progresstokens so the editor can show "Indexing: 1,234 / 5,678 files". The$/progressinfrastructure (token creation, begin/report/end helpers) is already in place and used during workspace initialization. The second pass just needs to callprogress_reportas it processes each file.
Currently we store ClassInfo, FunctionInfo, and SymbolMap
structs that are not as lean as they could be. For a 21K-file
codebase, full indexing will use meaningful RAM. This is acceptable
because it's an opt-in mode, but we should profile and trim struct
sizes over time. The aim is to stay under 512 MB for a full project.
The performance prerequisites above (P1 Arc<ClassInfo>,
Arc<String>, Arc<SymbolMap>) directly reduce memory usage by
sharing data across the ast_map, caches, and snapshot copies instead
of deep-cloning each. These should be measured before and after to
validate the 512 MB target.
With the full index populated, workspace/symbol becomes a simple
filter over the ast_map and global_functions maps. No additional
infrastructure needed.
In other modes, workspace symbols still works but only returns results
from already-parsed files (opened files, on-demand resolutions, stubs).
When the user invokes workspace symbols outside of full mode, show a
one-time hint suggesting they enable strategy = "full" in
.phpantom.toml for complete coverage.
Impact: Medium · Effort: Medium
All progress indicators in PHPantom currently report coarse milestone percentages (e.g. 10%, 20%, 70%) or just begin/end with no intermediate updates. On large codebases the user sees the bar jump from 0% to "done" with no feedback in between. This affects three areas:
- Workspace indexing.
init_single_projectreports four hardcoded milestones (10/20/70/100). The actual scanning work (build_self_scan_composer,scan_autoload_files) runs between those milestones with no per-file reporting. - Monorepo indexing.
init_monorepodivides 10..80 across subprojects, which gives per-subproject granularity. But within each subproject the scanning is opaque. The loose-file scan (80..95) has no file-level reporting either. When new subprojects are discovered during a directory walk, the total should grow dynamically so the percentage reflects actual progress. - Go to Implementation and Find References. These report
begin/end only. The underlying scans (
find_implementors,find_member_references,find_class_references) iterate over classmap files, ast_map entries, and PSR-4 directories, all of which have known or discoverable totals.
Introduce a lightweight progress callback that the scanning functions accept optionally:
/// Progress callback: (completed, total, phase_label).
/// `total` may increase during scanning as new files are
/// discovered (e.g. PSR-4 directory walk).
type ProgressFn = dyn Fn(u32, u32, &str) + Send + Sync;The caller (the async handler) creates the callback, which
captures the progress token and a last_report timestamp. The
callback checks Instant::elapsed and only sends a
WorkDoneProgressReport when at least 100 ms have passed since
the last report, avoiding notification spam.
build_self_scan_composerand the parallel scan helpers (scan_files_parallel_classes,scan_files_parallel_full) accept an optional&ProgressFn. The total is the number of files to scan (known up front from the directory walk). Each file processed increments the completed count.scan_autoload_filesreports per-file progress through the same callback.- For monorepo mode, the per-subproject percentage range (currently 10..80) can be subdivided by file count within each subproject. If the loose-file scan discovers additional files beyond the initial estimate, the total grows and the percentage recalculates.
find_implementors has five phases with different totals:
Phase 1 (ast_map), Phase 2 (class_index), Phase 3 (classmap
files), Phase 4 (stubs), Phase 5 (PSR-4 walk). The total for
each phase is known before iteration begins. Report progress
within each phase and allocate percentage ranges across phases
proportional to expected cost (Phase 3 dominates).
find_member_references, find_class_references, and
find_function_references iterate over user_file_symbol_maps().
The snapshot size is known up front. Report per-file progress
with 100 ms throttling.
The progress callback must work from both the async runtime and
spawn_blocking threads. Since the callback captures a
tokio::sync::mpsc::Sender or uses std::sync::Mutex for the
timestamp and token, it can be invoked from any thread. The
actual notification send happens on the async side (the receiver
drains the channel and calls progress_report).
A simpler alternative: since GTI and Find References already run
on spawn_blocking, the callback can write to a shared
Arc<AtomicU32> for completed/total, and a separate
tokio::spawn task polls it every 100 ms and sends reports.
This avoids threading async senders into sync code.
Goal: Persist the full index to disk so that restarts don't require a full rescan.
Only if X4 background indexing is slow enough on cold start that users complain. Given that:
- Mago can lint 45K files in 2 seconds.
- A regex classmap scan over 21K files should be sub-second.
- Full AST parsing of a few thousand user files should take single digit seconds.
...disk caching may never justify its complexity. The primary use case would be memory savings (load from disk on demand instead of holding everything in RAM), not startup speed.
bincode/postcard: simple, small dependency footprint, tolerant of struct changes (deserialization fails gracefully instead of reading garbage memory). The right default choice.- SQLite: robust, queryable, but heavier than needed for a flat key-value store.
Zero-copy formats like rkyv are ruled out. They map serialized bytes
directly into memory as if they were the original structs, which means
any struct layout change between versions reads corrupt data. PHPantom's
internal types change frequently and will continue to do so. A cache
format that silently produces garbage after an update is worse than no
cache at all.
Store file mtime + content hash per entry. On startup, walk the
directory, compare mtimes, re-parse only changed files. This is
Libretto's IncrementalCache approach and it works well.
Implement disk caching only if:
- Full-mode cold start exceeds 10 seconds on a representative large codebase, AND
- The memory overhead of holding the full index exceeds the 512 MB target, or users on constrained systems report issues.
If neither condition is met, skip this phase entirely. Simpler is better.
Impact: Medium · Effort: Medium
The current lazy-loading design provides an implicit recency signal:
classes in ast_map were loaded because the developer interacted with
their file during this session (hovered, navigated, completed). Source
tiers 0 (use-imported) and 1 (same-namespace) already capture this
for the current file's neighborhood. The class_index source captures
cross-file interactions (go-to-definition, hover, or completion that
triggered a load).
This implicit signal works because unloaded classes are in a separate bucket (classmap/stubs, tier 2) with lower priority. If the project moves to eager indexing (X4, parsing all files at startup), every class would appear equally "loaded" and the tier distinction would collapse. The same-namespace tier would contain every class in the namespace, not just the ones the developer recently touched.
When to implement: Only when eager loading (X4) ships. Until then, the implicit signal is sufficient and no explicit tracking is needed.
Design sketch:
-
Track accepted completions. When the editor sends
completionItem/resolveor the nextdidChangecontains text matching a recently offered completion, record the FQN and a timestamp. -
Track navigation. When go-to-definition or hover resolves a class, record the FQN.
-
Score decay. Use an exponential decay function so that a class used 5 minutes ago scores higher than one used 2 hours ago, but both score higher than one never interacted with.
-
Integration with sort key. The recency score could replace the source tier dimension (since tier 0/1/2 distinctions become less meaningful with full indexing) or be added as a new dimension between affinity and demotion. The sort_text scheme is documented in ARCHITECTURE.md § Class Name Sources and Priority.
-
Persistence. The recency table can be in-memory only (reset on server restart). Cross-session persistence is a nice-to-have but not essential; the affinity table already provides a good cold-start ordering.
Impact: Medium-High · Effort: Medium
Find-references currently scans every span in every file's SymbolMap
— O(total_spans_in_project) per request. For a 10K-file project
with ~500 spans/file that is ~5 million comparisons, each involving a
resolved_names lookup and string comparison. The result set is
typically tiny (tens of locations), so almost all work is wasted.
php-lsp builds a background reference index (Phase 3) that maps each symbol's FQN to its reference locations, giving O(k) lookups where k = number of actual references.
A new inverted index on Backend:
ref_index: HashMap<String, Vec<(String, u32, u32)>>
Key: target FQN (e.g. App\Models\User, App\Models\User::save).
Value: list of (file_uri, span_start, span_end) tuples.
-
Bulk build.
ensure_workspace_indexedalready parses every file and buildsSymbolMapentries with resolved names. After that loop completes, a second pass over all symbol maps populates the inverted index. This is the same work find-references does today, but done once instead of on every request. -
Incremental update.
update_ast_inneralready rebuilds a file'sSymbolMap. After replacing the symbol map entry, remove the old file's contributions from the inverted index and insert the new ones. Since entries are keyed by FQN, this is a scan of the old/new symbol map for the single changed file — negligible cost.
find_class_references, find_member_references, and
find_function_references look up the target FQN in the inverted
index. If the index is populated, return the stored locations
directly (converting span offsets to LSP ranges via the file content).
If the index is not yet built (workspace not indexed), fall back to
the current full-scan approach.
Benefits from X4 (full background indexing) — if the workspace is indexed eagerly, the inverted index is always warm. Without X4, the index is built lazily on first find-references and kept warm afterward.
After X4 ships (the index is most valuable when the workspace is fully indexed). Can also land independently — the lazy-build fallback makes it safe to ship without X4.