
feat: OpenKB MVP — Karpathy's LLM Knowledge Base, powered by PageIndex#2

Closed
KylinMountain wants to merge 70 commits into main from dev

Conversation

@KylinMountain
Collaborator

Summary

OpenKB is a CLI that implements Karpathy's LLM Knowledge Base workflow — drop documents in, get an auto-maintained, cross-linked wiki out.

Core Features

  • okb init — Interactive setup with model, language, PageIndex config
  • okb add — Two indexing paths: markitdown for short docs, PageIndex for long PDFs (local or cloud)
  • okb query — Streaming Q&A with tool call visibility, PageIndex cloud streaming for long docs
  • okb watch — Filesystem watcher with debounce for auto-compilation
  • okb lint — Structural checks (broken links, orphans, index sync) + LLM knowledge checks
  • okb list / status — Document, summary, concept, and report overview

Architecture

  • Short docs (PDF < 50 pages, docx, html, etc.) → pymupdf dict-mode conversion with inline images → LLM compiles wiki
  • Long docs (PDF ≥ 50 pages) → PageIndex tree index with summaries + text → LLM compiles from summaries
  • Wiki structure: sources/, summaries/, concepts/, explorations/, reports/, index.md, log.md, AGENTS.md
  • PageIndex Cloud support via PAGEINDEX_API_KEY with streaming query
  • Obsidian compatible — plain .md files with [[wikilinks]]

Tech Stack

PageIndex, markitdown, OpenAI Agents SDK, LiteLLM, Click, watchdog

Test Plan

  • 145 unit tests passing
  • E2E: okb add short PDF ("Attention Is All You Need") — images inline
  • E2E: okb add long PDF (Introduction to Agents, 54 pages) — PageIndex tree + images
  • E2E: okb add docx
  • E2E: okb query --save with streaming output
  • E2E: okb lint structural + knowledge checks
  • E2E: PageIndex cloud streaming query

Karpathy's LLM Knowledge Base workflow powered by PageIndex for long
document understanding. Covers architecture, two indexing paths
(markitdown for short docs, PageIndex for long docs), wiki compilation
via single LLM agent session with prompt caching, Q&A, watch mode,
linting, CLI commands, and error handling.

Sets up pyproject.toml (hatchling, direct-refs allowed, Python >=3.11),
.gitignore, openkb/__init__.py, a Click CLI stub with all 7 commands
(init, add, query, watch, lint, list, status), and tests/conftest.py
with kb_dir and sample_tree fixtures. Package installs cleanly in a
Python 3.12 venv; okb --help shows all commands; pytest collects 0
tests without error.

Add openkb/config.py (DEFAULT_CONFIG, load_config, save_config),
openkb/state.py (HashRegistry with SHA-256 file hashing and JSON
persistence), and openkb/schema.py (SCHEMA_MD constant). All 17 tests
written first (red) then implemented (green).
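
A registry of this kind might look roughly like the sketch below. Only the class name `HashRegistry`, the SHA-256 hashing, and the JSON persistence come from the commit message; the method names (`hash_file`, `is_known`, `add`) and the on-disk layout are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


class HashRegistry:
    """Sketch: maps SHA-256 file digests to metadata, persisted as JSON."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.entries: dict[str, dict] = {}
        if path.exists():
            self.entries = json.loads(path.read_text(encoding="utf-8"))

    @staticmethod
    def hash_file(file_path: Path) -> str:
        # Read in chunks so large PDFs are not loaded into memory at once.
        digest = hashlib.sha256()
        with file_path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(65536), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def is_known(self, file_hash: str) -> bool:
        return file_hash in self.entries

    def add(self, file_hash: str, meta: dict) -> None:
        # Persist immediately so a later run sees the entry.
        self.entries[file_hash] = meta
        self.path.write_text(json.dumps(self.entries, indent=2), encoding="utf-8")
```

In a layout like this, the caller decides when `add` runs, which matters for retries: registering a hash before downstream work has succeeded would cause the file to be skipped on the next attempt.
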
Creates full KB directory structure (raw/, wiki/sources/images/,
wiki/summaries/, wiki/concepts/, wiki/reports/), writes SCHEMA.md,
index.md, config.yaml and hashes.json; guards against re-initialisation.
Three tests in tests/test_cli.py cover structure, schema content, and
the already-initialized guard, all via CliRunner.isolated_filesystem.
Implements extract_base64_images and copy_relative_images with full test
coverage for single/multiple images, invalid base64, missing files, and
URL filtering.
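
As a rough illustration of the base64 case, the sketch below pulls inline `data:` images out of Markdown and rewrites the links to relative files. The signature, the `img_N` naming scheme, and the `images/` link prefix are assumptions, not the actual implementation.

```python
import base64
import binascii
import re
from pathlib import Path

# Matches ![alt](data:image/<ext>;base64,<payload>) and nothing else,
# so ordinary URL images are left alone.
_DATA_URI = re.compile(r"!\[([^\]]*)\]\(data:image/(\w+);base64,([^)]+)\)")


def extract_base64_images(markdown: str, images_dir: Path, prefix: str = "img") -> str:
    """Write each valid inline image to images_dir and link to it instead."""
    images_dir.mkdir(parents=True, exist_ok=True)
    counter = 0

    def _swap(match: re.Match) -> str:
        nonlocal counter
        alt, ext, payload = match.groups()
        try:
            data = base64.b64decode(payload, validate=True)
        except (binascii.Error, ValueError):
            return match.group(0)  # invalid payload: keep the original text
        counter += 1
        name = f"{prefix}_{counter}.{ext}"
        (images_dir / name).write_bytes(data)
        return f"![{alt}](images/{name})"

    return _DATA_URI.sub(_swap, markdown)
```
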
Implements ConvertResult dataclass, get_pdf_page_count, and
convert_document with hash-dedup, markdown passthrough, PDF long-doc
detection, MarkItDown conversion, and image extraction integration.
Implements render_source_md and render_summary_md with YAML frontmatter,
recursive heading hierarchy (h1–h6 capped), page ranges, and separate
text/summary views for source and summary wiki pages.
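
The capped heading hierarchy can be sketched as a short recursive renderer. The node field names (`title`, `start_page`, `end_page`, `summary`, `children`) are assumptions for illustration and may not match the PageIndex tree format; YAML frontmatter is omitted.

```python
def _render_lines(nodes: list[dict], depth: int) -> list[str]:
    lines: list[str] = []
    for node in nodes:
        level = min(depth, 6)  # Markdown stops at h6, so deeper nodes reuse h6
        start, end = node.get("start_page"), node.get("end_page")
        pages = f" (pp. {start}-{end})" if start is not None and end is not None else ""
        lines.append(f"{'#' * level} {node['title']}{pages}")
        if node.get("summary"):
            lines.append(node["summary"])
        lines.append("")
        lines.extend(_render_lines(node.get("children", []), depth + 1))
    return lines


def render_tree(nodes: list[dict]) -> str:
    """Render a nested section tree as Markdown headings with page ranges."""
    return "\n".join(_render_lines(nodes, 1)).rstrip() + "\n"
```
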
Implements IndexResult dataclass and index_long_document which creates
a LocalClient with full node text/summary/description flags, adds the
PDF via PageIndex, fetches structure, and writes source and summary
wiki pages via the tree renderer.

Implements list_wiki_files, read_wiki_file, and write_wiki_file as plain
functions in openkb/agent/tools.py without @function_tool decoration,
ready to be wrapped when building the agent. Full test coverage including
edge cases for missing files/dirs, filtering to .md only, and parent dir
creation.

Implements build_compiler_agent, compile_short_doc, compile_long_doc in
openkb/agent/compiler.py with function_tool-wrapped wiki tools and
SCHEMA_MD-enriched instructions. Long-doc variant includes get_page_content.
Tests mock Runner.run to avoid real LLM calls.

Replaces the add stub with full orchestration: convert_document,
index_long_document for long PDFs, and compiler agent calls.
Adds SUPPORTED_EXTENSIONS set, _find_kb_dir, _add_single_file helpers.
Adds python-dotenv dependency and load_dotenv() at startup.

Implements pageindex_retrieve (structure -> LLM relevance -> page fetch),
build_query_agent with list/read/retrieve tools, and run_query coroutine.
Wires up `okb query` in cli.py.

Implements DebouncedHandler (collects events, ignores dirs/dotfiles, resets
timer on burst) and watch_directory (Observer loop, Ctrl+C safe).
Wires up `okb watch` in cli.py.
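
The debounce behaviour described here (collect events, skip directories and dotfiles, restart the timer on every burst) can be sketched with the standard library alone. This is not the watchdog-based `DebouncedHandler` itself: the class name `Debouncer` and its `notify` method are hypothetical stand-ins for whatever the real event callbacks do.

```python
import os
import threading
from typing import Callable


class Debouncer:
    """Collect paths and fire one callback after `delay` seconds of quiet."""

    def __init__(self, delay: float, callback: Callable[[set[str]], None]) -> None:
        self.delay = delay
        self.callback = callback
        self._pending: set[str] = set()
        self._timer: threading.Timer | None = None
        self._lock = threading.Lock()

    def notify(self, path: str, is_directory: bool = False) -> None:
        # Ignore directory events and dotfiles (editor swap files, .DS_Store, ...).
        if is_directory or os.path.basename(path).startswith("."):
            return
        with self._lock:
            self._pending.add(path)
            if self._timer is not None:
                self._timer.cancel()  # a new event in the burst resets the timer
            self._timer = threading.Timer(self.delay, self._flush)
            self._timer.daemon = True
            self._timer.start()

    def _flush(self) -> None:
        with self._lock:
            batch, self._pending = self._pending, set()
            self._timer = None
        if batch:
            self.callback(batch)
```

In a watchdog setup, an event handler's callbacks would presumably forward `event.src_path` and `event.is_directory` into something like `notify`.
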
Implements find_broken_links, find_orphans, find_missing_entries,
check_index_sync, and run_structural_lint with full Markdown report.
Covers wikilink resolution, orphan detection, raw/wiki entry matching,
and index.md sync checking.
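
A broken-link check over `[[wikilinks]]` might be sketched as follows. The function name matches the commit message, but the signature, the return shape, and the link-resolution rule (targets are paths relative to the wiki root, without `.md`) are assumptions.

```python
import re
from pathlib import Path

# Matches [[target]], [[target|label]] and [[target#heading]]; captures only the target.
_WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def find_broken_links(wiki_dir: Path) -> list[tuple[str, str]]:
    """Return (page, target) pairs whose target has no matching .md file."""
    md_files = sorted(wiki_dir.rglob("*.md"))
    known = {p.relative_to(wiki_dir).with_suffix("").as_posix() for p in md_files}
    broken: list[tuple[str, str]] = []
    for page in md_files:
        text = page.read_text(encoding="utf-8")
        for match in _WIKILINK.finditer(text):
            target = match.group(1).strip().removesuffix(".md")
            if target not in known:
                broken.append((page.relative_to(wiki_dir).as_posix(), target))
    return broken
```
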
Implements build_lint_agent with list/read tools and instructions for
semantic quality checks (contradictions, gaps, staleness, redundancy).
run_knowledge_lint runs the agent and returns the report string.
okb lint combines structural + knowledge lint and writes timestamped report.

Tests verify list shows documents table and concepts, status shows
per-directory file counts and total indexed. Both check missing-init guard.

Previously the converter registered the file hash immediately, so if
LLM compilation failed the file was marked as "done" and retries
would skip it. Now the hash is only registered by the CLI after
successful compilation.

Also: install markitdown[all] for PDF support, add python-dotenv.
…pport

- Switch from col._backend.get_document_structure() to col.get_document_structure()
- Add 3x retry for PageIndex indexing (stochastic TOC accuracy)
- Fix storage path to use .db extension
- Remove .doc from supported extensions (markitdown only supports .docx)
- Note: col.get_page_content() still missing from PageIndex public API,
  using col._backend.get_page_content() as workaround
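
The 3x retry mentioned above could be expressed with a small generic wrapper. The helper name `with_retries` is hypothetical, and the sketch makes no PageIndex calls; the operation to retry is passed in as a callable.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(operation: Callable[[], T], attempts: int = 3) -> T:
    """Call operation up to `attempts` times and return the first success."""
    last_error: Exception | None = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
    raise RuntimeError(f"operation failed after {attempts} attempts") from last_error
```
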
Replace col._backend.get_page_content(col._name, doc_id, spec) with
col.get_page_content(doc_id, spec). Now all PageIndex access uses
public API only.
@KylinMountain
Collaborator Author

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

KylinMountain added a commit that referenced this pull request May 24, 2026
Architectural review (4 parallel Opus auditors) found that the skill_runner
core was already generic, but the deck SURFACE was still fused to
Editorial Monocle. Fixed:

* validator: now takes optional `grammar` param (DeckGrammar TypedDict);
  skill-agnostic by default (only checks file present, parses, ≥5
  slides, self-contained). Third-party deck skills (guizang, swiss)
  now pass validation cleanly. Editorial-specific rules opt-in via
  `EDITORIAL_MONOCLE_GRAMMAR`. (finding #2)
* skills/openkb-deck-editorial/SKILL.md: declares its grammar +
  output_path_template under `od:` frontmatter — `run_skill` reads
  these and applies them post-run.
* run_skill: now honors frontmatter `od.mode`, `od.output_path_template`,
  `od.deck_grammar`. When mode=="deck" and template is set, the runner
  injects the path into intent, verifies the file exists post-run, and
  runs validate_deck with the skill's grammar. Validation result is
  returned via new SkillRunResult dataclass. (findings #4, #5)
* `openkb deck new --skill <name>`: CLI flag accepts any installed deck
  skill (default openkb-deck-editorial). guizang and swiss now usable
  from the scripted CLI, not only freeform chat. (finding #1)
* `/deck new --skill <name>` chat slash: same flag, parsed positionally
  alongside --critique. (finding #1)
* tests/test_read_kb_file.py: 13 new tests mirroring test_write_kb_file
  for the read-side allow-list. Pins refusal of `.openkb/config.yaml`,
  `.env`, `raw/`, `..` traversal, absolute paths. (finding #6)
* Generator deck branch: no longer calls validate_deck directly; just
  propagates run_deck_create's SkillRunResult.validation up. Validation
  is now a property of "this skill declared mode=deck", not of "this
  CLI path was taken".

Existing tests updated:
* tests/test_deck_validator.py: explicit grammar arg on Editorial-
  specific tests; added test_guizang_shape_passes_generic_mode +
  test_missing_cover_ignored_in_generic_mode to pin both modes.
* tests/test_deck_creator.py: mocks return SkillRunResult; new
  test_run_deck_create_honors_skill_name_override for --skill flag.
* tests/test_generator.py: deck dispatch test mocks SkillRunResult.

Below-threshold findings deferred:
* Generator if/else → registry (score 70) — works, just not extensible
  via plugin; future.
* Iteration backup in chat freeform path (score 75) — needs write_kb_file
  hook; separate change.
* run_skill / scan_local_skills / _handle_slash_critique direct tests
  (scores 60-70) — covered indirectly by integration; can add later.

Regression: 538 tests pass (was 523 pre-fix; net +15 = 13 new
read_kb_file tests + 2 new validator-mode tests).
KylinMountain added a commit that referenced this pull request May 31, 2026
…lint/status/skill-gate/linter

Add PAGE_CONTENT_DIRS and INDEX_SEED to openkb/schema.py as the single
source of truth; replace duplicated index-seed literals in cli init and
compiler._update_index with INDEX_SEED.

- openkb list / chat /list: add an Entities section (#2)
- lint.check_index_sync: iterate PAGE_CONTENT_DIRS so entities/ pages
  missing from index.md are flagged (#4)
- skill-new gate: count entities/ as compiled content (#5)
- status last-compile: derive from summaries/concepts/entities mtimes (#12)
- semantic linter: read entities/, check contradictions/redundancy/
  coverage/orphans (#3)
KylinMountain added a commit that referenced this pull request Jun 1, 2026
…mpile backfill (#78)

* feat(compiler): _read_entity_briefs for entity plan context

* test(compiler): parity tests for _read_entity_briefs

* feat(compiler): _write_entity with type/aliases frontmatter

* test(compiler): assert source ordering in _write_entity; count=1 in _set_fm_line

Add explicit ordering assertion in test_update_prepends_source_keeps_type
verifying the deterministic json.dumps form ("summaries/b.md", "summaries/a.md").
Pass count=1 to re.sub in _set_fm_line to make first-occurrence intent explicit.
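
To illustrate what `count=1` makes explicit, here is a minimal sketch of a frontmatter line setter. The name `set_fm_line` and its signature are assumptions; the real `_set_fm_line` may differ.

```python
import re


def set_fm_line(text: str, key: str, value: str) -> str:
    """Replace the first `key: ...` line only; later matches are left alone."""
    pattern = rf"^{re.escape(key)}:.*$"
    # A callable replacement avoids backslash escapes in `value` being interpreted.
    return re.sub(pattern, lambda _m: f"{key}: {value}", text, count=1, flags=re.MULTILINE)
```

Without `count=1`, a body line that happens to start with the same key would be rewritten as well.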

* feat(lint): include entities/ in wikilink whitelist

* feat(compiler): summary<->entity backlinks

* test(compiler): restore assertion erroneously deleted in 3c8aa93

* feat(compiler): index.md Entities section

* feat(compiler): remove_doc_from_entity_pages + index cleanup

* feat(compiler): plan prompt + parser for entities group

Also wires the entity track into _compile_concepts (Tasks 7 + 8 combined,
since the {entity_briefs} placeholder and the _CONCEPTS_PLAN_USER.format call
are co-dependent — splitting would leave an intermediate red state).

- add _ENTITY_TYPES, _filter_entity_items, _parse_entities_plan
- rewrite _CONCEPTS_PLAN_USER to request nested concepts+entities groups
- add _ENTITY_PAGE_USER / _ENTITY_UPDATE_USER prompts
- read entity briefs and pass both briefs to the plan prompt
- parse nested 'concepts' group with legacy flat-list/flat-dict fallbacks
- generate entities in their own asyncio.gather (4-arity tuples)
- strip ghost links + _write_entity each; handle entity related cross-links
- backlink summary<->entities; pass entity_names/entity_meta to _update_index

* fix(compiler): related entities must not downgrade index labels

Mirror the concept track: collect related-entity slugs into a separate
local list used only for backlinks; pass only created/updated entity_names
(+entity_meta) to _update_index.  Defense-in-depth in _update_index: only
_replace_section_entry when name is in entity_meta, otherwise only insert
if the link is absent, so a related-only entity can never clobber a
pre-existing correct (type + brief) index line with "(other)".

Adds regression test test_related_entity_does_not_downgrade_index_label.

* feat(schema): declare entities/ page type and taxonomy

* feat(query): point who/what questions at entities/

* docs(readme): document entities/ page type

* feat(cli): scaffold entities/ in init and count it in status

- `openkb init` now creates wiki/entities/ alongside wiki/concepts/
- init seed index.md gains ## Entities between ## Concepts and ## Explorations,
  matching the _update_index template in compiler.py
- print_status subdirs list gains "entities" after "concepts"
- Tests updated: assert wiki/entities/ exists and index.md contains ## Entities;
  status test asserts "entities" appears in output

* fix(compiler): resolve entity-page review findings (dangling links + dedup)

Addresses code-review findings on the entity-pages feature:

- Fix dangling wikilink after `openkb remove`: entity removal now strips
  standalone `See also: [[summaries/{doc}]]` lines (the related-entity
  backlink form), matching the concept path, and cli.py adds modified
  entity pages to the lint sweep scope so surviving pages are cleaned.
- Unify the parallel concept/entity helpers into shared cores
  (_backlink_summary_pages, _backlink_pages, _remove_doc_from_pages) with
  thin per-type wrappers, so cleanup logic can no longer drift between the
  two page types (this is what caused the dangling-link bug).
- Route related-entity cross-refs through _add_related_link (now page-type
  aware) instead of an inline reimplementation — removes a duplicate file
  read/write and keeps backlink creation symmetric with teardown.
- Centralize the entity-type enum: prompts derive their type list from a
  single _ENTITY_TYPE_LIST source via import-time substitution.
- Count entity items in the "all dropped as malformed" plan warning.
- Drop the unreachable else branch in _update_index's entity loop.
- Add regression test for the See-also strip on a surviving entity page.

All 542 tests pass.

* fix(compiler): add [[entities/X]] whitelist rule + restore concept-topic guard

Remaining review findings after a7a06ed:
- _KNOWN_TARGETS_USER now states the [[entities/Z]] rule, so entity links
  the LLM is told to write aren't silently stripped as ghosts.
- Restore the dropped 'Do NOT create concepts that are just the document
  topic itself' plan rule to prevent redundant title-mirror concepts.

* feat(entities): shared page-dir constants + surface entities in list/lint/status/skill-gate/linter

Add PAGE_CONTENT_DIRS and INDEX_SEED to openkb/schema.py as the single
source of truth; replace duplicated index-seed literals in cli init and
compiler._update_index with INDEX_SEED.

- openkb list / chat /list: add an Entities section (#2)
- lint.check_index_sync: iterate PAGE_CONTENT_DIRS so entities/ pages
  missing from index.md are flagged (#4)
- skill-new gate: count entities/ as compiled content (#5)
- status last-compile: derive from summaries/concepts/entities mtimes (#12)
- semantic linter: read entities/, check contradictions/redundancy/
  coverage/orphans (#3)

* feat(entities): remove preview lists entity-page actions (#1)

The dry-run/confirmation block now scans wiki/entities/ with the same
frontmatter sources: logic as concepts, emits DELETE/MODIFY action lines
per entity page, and prints an 'N entity(s) will be DELETED' summary.
Execution path (remove_doc_from_entity_pages) unchanged.

* docs(entities): document entity pages in shipped openkb skill (#8)

Note wiki/entities/ holds named-thing pages (people/orgs/places/
products/works/events) with a type: frontmatter field, that index.md
has a ## Entities section, and that 'who/what is X' questions should
read the matching entities/ page first.

* fix(compiler): don't write raw JSON body on empty LLM content

In the parse-succeeded branch of _gen_create/_gen_update/_gen_entity_create/
_gen_entity_update, fall back to "" instead of the raw JSON string when the
content field is empty/null. _require_nonempty_content then raises and the
page is dropped, rather than writing the JSON envelope as the markdown body.
The parse-FAILED (except) branch keeps content=raw as the legitimate
non-JSON fallback.
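
The two branches can be sketched like this. The helper names are hypothetical stand-ins for the `_gen_*` functions and `_require_nonempty_content`, and returning "" for JSON that parses to a non-object is an assumption of this sketch.

```python
import json


def extract_page_body(raw: str) -> str:
    """Pull the markdown body out of an LLM response that should be JSON."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return raw  # parse failed: treat the whole response as markdown
    if isinstance(parsed, dict):
        content = parsed.get("content")
        # Parse succeeded: never fall back to the raw JSON envelope.
        return content if isinstance(content, str) else ""
    return ""  # assumption: valid JSON that is not an object carries no body


def require_nonempty_content(body: str) -> str:
    if not body.strip():
        raise ValueError("empty page content; page is dropped")
    return body
```
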

* fix(compiler): graceful scalar plan + rebuild malformed entity frontmatter

- _compile_concepts: guard a non-dict/non-list parsed plan (JSON scalar)
  before calling .get(), taking the empty-plan path (write v1 summary if
  applicable + update index + return) instead of risking AttributeError.
- _write_entity: when an existing page has an opening --- but no closing
  delimiter (or no frontmatter), rebuild valid sources/type/brief frontmatter
  rather than writing a body-only page that drops the metadata.

* fix(compiler): keep ## Entities before ## Explorations; drop dead param + overlap gathers

- _update_index: insert ## Entities before ## Explorations on older index.md
  files that predate the section (new _ensure_h2_section_before helper),
  preserving canonical order instead of appending at EOF.
- _filter_entity_items: drop the unused 'label' parameter and update call
  sites in _parse_entities_plan.
- _compile_concepts: overlap concept and entity generation in one outer
  asyncio.gather (they share cached context and the same concurrency
  semaphore); result/error handling per list is unchanged.
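
The ordering fix can be illustrated with a line-based helper. This is a sketch under assumptions: the public name `ensure_h2_section_before` and the exact whitespace handling are not taken from the code base.

```python
def ensure_h2_section_before(text: str, heading: str, before: str) -> str:
    """Insert `heading` just above `before`; append at the end if `before` is absent."""
    lines = text.splitlines()
    if heading in lines:
        return text  # already present: leave the file untouched
    if before in lines:
        idx = lines.index(before)
        lines[idx:idx] = [heading, ""]
    else:
        while lines and lines[-1] == "":
            lines.pop()
        lines.extend(["", heading])
    return "\n".join(lines) + "\n"
```
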

* test(compiler): cover empty-content skip, scalar plan, malformed entity FM, Entities order

Add regression tests for the four compiler fixes:
- empty {"content":""} response skips the page (no raw JSON body)
- JSON scalar plan handled gracefully (no AttributeError)
- _write_entity rebuilds frontmatter when closing --- is missing
- _update_index inserts ## Entities before ## Explorations

* fix(compiler): silence spurious 'hand-edited' warning on backlink section creation

_backlink_summary_pages / _backlink_pages create ## Entities / ## Related
Documents sections as a normal first-time operation; pass quiet=True so
_ensure_h2_section no longer logs the index-drift warning in that case.
Index-repair callers keep the warning.

* feat(cli): add `recompile` command to re-run compile on indexed docs

Re-runs the current compile_short_doc/compile_long_doc pipeline on
already-indexed docs so pre-feature KBs gain the entities/ layer and
refresh to the current format. Reuses on-disk sources/summaries and the
registry's PageIndex doc_id — does not re-index or re-convert.

Supports a positional <doc_name> (resolved via _resolve_doc_identifier)
or --all (with a regeneration-warning confirmation, bypassed by --yes),
--dry-run (enumerate only, no LLM calls/writes), and --refresh-schema
(back up + overwrite wiki/AGENTS.md when it differs from AGENTS_MD).
Processes docs sequentially with per-doc progress, skips+warns on
missing sources / summaries / doc_id, prints a recompiled/skipped
summary, and appends a recompile entry to log.md.

* test(cli): recompile dispatch/dry-run/skip/refresh-schema

* docs(readme): document openkb recompile

* fix(cli): recompile --refresh-schema no-ops when AGENTS.md absent; tighten guard tests

Match the spec (and the helper's own docstring): _refresh_schema returns
early when wiki/AGENTS.md is missing rather than materializing the default
(get_agents_md already falls back to it at runtime). Tighten the doc/--all
guard tests to assert the exact message + that no compile runs, and add the
missing-AGENTS.md no-op test.

* fix(compiler): drop non-existent 'related' slugs so they don't create dangling links

The plan's 'related' list is meant to reference existing pages, but the LLM
sometimes lists slugs for pages that don't exist. Those were added to the
wikilink whitelist (so body references survived ghost-stripping) and
back-linked into the summary's Related section, yet no page was ever created
(related items are linked, never generated) — producing a flood of broken
[[concepts/...]] / [[entities/...]] links (esp. on feature-dense docs).
Filter related_items / entity_related to slugs that exist on disk.
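
The filter amounts to an existence check per slug. A minimal sketch, assuming pages live at `<wiki>/<subdir>/<slug>.md` and using a hypothetical helper name:

```python
from pathlib import Path


def existing_slugs(wiki_dir: Path, subdir: str, slugs: list[str]) -> list[str]:
    """Keep only slugs that already have a page on disk, preserving order."""
    return [slug for slug in slugs if (wiki_dir / subdir / f"{slug}.md").is_file()]
```
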

* fix: remove-preview detects JSON-quoted sources; _write_entity preserves sources on malformed FM

- remove --dry-run preview parsed the sources list with a hand-rolled comma
  split that kept JSON quotes (["summaries/x.md"]), so the marker never
  matched and the preview always reported 0 affected concept/entity pages
  (executor was correct). Extract _scan_affected_pages using the real
  _parse_yaml_list_value; dedups the two copied scan loops too.
- _write_entity's malformed-frontmatter rebuild seeded sources with only the
  new doc, dropping prior sources for multi-source entities. Recover existing
  sources from the broken block and merge.
Both bugs were masked by tests using unquoted / single-source fixtures.
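
The preview bug is easy to reproduce in isolation. In the sketch below, `parse_sources_naive` mimics a hand-rolled comma split and `parse_sources` is a hypothetical stand-in for a parser that understands the JSON-quoted form; neither is the actual `_parse_yaml_list_value`.

```python
import json


def parse_sources_naive(value: str) -> list[str]:
    # Splits on commas but keeps the surrounding JSON quotes.
    return [part.strip() for part in value.strip("[]").split(",") if part.strip()]


def parse_sources(value: str) -> list[str]:
    value = value.strip()
    try:
        parsed = json.loads(value)
        if isinstance(parsed, list):
            return [str(item) for item in parsed]
    except json.JSONDecodeError:
        pass
    # Fallback for unquoted lists such as [summaries/x.md, summaries/y.md].
    return [part.strip().strip("\"'") for part in value.strip("[]").split(",") if part.strip()]
```

With the JSON form `["summaries/x.md"]`, the naive version yields `'"summaries/x.md"'` (quotes included), so a membership test against `summaries/x.md` never matches.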

* feat(cli): rename remove --keep-empty-concepts → --keep-empty (covers entities too)

This PR wired entity pages into 'openkb remove', so the flag now governs
concept AND entity retention — but the name still said 'concepts'. Make
--keep-empty the canonical name (clear that it covers both), keep
--keep-empty-concepts as a backward-compatible alias, and update the
preview/summary messages, docstring, and README accordingly.

* feat(compiler): config-driven entity types (entity_types overrides the default enum)

Add an optional 'entity_types:' key in .openkb/config.yaml. When present it
overrides the default person/organization/place/product/work/event/other
vocabulary everywhere — the plan prompt, the entity-page prompts, and
create/update validation/coercion; when absent, behavior is byte-identical.

Prompt templates keep an __ENTITY_TYPES__ token now substituted at call time
(per-KB) inside _compile_concepts, and the resolved valid-type set is threaded
into _parse_entities_plan / _filter_entity_items and the _gen_entity_* coercion.
'other' is always ensured as the coercion fallback; malformed config falls back
to the default with a warning. Documented in config.yaml.example + README.

* fix(compiler): harden config-driven entity types (crash-proof + complete the override)

Review of the config-entity-types feature surfaced two real issues:
- A config 'entity_types' value containing '{' or '}' was substituted into the
  prompt template BEFORE .format() ran → KeyError/ValueError crashing every
  compile. Swap to format-then-replace at all 3 call sites (types_str is now an
  inert literal), and sanitize resolved types to a safe label charset (also
  skips YAML nulls/ints so str(None) can't become the type 'none').
- The AGENTS_MD system schema hardcoded 'type: is one of: <7 defaults>',
  contradicting a custom entity_types in the higher-weight system message.
  Reword it to frame those as the configurable default and defer the
  authoritative set to the compilation prompt (which is config-driven).
Also drop the now-dead _ENTITY_TYPES_STR + its stale import-time-substitution
comment. +2 regression tests (sanitization; brace-in-type doesn't crash).
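
The crash and the fix come down to the order of two string operations. The template text and function names below are made up for illustration; only the `__ENTITY_TYPES__` token and the format-then-replace order come from the commit message.

```python
# Hypothetical prompt template: one str.format field plus the literal token.
TEMPLATE = "Document: {doc_name}\nAllowed entity types: __ENTITY_TYPES__\n"


def render_unsafe(doc_name: str, types: list[str]) -> str:
    # Replace first: braces inside `types` become part of the format string.
    return TEMPLATE.replace("__ENTITY_TYPES__", ", ".join(types)).format(doc_name=doc_name)


def render_safe(doc_name: str, types: list[str]) -> str:
    # Format first: the substituted types are never parsed by str.format.
    return TEMPLATE.format(doc_name=doc_name).replace("__ENTITY_TYPES__", ", ".join(types))
```

A type such as `{weird}` makes `render_unsafe` raise `KeyError`, and an unbalanced brace raises `ValueError`, while `render_safe` passes both through as inert text.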

* refactor: move entity-type resolution to config layer + co-locate remove-preview scan

Altitude cleanups from the review:
- Move resolve_entity_types + DEFAULT_ENTITY_TYPES into openkb/config.py (the
  config layer owns config validation/normalization; any command can reuse it
  without importing the heavy compiler module). compiler.py imports them;
  _ENTITY_TYPE_LIST/_ENTITY_TYPES remain as the default alias/validation set.
- Move the remove dry-run preview scan from cli.py into compiler.py as
  scan_affected_pages, beside remove_doc_from_*_pages and sharing
  _parse_yaml_list_value — so preview and executor can't drift on how the
  sources list is parsed (root cause of the earlier JSON-quote preview bug).

---------

Co-authored-by: Claude <noreply@anthropic.com>