
Karpathy LLM Knowledge Base — Obsidian Implementation

Overview

LLM-KB is an architecture where the LLM incrementally builds and maintains a persistent wiki — rather than re-deriving answers from raw documents at query time (as in RAG). Knowledge accumulates and compounds over time.

Coined/popularized by Andrej Karpathy (2026). Source: GitHub Gist

Division of labor:

| Role  | Job |
| ----- | --- |
| Human | Curate sources, ask questions, direct analysis |
| LLM   | Write wiki pages, update cross-references, maintain consistency, file answers |

Don't just do research; grow a research asset.

  • "Doing research" is one-off: you read a report, form a judgment, and that judgment gets buried in chat history.
  • "Growing an asset" is continuous: every report becomes part of the knowledge base, every judgment carries an evidence chain, and every market outcome updates your understanding of the company.

Examples:

| Doing research | Growing a research asset |
| -------------- | ------------------------ |
| Read a report, jot down the key points | Update the company page and link it to the topic pages |
| Buy with a rough idea in your head | Write the idea up as a thesis, listing evidence and counter-evidence |
| Price drops: "just hold on" | Check whether the thesis has been strengthened or weakened |
| Sell, done | Close out with a review page: why you bought, why you sold, what you learned |
| Look at the same stock again, starting from scratch | Open the company page; all the prior research is there |

Connect information, theses, quantitative validation, risk management, and execution into a single closed loop.

  • Information layer: structured organization of news, financial reports, and market data
  • Thesis layer: why bullish or bearish, what the evidence is, and whether it has gone stale
  • Validation layer: historical statistics, factor backtests, risk-exposure analysis
  • Execution layer: at what level to buy, how large a position, and under what conditions to exit
  • Review layer: after-the-fact tracking of whether your judgment held up

This is not a "stock-picking tool". It turns investment judgment from gut feel into a system that is trackable, verifiable, and iterable, so that understanding can compound.

Why Not RAG

| RAG | LLM-KB |
| --- | ------ |
| Re-discovers knowledge from scratch on every query | Knowledge accumulates over time |
| Requires vector DB + embedding infrastructure | Just markdown files in a folder |
| Passive — retrieval only | Active — LLM maintains the knowledge base |
| Subtle cross-document synthesis re-queried every time | Synthesis filed once, referenced forever |
| Not human-readable at the retrieval layer | Everything is human-readable and directly editable |

RAG: good for large, static document corpora you don’t want to re-process. LLM-KB: better for knowledge that accumulates, evolves, and requires repeated synthesis.


3-Layer Architecture

```
Layer 1: Raw Sources (Immutable)
  ├── Articles, papers, web clips, PDFs
  ├── LLM reads — NEVER edits
  └── Source of truth

Layer 2: Wiki Pages (LLM-owned)
  ├── Summaries, concept pages, entity pages, comparisons
  ├── LLM writes and maintains everything here
  └── Human can read, browse, and correct

Layer 3: Schema (Co-evolved)
  ├── Config doc: wiki structure, conventions, operations
  ├── Evolved by human + LLM over time
  └── Lives as SCHEMA.md or CLAUDE.md in the wiki root
```

Karpathy's system has three layers, very clearly separated:

  • Layer 1: Raw sources
    • Financial reports, research reports, news, earnings-call transcripts, market data, chat logs.
    • These are read-only: once they come in, they are never edited, and are kept forever as raw evidence.
  • Layer 2: Knowledge base (wiki)
    • This is the core. After the AI reads a raw source, it does not just write a summary and move on. Instead it:
    • Builds a "profile page" for each company: fundamentals, bull case, risks, catalysts, exit conditions
    • Builds a "topic page" for each investment theme: the AI compute cycle, the memory-market cycle, Hong Kong high-dividend stocks, Fed policy
    • Builds a "strategy page" for each strategy: which markets suit momentum, how to allocate an all-weather portfolio
    • Builds a "thesis page" for each position: why I bought, what the evidence is, and under what conditions the thesis fails
    • Most importantly, these pages link to each other. Micron's company page links to the memory-cycle topic page, the memory cycle links to the AI-compute topic page, and AI compute links to NVIDIA and TSMC. When a new earnings report arrives, you don't just update one page; you follow the links and update all the related pages in sync.
    • In Karpathy's own words: "A new source coming in may touch 10–15 wiki pages." This is what "knowledge compounding" means: not how much you store, but how much you connect.
  • Layer 3: Rules file
    • A manual that tells the AI how to maintain the knowledge base:
      • What format to use
      • What naming conventions to follow
      • How to process new sources
      • How often to run a "health check"
      • Which contradictions to flag
    • In effect, a set of research-discipline rules for the AI.

3 Core Operations

Ingest

Drop a new source → tell LLM to process it:

  1. Read source file
  2. Extract key information and entities
  3. Write or update wiki pages in pages/
  4. Update pages/index.md — add new pages with one-line summaries
  5. Append to log.md:
```markdown
## [YYYY-MM-DD] ingest | Source Title
- Pages created: [[Page Name]]
- Pages updated: [[Other Page]]
```

One source may touch 5–15 wiki pages — entities, concepts, topics, and cross-references. Prefer updating existing pages over creating new ones for incremental additions.
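The bookkeeping half of this procedure (step 5) is mechanical enough to script. A minimal sketch, assuming a `wiki/log.md` in the layout described later; `log_ingest` and its signature are illustrative, not part of Karpathy's gist:

```python
from datetime import date
from pathlib import Path

def log_ingest(wiki_root: str, source_title: str,
               created: list[str], updated: list[str]) -> str:
    """Append an ingest entry to log.md in the schema's prescribed format."""
    entry = (
        f"\n## [{date.today().isoformat()}] ingest | {source_title}\n"
        f"- Pages created: {', '.join(f'[[{p}]]' for p in created)}\n"
        f"- Pages updated: {', '.join(f'[[{p}]]' for p in updated)}\n"
    )
    log = Path(wiki_root) / "log.md"
    with log.open("a", encoding="utf-8") as f:   # append-only, never rewrite
        f.write(entry)
    return entry
```

In practice the LLM writes these entries itself; a helper like this is only useful for scripted batch ingests.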

For example: a new Micron quarterly report comes out. The AI doesn't just write a "Micron Q2 revenue beat" summary. It also checks: does the inventory data in this report agree with the "inventory is being worked down" claim on the memory-cycle topic page? Does this report strengthen or weaken the earlier bull thesis that "DRAM price trends are up"?

Query

Ask questions against the accumulated wiki:

  1. LLM reads index.md → identifies relevant pages
  2. Reads relevant pages
  3. Synthesizes answer with citations ([[wikilinks]])
  4. Files substantive answers back as new wiki pages — comparisons, analyses, and connections should not disappear into chat history

Output formats: markdown page, comparison table, Marp slide deck, matplotlib chart, Obsidian canvas.

Lint

Periodic health check (on-demand or scheduled):

  1. Find contradictions between pages
  2. Find orphan pages (no inbound links) — flag or connect
  3. Find concepts mentioned but lacking their own page — create stubs
  4. Find stale claims superseded by newer sources
  5. Suggest 2–3 new questions or sources to investigate
  6. Append to log.md:
```markdown
## [YYYY-MM-DD] lint | Wiki Health Check
- Issues found: ...
- Pages created: ...
- Recommendations: ...
```
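Step 2 of the lint pass (orphan detection) is easy to mechanize without the LLM. A sketch, assuming pages are `.md` files under `pages/` and links use `[[wikilink]]` syntax; `find_orphans` is a made-up helper name:

```python
import re
from pathlib import Path

# Capture the target of [[Page]], [[Page|alias]], or [[Page#heading]]
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def find_orphans(pages_dir: str) -> set[str]:
    """Return page names with no inbound links from other pages.

    index.md links to everything by design, so it is excluded
    both as a candidate and as a link source.
    """
    pages = {p.stem: p for p in Path(pages_dir).rglob("*.md")
             if p.name != "index.md"}
    inbound: set[str] = set()
    for stem, path in pages.items():
        for target in WIKILINK.findall(path.read_text(encoding="utf-8")):
            name = target.strip().split("/")[-1]   # [[AI/LLM]] -> "LLM"
            if name != stem:                        # ignore self-links
                inbound.add(name)
    return set(pages) - inbound
```

Whether a given orphan should be connected or flagged for deletion is still a judgment call for the LLM or the human.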

Two Special Files

index.md — Content Catalog

  • Every wiki page listed with a [[wikilink]] and one-line summary
  • Organized by category (Concepts, Security, Cloud, AI, etc.)
  • LLM reads this first on every query
  • Updated on every ingest — never let it go stale
```markdown
## AI & Machine Learning
| Page                   | Summary                                                       |
| ---------------------- | ------------------------------------------------------------- |
| [[AI/LLM]]             | LLMs: prompting, fine-tuning, OWASP LLM Top 10, RAG, security |
| [[LLM-Knowledge-Base]] | Karpathy's pattern: 3-layer wiki replacing RAG                |
```
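Because the index must never go stale, a lint pass can cross-check it against the filesystem. A minimal sketch (the helper name `stale_index_entries` is illustrative) that flags pages index.md does not yet link:

```python
from pathlib import Path

def stale_index_entries(pages_dir: str) -> set[str]:
    """Return pages on disk that index.md does not list as [[wikilinks]]."""
    root = Path(pages_dir)
    index = (root / "index.md").read_text(encoding="utf-8")
    return {
        p.stem for p in root.rglob("*.md")
        if p.name != "index.md"
        and f"[[{p.stem}]]" not in index      # flat link: [[Foo]]
        and f"/{p.stem}]]" not in index       # pathed link: [[AI/LLM]]
    }
```

Anything this returns after an ingest means step 4 of the ingest procedure was skipped.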

log.md — Activity Log

  • Append-only chronological record
  • Every ingest, query, and lint pass recorded
  • Parse last 5 entries:
```shell
grep "^## \[" wiki/log.md | tail -5
```

Obsidian Vault Implementation

Directory Layout

```
wiki/
├── inbox/
│   └── Clippings/          # Layer 1: Raw sources — web clips, articles, PDFs
│                           # IMMUTABLE — LLM reads but NEVER modifies
├── pages/                  # Layer 2: LLM-maintained wiki pages
│   ├── concepts/           # Ideas, patterns, technologies, methods
│   ├── topics/             # Broad topic summaries / synthesis pages
│   └── index.md            # Content catalog — read this first on every query
├── _templates/
│   └── page.md             # Template for new wiki pages
├── log.md                  # Append-only activity log
└── SCHEMA.md               # Layer 3: Operating instructions for the LLM
```

Page Format

Every wiki page uses this frontmatter:

```yaml
---
type: wiki-concept | wiki-entity | wiki-topic | wiki-summary
date: YYYY-MM-DD          # date first created
updated: YYYY-MM-DD       # date last updated
tags: [tag1, tag2]
sources: [filename.md]    # raw source files this page draws from
status: active | stub | needs-review
---
```
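A lint pass can validate this frontmatter mechanically. A minimal sketch with a hand-rolled parser (a real implementation would use a YAML library; `check_frontmatter` is an illustrative name):

```python
import re

REQUIRED = {"type", "date", "updated", "tags", "sources", "status"}
VALID_TYPE = {"wiki-concept", "wiki-entity", "wiki-topic", "wiki-summary"}
VALID_STATUS = {"active", "stub", "needs-review"}

def check_frontmatter(text: str) -> list[str]:
    """Return a list of problems with a wiki page's YAML frontmatter."""
    m = re.match(r"---\n(.*?)\n---", text, re.DOTALL)
    if not m:
        return ["missing frontmatter block"]
    # naive key: value parsing; strips trailing "# ..." comments
    fields = dict(line.split(":", 1)
                  for line in m.group(1).splitlines() if ":" in line)
    fields = {k.strip(): v.split("#")[0].strip() for k, v in fields.items()}
    problems = [f"missing field: {f}" for f in REQUIRED - set(fields)]
    if fields.get("type") not in VALID_TYPE:
        problems.append(f"bad type: {fields.get('type')}")
    if fields.get("status") not in VALID_STATUS:
        problems.append(f"bad status: {fields.get('status')}")
    return problems
```

Run it over every file in `pages/` during lint and feed the problem list back to the LLM to fix.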

Heading structure:

```markdown
# Title
## Summary          — 2–4 sentence synthesis
## Key Points       — bulleted facts, claims, definitions
## Connections      — [[wikilinks]] to related pages
## Open Questions   — gaps, contradictions, things to investigate
## Sources          — links back to inbox/ raw files
```

Schema File (SCHEMA.md)

The SCHEMA.md tells the LLM exactly how to operate:

  • Directory structure and layer definitions
  • Page format and frontmatter spec
  • Step-by-step procedures for Ingest, Query, Lint
  • Conventions: raw sources are immutable, good answers get filed, index stays current, log is append-only

This file is co-evolved by human and LLM. As edge cases arise, add them here. Think of it as the LLM’s “job description” for maintaining the wiki.
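As a starting point, a SCHEMA.md skeleton might look like the following; the exact wording is illustrative, but the conventions are the ones described above:

```markdown
# SCHEMA.md — Operating instructions

## Layers
- inbox/Clippings/ — raw sources. Read-only. Never edit.
- pages/ — wiki pages you own. Keep index.md current.
- SCHEMA.md — this file. Propose edits when conventions drift.

## Operations
- Ingest: read source → update/create pages → update index.md → append to log.md
- Query: read index.md → read relevant pages → answer with [[wikilinks]] → file substantive answers
- Lint: find contradictions, orphans, missing stubs, stale claims → append findings to log.md

## Conventions
- Prefer updating existing pages over creating new ones
- log.md is append-only
- Every page uses the standard frontmatter and heading structure
```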


Real Vault Stats

This Obsidian vault runs the LLM-KB pattern with a cybersecurity focus:

| Metric | Value |
| ------ | ----- |
| Total wiki pages | 102 |
| Raw sources ingested | 769 |
| Top-level categories | 19 |
| Largest subcategory | AWS (~100+ files: IAM, EKS, boto3, CI/CD, ML) |
| Richest AI category | AI/LLM (OWASP LLM Top 10, RAG, fine-tuning, MCP security) |
| Init date | 2026-04-08 |

Batch ingest example — 768 SecurityKB source files → 100 wiki pages in one pass:

```markdown
## [2026-04-08] ingest | SecurityKB batch ingest — 768 source files → 100 wiki pages
- Source: wiki/reference/SecurityKB/_posts/ (768 .md files across 19 categories)
- Strategy: subcategory-level granularity — one wiki page per subfolder
- Pages created: 100 pages across wiki/pages/topics/
- Index updated: wiki/pages/index.md (101 total pages)
```

Obsidian Tips

| Tip | Details |
| --- | ------- |
| Web Clipper | Browser extension converts articles → markdown for inbox/Clippings/ |
| Graph view | Visualize wiki topology — spot orphan pages and hub pages |
| Dataview | Add YAML frontmatter (tags, date, source) → render dynamic tables |
| Marp | Generate slide decks directly from wiki pages (Obsidian plugin available) |
| Git | Wiki is just markdown files → version history and branching are free |
| Download images locally | Settings → Files and links → set attachment path; bind hotkey for "Download attachments" |
| qmd | Local CLI + MCP server for hybrid BM25/vector search across markdown — useful when index.md alone isn't enough at scale (500+ pages) |

“Obsidian is the IDE; the LLM is the programmer; the wiki is the codebase.” — Keep LLM agent open on one side, Obsidian on the other. LLM edits, you browse in real time.


Key Takeaway

The core insight: maintenance cost is why wikis die.

  • Humans abandon wikis because bookkeeping grows faster than value
  • LLMs don’t get bored, don’t forget to update cross-references, can touch 15 files in one pass
  • The Memex vision (Vannevar Bush, 1945) — a private, curated knowledge store with associative trails — was always the right idea. The missing piece was who does the maintenance. The LLM handles that.

The workflow in one line:

Spend your time curating sources and asking good questions. Let the LLM do everything else.


References

  • Andrej Karpathy — LLM Wiki Gist
  • qmd — local markdown search (BM25/vector hybrid, CLI + MCP)
  • Obsidian Web Clipper — browser extension for markdown clipping
  • Vannevar Bush — As We May Think (1945) — conceptual ancestor of LLM-KB
This post is licensed under CC BY 4.0 by the author.
