Shared Content Library
The shared content library lets you ingest books, websites, papers, and
documents into the knowledge graph. Documents are parsed into structural
elements (chapters, sections, paragraphs), stored as KG symbols with
well-known doc:* predicates, and embedded via VSA for semantic search.
How It Works
 File / URL
      │
      ▼
┌───────────┐     ┌───────────┐     ┌───────────┐     ┌───────────┐
│   Parse   │ --> │   Chunk   │ --> │  Extract  │ --> │   Embed   │
│ HTML/PDF/ │     │ normalize │     │  triples  │     │ VSA vecs  │
│ EPUB/text │     │ 200-500w  │     │    NLP    │     │ per chunk │
└───────────┘     └───────────┘     └───────────┘     └───────────┘
      │                                                     │
      ▼                                                     ▼
┌───────────┐                                         ┌───────────┐
│  Catalog  │   catalog.json with document metadata   │ Item Mem  │
└───────────┘                                         └───────────┘
- Parse -- Format-specific parser (HTML, PDF, EPUB, plain text) extracts headings, paragraphs, and metadata.
- Chunk -- Short paragraphs are merged and long ones split to produce chunks targeting 200-500 words for consistent NLP quality.
- Extract -- NLP extraction discovers triples from each chunk's text.
- Embed -- Each chunk is encoded as a VSA hypervector and stored in item memory for semantic search.
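The chunk step can be illustrated with a minimal sketch. It is not the library's implementation: the function name `normalize_chunks` and the greedy merge strategy are assumptions; only the 200-500 word target comes from the description above.

```python
def normalize_chunks(paragraphs, min_words=200, max_words=500):
    """Merge short paragraphs and split long ones, targeting 200-500 words.

    Hypothetical helper: a greedy strategy assumed for illustration only.
    """
    chunks = []
    buffer = []  # words accumulated from consecutive short paragraphs

    def flush():
        if buffer:
            chunks.append(" ".join(buffer))
            buffer.clear()

    for para in paragraphs:
        words = para.split()
        if not words:
            continue
        if len(words) > max_words:
            # Long paragraph: emit pending text first, then split on word count.
            flush()
            for start in range(0, len(words), max_words):
                chunks.append(" ".join(words[start:start + max_words]))
            continue
        if len(buffer) + len(words) > max_words:
            flush()
        buffer.extend(words)
        if len(buffer) >= min_words:
            flush()
    flush()  # trailing text may be shorter than min_words
    return chunks
```

In this sketch no chunk exceeds `max_words`, but the tail of a split paragraph or the end of a document can fall below `min_words`; a real implementation might rebalance those.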
Supported Formats
| Format | Detection | Parser |
|---|---|---|
| HTML | .html/.htm extension or URL source | scraper crate -- extracts <h1>-<h6>, <p>, <meta> |
| PDF | .pdf extension | pdf-extract crate -- page-level text extraction |
| EPUB | .epub extension | epub crate -- spine items become chapters |
| Plain text | fallback | Splits on double newlines for paragraphs |
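The detection rules in the table can be sketched as follows. The names `detect_format` and `EXTENSION_FORMATS` are hypothetical, and the precedence (explicit override, then extension, then URL, then plain-text fallback) is an assumption; the table does not say how a URL ending in .pdf is handled.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extension-to-format mapping taken from the table above.
EXTENSION_FORMATS = {".html": "html", ".htm": "html", ".pdf": "pdf", ".epub": "epub"}


def detect_format(source, override=None):
    """Guess a document format from a file path or URL (illustrative only)."""
    if override:
        return override  # mirrors the effect of --format
    parsed = urlparse(source)
    is_url = parsed.scheme in ("http", "https")
    path = parsed.path if is_url else source
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in EXTENSION_FORMATS:
        return EXTENSION_FORMATS[suffix]
    # Assumption: URLs without a recognized extension are treated as HTML,
    # everything else falls back to plain text.
    return "html" if is_url else "text"
```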
Document Structure in the KG
Each ingested document creates a hierarchy of symbols:
doc:{slug}                                  # Document root
├── doc:has_chapter → ch:{slug}:0
│   ├── doc:has_paragraph → para:{slug}:0
│   ├── doc:has_paragraph → para:{slug}:1
│   └── doc:has_section → sec:{slug}:0:1
├── doc:has_chapter → ch:{slug}:1
│   └── doc:has_paragraph → para:{slug}:2
...
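As a rough illustration of the naming scheme in the tree above, the sketch below generates the chapter and paragraph triples for a document. The function `document_triples` and the tuple representation are assumptions; sections are left out for brevity. Paragraph numbers run across the whole document, as the tree suggests (para:{slug}:2 sits in chapter 1).

```python
def document_triples(slug, chapters):
    """Build (subject, predicate, object) tuples for a document hierarchy.

    `chapters` is a list of chapters, each a list of paragraph texts.
    Hypothetical helper, not the library's API.
    """
    doc = f"doc:{slug}"
    triples = []
    para_index = 0  # paragraph numbering is document-wide, not per chapter
    for ch_index, paragraphs in enumerate(chapters):
        chapter = f"ch:{slug}:{ch_index}"
        triples.append((doc, "doc:has_chapter", chapter))
        for _ in paragraphs:
            para = f"para:{slug}:{para_index}"
            triples.append((chapter, "doc:has_paragraph", para))
            para_index += 1
    return triples
```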
Well-Known Predicates
| Predicate | Description |
|---|---|
| doc:has_chapter | Document → chapter |
| doc:has_section | Chapter → section |
| doc:has_paragraph | Document/chapter → paragraph chunk |
| doc:next_chunk | Paragraph → next paragraph (reading order) |
| doc:has_title | Document → title string |
| doc:has_author | Document → author string |
| doc:has_format | Document → format (html, pdf, epub, text) |
| doc:has_source | Document → source path or URL |
| doc:has_language | Document → language code |
| doc:has_description | Document → description string |
| doc:has_keyword | Document → keyword |
| doc:has_tag | Document → user tag |
| doc:chunk_text | Paragraph → raw text content |
| doc:chunk_index | Paragraph → ordinal position |
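To show how the paragraph-level predicates fit together, here is a small sketch that emits doc:chunk_text, doc:chunk_index, and doc:next_chunk triples and then follows doc:next_chunk to recover reading order. The helpers `chunk_triples` and `reading_order` are hypothetical, and plain tuples stand in for KG symbols.

```python
def chunk_triples(slug, chunk_texts):
    """Emit per-paragraph triples using the well-known doc:* predicates."""
    triples = []
    for i, text in enumerate(chunk_texts):
        para = f"para:{slug}:{i}"
        triples.append((para, "doc:chunk_text", text))
        triples.append((para, "doc:chunk_index", i))
        if i + 1 < len(chunk_texts):
            # Link each paragraph to its successor; the last one has no link.
            triples.append((para, "doc:next_chunk", f"para:{slug}:{i + 1}"))
    return triples


def reading_order(triples, first):
    """Walk doc:next_chunk links from `first` and return the chunk texts."""
    next_of = {s: o for s, p, o in triples if p == "doc:next_chunk"}
    text_of = {s: o for s, p, o in triples if p == "doc:chunk_text"}
    order = []
    current = first
    while current is not None:
        order.append(text_of[current])
        current = next_of.get(current)
    return order
```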
Catalog
The catalog is a persistent JSON index at ~/.local/share/akh-medu/library/catalog.json
that tracks all ingested documents with their metadata. It stores the document
ID (slug), title, format, source path/URL, tags, and chunk count.
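The exact schema of catalog.json is not specified here. The sketch below assumes a JSON array of entries with key names (`id`, `title`, `format`, `source`, `tags`, `chunk_count`) chosen to match the fields listed above; the sample entry and its values are invented for illustration.

```python
import json

# Hypothetical catalog content: key names and values are assumptions.
SAMPLE_CATALOG = """
[
  {"id": "quantum-mechanics-textbook", "title": "Quantum Mechanics",
   "format": "epub", "source": "book.epub",
   "tags": ["physics", "textbook"], "chunk_count": 412}
]
"""


def find_by_tag(catalog_json, tag):
    """Return the IDs of catalog entries carrying the given tag."""
    entries = json.loads(catalog_json)
    return [e["id"] for e in entries if tag in e.get("tags", [])]
```

Check the real file before relying on any of these key names.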
CLI Commands
library add
Add a document to the library from a file path or URL.
akh-medu library add paper.pdf
akh-medu library add https://example.com/article.html --title "My Article"
akh-medu library add book.epub --tags "physics,textbook"
akh-medu library add notes.txt --format text
| Option | Description |
|---|---|
| --title <TEXT> | Override document title |
| --tags <LIST> | Comma-separated tags |
| --format <FMT> | Override format detection: html, pdf, epub, text |
library list
List all documents in the library.
akh-medu library list
library search
Search library content by text similarity.
akh-medu library search --query "quantum entanglement" --top-k 10
| Option | Description | Default |
|---|---|---|
| --query <TEXT> | Search text | Required |
| --top-k <N> | Maximum results | 5 |
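The idea behind similarity search over hypervectors can be sketched with a toy encoder. This is a generic VSA-style illustration, not akh-medu's actual encoding: the dimensionality, the per-word random bipolar vectors, bundling by summation, and cosine scoring are all assumptions.

```python
import math
import random

DIM = 2048  # assumed dimensionality for this toy example


def word_vector(word):
    """Deterministic pseudo-random bipolar (+1/-1) vector for a word."""
    rng = random.Random(word)  # string seeds are stable across runs
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def encode(text):
    """Bundle word vectors by element-wise summation."""
    total = [0] * DIM
    for word in text.lower().split():
        for i, v in enumerate(word_vector(word)):
            total[i] += v
    return total


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def search(query, chunks, top_k=5):
    """Return up to top_k (score, chunk) pairs, best match first."""
    q = encode(query)
    scored = [(cosine(q, encode(c)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Chunks that share words with the query score well above unrelated ones, whose scores hover near zero; this toy encoder matches exact words only, so it does not capture synonyms.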
library info
Show detailed information about a document.
akh-medu library info quantum-mechanics-textbook
library remove
Remove a document from the library.
akh-medu library remove quantum-mechanics-textbook
library watch
Watch a directory for new files and auto-ingest them. Defaults to the
library inbox directory (~/.local/share/akh-medu/library/inbox/).
akh-medu library watch
akh-medu library watch --dir /path/to/papers/
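For scripting a similar workflow outside the built-in watcher, a polling loop is one option. The sketch below is an approximation under stated assumptions: the helper names are hypothetical, the built-in command may well use filesystem notifications instead of polling, and files already present at startup are treated as known. The only CLI call it relies on is the documented `akh-medu library add <path>`.

```python
import subprocess
import time
from pathlib import Path


def new_files(directory, seen):
    """Return files in `directory` not yet in `seen`, and record them."""
    found = sorted(
        p for p in Path(directory).iterdir() if p.is_file() and p not in seen
    )
    seen.update(found)
    return found


def watch(directory, interval=5.0):
    """Poll `directory` forever and ingest each newly appearing file."""
    seen = set()
    new_files(directory, seen)  # assumption: skip files that already exist
    while True:
        for path in new_files(directory, seen):
            # check=False: a failed ingest should not stop the watcher.
            subprocess.run(["akh-medu", "library", "add", str(path)], check=False)
        time.sleep(interval)
```

Calling `watch("/path/to/papers/")` would then block and ingest files as they arrive; a file still being written when it is picked up may be ingested incomplete, which a production watcher would need to guard against.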
Agent Integration
The agent has two tools for working with library content:
- content_ingest -- Ingest a document (file or URL) into the library. The agent uses this when a goal involves importing or learning from external content. See Tools.
- library_search -- Search ingested library paragraphs by natural language query via VSA similarity. The agent uses this when a goal asks about previously ingested content (e.g., "What did that paper say about gravity?"). See Tools.
Both tools are scored by the OODA loop's VSA-based tool selector using synonym-expanded keyword profiles, so natural language goals reliably activate them.
Compartment Integration
Each ingested document is stored in its own knowledge compartment
(library:{slug}), which can be mounted by any workspace. This keeps
document knowledge isolated until explicitly shared. See
Knowledge Compartments for details.