Local language model generation and embeddings—private by default
Run compact models locally for text generation and semantic embeddings. No data leaves your device unless you explicitly opt into cloud routing.
What the LLM Engine does
The LLM Engine runs a compact language model entirely on your machine. It powers text generation for summaries, drafts, and planning, plus semantic embeddings for vector search. Setup includes a verification test that confirms the model produces non-echo completions, ensuring real reasoning capability.
Core capabilities
- Local text generation with configurable temperature and max tokens
- Semantic embeddings for vector memory and RAG
- Non-echo verification tests during setup
- Optional cloud model routing (opt-in only)
Model selection
- Default: Compact model optimized for CPU inference
- GPU acceleration supported for faster generation
- Model catalog UI (planned) for easy switching
- Cloud models (opt-in): enable by setting LLM_ENABLE_CLOUD=true and providing API keys
Who benefits from the LLM Engine
Individuals
Private summaries and drafts without sending data to the cloud
Teams & Managers
Predictable, controllable generation with full logs and no vendor lock-in
Developers & IT
Stable /generate and /embed contracts with local-first defaults
Security & Compliance
No data leaves the device unless you explicitly opt in
How it works
Setup: Download & Verify
The setup script downloads a compact model (typically 1-3GB) and runs a verification test to confirm non-echo completions.
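The exact check the setup script runs isn't spelled out here, but the idea can be sketched roughly as follows, assuming the engine listens on localhost:8000 and /generate returns a JSON object with a text field (the address, port, and field name are all assumptions):

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed local address; match your setup

prompt = "Name three primary colors."
resp = requests.post(f"{BASE_URL}/generate", json={"prompt": prompt}, timeout=120)
resp.raise_for_status()
completion = resp.json().get("text", "")  # response field name is an assumption

# An echoing model repeats the prompt back more or less verbatim; a real
# completion should add content that is not already in the prompt.
is_echo = completion.strip().lower() == prompt.strip().lower()
print("non-echo check:", "FAILED (model echoes input)" if is_echo else "passed")
```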
Generate: /generate endpoint
Send a prompt to /generate with optional parameters (temperature, max_tokens). The model produces a completion locally.
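For example, a minimal call might look like the sketch below, again assuming a localhost:8000 address and a JSON body with prompt, temperature, and max_tokens fields (the exact request and response schema may differ in your installation):

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed; match your local setup

payload = {
    "prompt": "Summarize the key action items from this week's meeting notes.",
    "temperature": 0.7,   # documented default
    "max_tokens": 512,    # documented default
}

# The completion is produced locally; the only network call is to localhost.
resp = requests.post(f"{BASE_URL}/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json())  # inspect once to confirm the response shape for your version
```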
Embed: /embed endpoint
Send text to /embed to get semantic vectors for use with vector memory or RAG workflows.
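A matching sketch for embeddings, with the same caveats about the address and payload shape:

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed; match your local setup

chunks = ["Q3 roadmap decisions", "Hiring plan for the data team"]

# Request one semantic vector per text chunk for use in vector memory or RAG.
resp = requests.post(f"{BASE_URL}/embed", json={"texts": chunks}, timeout=60)
resp.raise_for_status()
vectors = resp.json()  # assumed shape: one list of floats per input chunk
print(f"{len(vectors)} embeddings returned")
```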
Optional: Cloud routing
Set LLM_ENABLE_CLOUD=true and provide API keys to route requests to cloud models when needed.
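Because routing is controlled by an environment variable, a local-first check might look like this sketch; only LLM_ENABLE_CLOUD is documented above, and the provider key variable name shown is an assumption:

```python
import os

# Local-first default: cloud routing stays off unless explicitly enabled.
enable_cloud = os.environ.get("LLM_ENABLE_CLOUD", "false").lower() == "true"

if enable_cloud:
    # The exact key variable depends on your provider; this name is illustrative.
    api_key = os.environ.get("CLOUD_PROVIDER_API_KEY")
    print("cloud routing enabled" if api_key else "cloud enabled but no API key set")
else:
    print("running fully local; prompts never leave this machine")
```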
Example workflows
Offline note summarization
Runs entirely offline
Input: Weekly meeting notes (3 markdown files)
- docs.search_docs (retrieve notes)
- llm.generate (prompt: "Summarize key action items")
- Return summary text
Bulleted summary with action items—no network calls, no API keys, fully private
Non-echo test passed during setup
Private RAG over local docs
Runs entirely offline
Input: "What did we decide about the Q3 roadmap?"
- llm.embed (query text)
- vector.search (find relevant docs)
- llm.generate (answer question with context)
Answer with citations; all embeddings and generation happen locally (see the sketch below)
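The three steps above could be chained roughly as in this sketch; the endpoint shapes are the same assumptions as before, and the vector search function is a stand-in for whatever local vector store you use:

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed; match your local setup

def embed(text: str):
    r = requests.post(f"{BASE_URL}/embed", json={"texts": [text]}, timeout=60)
    r.raise_for_status()
    return r.json()[0]  # assumed shape: one vector per input

def search_vectors(query_vector, top_k: int = 3):
    # Stand-in for the vector.search step; replace with your local vector store.
    return [{"text": "Placeholder note text about the Q3 roadmap decision."}]

def generate(prompt: str) -> str:
    r = requests.post(f"{BASE_URL}/generate", json={"prompt": prompt}, timeout=120)
    r.raise_for_status()
    return r.json().get("text", "")  # field name is an assumption

question = "What did we decide about the Q3 roadmap?"
context = "\n\n".join(doc["text"] for doc in search_vectors(embed(question)))
answer = generate(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
print(answer)
```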
Polite email draft
Approval before sending
Input: "Draft a follow-up email to Alice about the project delay"
- llm.generate (draft email with polite tone)
- comms.draft_message (format as email)
- Pause for approval before sending
Draft email ready for review; nothing is sent until approved (see the sketch below)
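A sketch of the approval gate, assuming the draft comes back from /generate and the actual send is a separate step that only runs after a human says yes (the response field and the downstream mail hand-off are illustrative):

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed; match your local setup

resp = requests.post(
    f"{BASE_URL}/generate",
    json={"prompt": "Draft a polite follow-up email to Alice about the project delay."},
    timeout=120,
)
resp.raise_for_status()
draft = resp.json().get("text", "")  # field name is an assumption

print(draft)
# Nothing is sent automatically: the send only proceeds after explicit approval.
if input("Send this draft? [y/N] ").strip().lower() == "y":
    print("approved: hand the draft to your mail tool (e.g. via comms.draft_message)")
else:
    print("draft discarded; nothing was sent")
```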
Technical details
Configuration
- LLM_ENABLE_CLOUD: false (default)
- MODEL_PATH: local model directory
- MAX_TOKENS: default 512
- TEMPERATURE: default 0.7
Performance notes
- CPU inference: 5-15 tokens/sec (depends on hardware)
- GPU acceleration: 50-100+ tokens/sec with CUDA
- Embedding generation: ~100ms per text chunk
- Model size: 1-3GB (compact models)
Observability
- Token counts (prompt + completion)
- Generation latency and throughput
- Error rates and timeout counters
- Model load time and memory usage
Security posture
Local execution by default
All generation and embeddings run on your device. No network calls unless cloud mode is explicitly enabled.
No data leakage
Prompts and completions never leave your machine in local mode. Cloud routing requires explicit configuration and API keys.
Verification tests
Setup includes a non-echo test that confirms the model can reason and doesn't just repeat inputs.
Audit logging
All prompts and completions are logged for compliance and debugging. Logs stay local by default.
Roadmap & status
Current features
- Local text generation and embeddings
- Non-echo verification tests
- Optional cloud model routing
Coming soon
- Streaming generation for real-time output
- GPU acceleration with CUDA/Metal support
- Model catalog UI for easy switching
Ready to run models locally?
Install MCP and verify your local LLM in minutes—no API keys, no cloud dependencies