LLM Engine Service Guide
Serve private local models for generation and embeddings with opt-in routing to cloud providers when required.
Local-first inference
The default configuration downloads and serves a compact model via the `llm-local` runtime so prompts never leave your machine.
- Supports CPU-only execution with optional GPU acceleration.
- Fixed seed and temperature controls keep outputs reproducible (see the request sketch after this list).
- Model upgrades are versioned and canary tested via the download verifier.
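A minimal sketch of a reproducible generation request follows. The endpoint URL, port, and JSON field names are assumptions for illustration; the `llm-local` runtime's actual request schema is not documented here.

```python
import requests

# Hypothetical local endpoint exposed by the llm-local runtime; the URL,
# port, and field names below are illustrative, not a documented API.
LLM_LOCAL_URL = "http://127.0.0.1:8080/generate"

def generate(prompt: str, seed: int = 42, temperature: float = 0.0) -> str:
    """Request a completion with a fixed seed and temperature so that
    repeated calls with the same inputs return the same output."""
    response = requests.post(
        LLM_LOCAL_URL,
        json={
            "prompt": prompt,
            "seed": seed,               # fixed seed -> reproducible sampling
            "temperature": temperature, # 0.0 -> greedy, fully deterministic
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["text"]

# Two identical calls should yield byte-identical completions.
assert generate("Summarize the release notes.") == generate("Summarize the release notes.")
```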
Echo avoidance tests
Every build runs a non-echo regression test to ensure the model produces novel reasoning, rather than echoing its prompt, before it is promoted.
- `mcp doctor --llm` triggers the verification harness on demand.
- Evaluation metrics (entropy, similarity) stream into episodic memory for auditing; a sketch of these metrics follows this list.
- Failures revert to the previous known-good model automatically.
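The verification harness itself is internal, but a minimal sketch of the two metrics named above might look like this. The whitespace tokenization and the threshold values are assumptions, not the harness's actual implementation.

```python
import math
from collections import Counter

def token_entropy(tokens: list[str]) -> float:
    """Shannon entropy (bits/token) of the completion's token distribution;
    near-zero entropy suggests degenerate, repetitive output."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def prompt_similarity(prompt: str, completion: str) -> float:
    """Jaccard overlap between prompt and completion token sets; a value
    near 1.0 means the model is mostly echoing its input."""
    p, c = set(prompt.split()), set(completion.split())
    return len(p & c) / len(p | c) if p | c else 0.0

def is_echo(prompt: str, completion: str,
            max_similarity: float = 0.8, min_entropy: float = 1.0) -> bool:
    # Thresholds are illustrative; the real harness's cutoffs are not documented.
    return (prompt_similarity(prompt, completion) > max_similarity
            or token_entropy(completion.split()) < min_entropy)
```

A completion that fails either check (too similar to the prompt, or too repetitive to carry new information) would count as an echo, triggering the automatic rollback described above.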
Embeddings
`/embed` delivers dense vectors that integrate with vector memory and search services without exposing documents to the cloud.
- Supports batching and dimension configuration per workspace (see the request sketch after this list).
- Metadata tags track provenance and retention policy requirements.
- Optional cloud connectors can be toggled per call with signed approvals.
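A minimal sketch of a batched `/embed` call follows. The host, port, and request fields (`texts`, `dimensions`, `metadata`) are assumptions for illustration, not the service's confirmed schema.

```python
import requests

# Illustrative call to the /embed endpoint; host, port, and the request
# body shown here are assumptions for this sketch.
EMBED_URL = "http://127.0.0.1:8080/embed"

def embed_batch(texts: list[str], dimensions: int = 384) -> list[list[float]]:
    """Send one batched request instead of one call per document, tagging
    the batch with provenance metadata for retention auditing."""
    response = requests.post(
        EMBED_URL,
        json={
            "texts": texts,
            "dimensions": dimensions,  # per-workspace dimension configuration
            "metadata": {"source": "workspace-docs", "retention": "30d"},
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["vectors"]

vectors = embed_batch(["first document", "second document"])
assert len(vectors) == 2  # one dense vector per input text
```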