
LLM Engine Service Guide

Serve private local models for generation and embeddings with opt-in routing to cloud providers when required.

Local-first inference

The default configuration downloads and serves a compact model via the `llm-local` runtime so prompts never leave your machine.

  • Supports CPU-only execution with optional GPU acceleration.
  • Deterministic seed and temperature controls keep outputs reproducible.
  • Model upgrades are versioned and canary-tested via the download verifier.
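
As a rough sketch of how the reproducibility controls might be exercised, the snippet below posts a generation request to a local endpoint with a fixed seed and zero temperature. The URL, route, and field names (`/generate`, `seed`, `temperature`) are illustrative assumptions, not the runtime's documented API.

```python
import json
import urllib.request

# Hypothetical local endpoint and field names for illustration;
# the actual llm-local runtime may expose different routes and fields.
LOCAL_URL = "http://localhost:8080/generate"

payload = {
    "prompt": "Summarize the release notes.",
    "seed": 42,          # fixed seed for reproducible sampling
    "temperature": 0.0,  # deterministic (greedy) decoding
}

req = urllib.request.Request(
    LOCAL_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["text"])
```

With the seed and temperature pinned, repeated calls with the same prompt should return the same completion, which is what makes regression comparisons across model versions meaningful.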

Echo avoidance tests

Every build runs a non-echo regression test to confirm the model produces novel reasoning, rather than parroting its prompt, before it is promoted.

  • `mcp doctor --llm` triggers the verification harness on demand.
  • Evaluation metrics (entropy, similarity) stream into episodic memory for auditing.
  • On failure, the service automatically reverts to the previous known-good model.
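
The harness's exact metrics are not specified here, but a minimal sketch of a non-echo check might score prompt/completion overlap and output entropy against thresholds, as below. The Jaccard and Shannon-entropy choices and the threshold values are illustrative assumptions, not the verifier's actual logic.

```python
import math
from collections import Counter

def shannon_entropy(tokens: list[str]) -> float:
    """Entropy (in bits) of the token frequency distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def jaccard_similarity(a: list[str], b: list[str]) -> float:
    """Vocabulary overlap between prompt and completion."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def passes_non_echo_check(prompt: str, completion: str,
                          max_similarity: float = 0.6,   # assumed threshold
                          min_entropy: float = 2.0) -> bool:  # assumed threshold
    p, c = prompt.split(), completion.split()
    return (jaccard_similarity(p, c) <= max_similarity
            and shannon_entropy(c) >= min_entropy)

# A completion that merely repeats the prompt should fail the check.
assert not passes_non_echo_check("the quick brown fox", "the quick brown fox")
```

Reverting on a failed check, rather than blocking promotion indefinitely, keeps the serving path on a known-good model while the regression is investigated.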

Embeddings

The `/embed` endpoint returns dense vectors that integrate with vector memory and search services without exposing documents to the cloud.

  • Supports batching and dimension configuration per workspace.
  • Metadata tags track provenance and retention policy requirements.
  • Optional cloud connectors can be toggled per call with signed approvals.
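
A minimal sketch of a batched `/embed` call follows. The field names (`inputs`, `dimensions`, `metadata`) and the local port are assumptions about the request contract, not the documented schema.

```python
import json
import urllib.request

# Illustrative request shape only; field names and the port are assumptions.
payload = {
    "inputs": ["first document", "second document"],  # batched texts
    "dimensions": 768,                                # per-workspace dimension setting
    "metadata": {"provenance": "wiki-export", "retention": "30d"},  # tags for auditing
}

req = urllib.request.Request(
    "http://localhost:8080/embed",  # hypothetical local route
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    vectors = json.load(resp)["embeddings"]
    print(len(vectors), "vectors of length", len(vectors[0]))
```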