Available now

Local language model generation and embeddings—private by default

Run compact models locally for text generation and semantic embeddings. No data leaves your device unless you explicitly opt into cloud routing.

What the LLM Engine does

The LLM Engine runs a compact language model entirely on your machine. It powers text generation for summaries, drafts, and planning, plus semantic embeddings for vector search. Setup includes a verification test that confirms the model produces genuine completions rather than simply echoing the prompt.

Core capabilities

  • Local text generation with configurable temperature and max tokens
  • Semantic embeddings for vector memory and RAG
  • Non-echo verification tests during setup
  • Optional cloud model routing (opt-in only)

Model selection

  • Default: Compact model optimized for CPU inference
  • GPU acceleration (planned) for faster generation
  • Model catalog UI (planned) for easy switching

Private by default: All generation and embeddings run locally. Cloud models require setting LLM_ENABLE_CLOUD=true and providing API keys.

Who benefits from the LLM Engine

Individuals

Private summaries and drafts without sending data to the cloud

Example: Summarize personal notes or draft emails entirely offline—no API keys, no network calls, no data leakage.

Teams & Managers

Predictable, controllable generation with full logs and no vendor lock-in

Example: Generate standardized reports on-prem with consistent formatting and auditable prompts.

Developers & IT

Stable /generate and /embed contracts with local-first defaults

Example: Integrate LLM calls into CI/CD or internal tools without managing cloud API keys or rate limits.

Security & Compliance

No data leaves the device unless you explicitly opt in

Control: Enforce local-only mode in policy; audit all prompts and completions for compliance.

How it works

1. Setup: Download & Verify

The setup script downloads a compact model (typically 1-3 GB) and runs a verification test to confirm non-echo completions.

2. Generate: /generate endpoint

Send a prompt to /generate with optional parameters (temperature, max_tokens). The model produces a completion locally.
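
For illustration, a minimal client call to this endpoint might look like the sketch below, using the Python requests library. The base URL and the exact response shape are assumptions; see the full API schema for the real contract.

    import requests

    BASE_URL = "http://localhost:8000"  # assumed local address of the LLM Engine

    resp = requests.post(
        f"{BASE_URL}/generate",
        json={
            "prompt": "Summarize the key action items from this week's notes.",
            "temperature": 0.7,  # documented default
            "max_tokens": 512,   # documented default
        },
        timeout=120,  # local CPU inference can take a while for long completions
    )
    resp.raise_for_status()
    print(resp.json())  # exact response shape is defined by the API schema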

3. Embed: /embed endpoint

Send text to /embed to get semantic vectors for use with vector memory or RAG workflows.
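
A matching embedding call might look like this sketch (same assumptions about the base URL and response fields):

    import requests

    BASE_URL = "http://localhost:8000"  # assumed local address

    resp = requests.post(
        f"{BASE_URL}/embed",
        json={"text": "Decisions from the Q3 roadmap planning meeting"},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()  # contains the embedding vector; the exact key is defined by the API schema

The returned vector can then be stored in vector memory or compared against other embeddings for retrieval.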

4. Optional: Cloud routing

Set LLM_ENABLE_CLOUD=true and provide API keys to route requests to cloud models when needed.

Verification: The non-echo test confirms the model generates a genuine completion rather than simply repeating the prompt. It runs automatically during setup.
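
Setup runs this check for you (and a POST /verify endpoint is listed among the key endpoints below). Purely as an illustration of the idea, not the engine's actual implementation, a manual non-echo check against the documented /generate endpoint might look like the following sketch; the base URL, response field names, and use of the Python requests library are assumptions.

    import requests

    BASE_URL = "http://localhost:8000"  # assumed local address

    prompt = "Name three primary colors."
    resp = requests.post(
        f"{BASE_URL}/generate",
        json={"prompt": prompt, "max_tokens": 64},
        timeout=120,
    )
    resp.raise_for_status()
    data = resp.json()
    completion = data.get("completion") or data.get("text") or ""  # field name is an assumption

    # The essence of the check: the output must not simply repeat the input.
    if completion.strip() == prompt.strip():
        raise RuntimeError("Model echoed the prompt; non-echo check failed")
    print("Non-echo check passed")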

Example workflows

Offline note summarization

Runs entirely offline
Input:

Weekly meeting notes (3 markdown files)

Steps:
  1. docs.search_docs (retrieve notes)
  2. llm.generate (prompt: "Summarize key action items")
  3. Return summary text
Output:

Bulleted summary with action items—no network calls, no API keys, fully private

Verification:

Non-echo test passed during setup
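
As a rough Python sketch of this workflow (the notes directory, base URL, and response fields are illustrative assumptions, and reading files directly stands in for docs.search_docs):

    from pathlib import Path
    import requests

    BASE_URL = "http://localhost:8000"  # assumed local address

    # 1. Gather the weekly notes (stand-in for docs.search_docs).
    notes = "\n\n".join(p.read_text() for p in sorted(Path("notes").glob("*.md")))

    # 2. Summarize locally (llm.generate).
    resp = requests.post(
        f"{BASE_URL}/generate",
        json={"prompt": f"Summarize key action items:\n\n{notes}", "max_tokens": 512},
        timeout=300,  # generous timeout for CPU inference over long input
    )
    resp.raise_for_status()

    # 3. Return the summary text.
    print(resp.json())  # exact response field per the API schema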

Private RAG over local docs

Runs entirely offline
Input:

"What did we decide about the Q3 roadmap?"

Steps:
  1. llm.embed (query text)
  2. vector.search (find relevant docs)
  3. llm.generate (answer question with context)
Output:

Answer with citations—all embeddings and generation happen locally
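
A compressed sketch of the same pipeline, using an in-memory cosine similarity as a stand-in for vector.search (the base URL, document contents, and the "embedding" response field are assumptions):

    import requests

    BASE_URL = "http://localhost:8000"  # assumed local address

    def embed(text):
        r = requests.post(f"{BASE_URL}/embed", json={"text": text}, timeout=30)
        r.raise_for_status()
        return r.json()["embedding"]  # field name is an assumption

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return dot / norm if norm else 0.0

    docs = {"roadmap.md": "...", "minutes.md": "..."}           # local document texts
    index = {name: embed(text) for name, text in docs.items()}  # llm.embed per document

    query = "What did we decide about the Q3 roadmap?"
    q_vec = embed(query)                                        # llm.embed (query text)

    # vector.search: pick the most similar local document.
    best = max(index, key=lambda name: cosine(q_vec, index[name]))

    # llm.generate: answer the question with the retrieved context.
    resp = requests.post(
        f"{BASE_URL}/generate",
        json={"prompt": f"Context ({best}):\n{docs[best]}\n\nQuestion: {query}"},
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json())

Every call here goes to the local endpoints, so the query, documents, and answer never leave the machine.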

Polite email draft

Approval before sending
Input:

"Draft a follow-up email to Alice about the project delay"

Steps:
  1. llm.generate (draft email with polite tone)
  2. comms.draft_message (format as email)
  3. Pause for approval before sending
Output:

Draft email ready for review—no send until approved

Technical details

Key endpoints

  • POST /generate
  • POST /embed
  • GET /models
  • POST /verify

View full API schema
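
As a quick illustration, the other two endpoints can be exercised the same way (the base URL and the empty /verify payload are assumptions; see the API schema for the real contracts):

    import requests

    BASE_URL = "http://localhost:8000"  # assumed local address

    # List the locally available models.
    print(requests.get(f"{BASE_URL}/models", timeout=10).json())

    # Re-run the non-echo verification check on demand.
    print(requests.post(f"{BASE_URL}/verify", timeout=120).json())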

Configuration

  • LLM_ENABLE_CLOUD — false (default)
  • MODEL_PATH — local model directory
  • MAX_TOKENS — default 512
  • TEMPERATURE — default 0.7
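
For illustration only, a local-first setup might use environment settings like these (the variable names are the documented ones; the values and the .env-style format are assumptions):

    LLM_ENABLE_CLOUD=false           # keep all generation and embeddings on-device (default)
    MODEL_PATH=/path/to/local/model  # directory containing the downloaded compact model
    MAX_TOKENS=512                   # default completion length
    TEMPERATURE=0.7                  # default sampling temperature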

Performance notes

  • CPU inference: 5-15 tokens/sec (depends on hardware)
  • GPU acceleration (planned): 50-100+ tokens/sec with CUDA
  • Embedding generation: ~100 ms per text chunk
  • Model size: 1-3 GB (compact models)
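
As a rough sizing example, a default 512-token completion takes roughly 35-100 seconds at CPU speeds of 5-15 tokens/sec, versus about 5-10 seconds at GPU speeds of 50-100+ tokens/sec.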

Observability

  • Token counts (prompt + completion)
  • Generation latency and throughput
  • Error rates and timeout counters
  • Model load time and memory usage

Security posture

Local execution by default

All generation and embeddings run on your device. No network calls unless cloud mode is explicitly enabled.

No data leakage

Prompts and completions never leave your machine in local mode. Cloud routing requires explicit configuration and API keys.

Verification tests

Setup includes a non-echo test that confirms the model generates genuine completions rather than simply repeating its inputs.

Audit logging

All prompts and completions are logged for compliance and debugging. Logs stay local by default.

Roadmap & status

Available

Current features

  • Local text generation and embeddings
  • Non-echo verification tests
  • Optional cloud model routing

Planned

Coming soon

  • Streaming generation for real-time output
  • GPU acceleration with CUDA/Metal support
  • Model catalog UI for easy switching

View full roadmap


Ready to run models locally?

Install MCP and verify your local LLM in minutes—no API keys, no cloud dependencies