Available now

Local language model generation and embeddings—private by default

Run compact models locally for text generation and semantic embeddings. No data leaves your device unless you explicitly opt into cloud routing.

What the LLM Engine does

The LLM Engine runs a compact language model entirely on your machine. It powers text generation for summaries, drafts, and planning, plus semantic embeddings for vector search. Setup includes a verification test that confirms the model produces genuine completions rather than simply echoing the prompt.

Core capabilities

  • Local text generation with configurable temperature and max tokens
  • Semantic embeddings for vector memory and RAG
  • Non-echo verification tests during setup
  • Optional cloud model routing (opt-in only)

Model selection

  • Default: Compact model optimized for CPU inference
  • GPU acceleration (planned) for faster generation
  • Model catalog UI (planned) for easy switching

Private by default: All generation and embeddings run locally. Cloud models require setting LLM_ENABLE_CLOUD=true and providing API keys.

Who benefits from the LLM Engine

Individuals

Private summaries and drafts without sending data to the cloud

Example: Summarize personal notes or draft emails entirely offline—no API keys, no network calls, no data leakage.

Teams & Managers

Predictable, controllable generation with full logs and no vendor lock-in

Example: Generate standardized reports on-prem with consistent formatting and auditable prompts.

Developers & IT

Stable /generate and /embed contracts with local-first defaults

Example: Integrate LLM calls into CI/CD or internal tools without managing cloud API keys or rate limits.

Security & Compliance

No data leaves the device unless you explicitly opt in

Control: Enforce local-only mode in policy; audit all prompts and completions for compliance.

How it works

1. Setup: Download & Verify

The setup script downloads a compact model (typically 1-3 GB) and runs a verification test to confirm non-echo completions.

2. Generate: /generate endpoint

Send a prompt to /generate with optional parameters (temperature, max_tokens). The model produces a completion locally.
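
For illustration, a minimal client call to this endpoint might look like the sketch below, using the Python requests library. The base URL and the exact response shape are assumptions; see the full API schema for the real contract.

    import requests

    BASE_URL = "http://localhost:8000"  # assumed local address of the LLM Engine

    resp = requests.post(
        f"{BASE_URL}/generate",
        json={
            "prompt": "Summarize the key action items from this week's notes.",
            "temperature": 0.7,  # documented default
            "max_tokens": 512,   # documented default
        },
        timeout=120,  # local CPU inference can take a while for long completions
    )
    resp.raise_for_status()
    print(resp.json())  # exact response shape is defined by the API schema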

3. Embed: /embed endpoint

Send text to /embed to get semantic vectors for use with vector memory or RAG workflows.
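
A matching embedding call might look like this sketch (same assumptions about the base URL and response fields):

    import requests

    BASE_URL = "http://localhost:8000"  # assumed local address

    resp = requests.post(
        f"{BASE_URL}/embed",
        json={"text": "Decisions from the Q3 roadmap planning meeting"},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()  # contains the embedding vector; the exact key is defined by the API schema

The returned vector can then be stored in vector memory or compared against other embeddings for retrieval.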

4. Optional: Cloud routing

Set LLM_ENABLE_CLOUD=true and provide API keys to route requests to cloud models when needed.

Verification: The non-echo test confirms the model generates a genuine completion rather than simply repeating the prompt. It runs automatically during setup.
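
Setup runs this check for you (and a POST /verify endpoint is listed among the key endpoints below). Purely as an illustration of the idea, not the engine's actual implementation, a manual non-echo check against the documented /generate endpoint might look like the following sketch; the base URL, response field names, and use of the Python requests library are assumptions.

    import requests

    BASE_URL = "http://localhost:8000"  # assumed local address

    prompt = "Name three primary colors."
    resp = requests.post(
        f"{BASE_URL}/generate",
        json={"prompt": prompt, "max_tokens": 64},
        timeout=120,
    )
    resp.raise_for_status()
    data = resp.json()
    completion = data.get("completion") or data.get("text") or ""  # field name is an assumption

    # The essence of the check: the output must not simply repeat the input.
    if completion.strip() == prompt.strip():
        raise RuntimeError("Model echoed the prompt; non-echo check failed")
    print("Non-echo check passed")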

Example workflows

Offline note summarization

Runs entirely offline
Input:

Weekly meeting notes (3 markdown files)

Steps:
  1. docs.search_docs (retrieve notes)
  2. llm.generate (prompt: "Summarize key action items")
  3. Return summary text
Output:

Bulleted summary with action items—no network calls, no API keys, fully private

Verification:

Non-echo test passed during setup
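
As a rough Python sketch of this workflow (the notes directory, base URL, and response fields are illustrative assumptions, and reading files directly stands in for docs.search_docs):

    from pathlib import Path
    import requests

    BASE_URL = "http://localhost:8000"  # assumed local address

    # 1. Gather the weekly notes (stand-in for docs.search_docs).
    notes = "\n\n".join(p.read_text() for p in sorted(Path("notes").glob("*.md")))

    # 2. Summarize locally (llm.generate).
    resp = requests.post(
        f"{BASE_URL}/generate",
        json={"prompt": f"Summarize key action items:\n\n{notes}", "max_tokens": 512},
        timeout=300,  # generous timeout for CPU inference over long input
    )
    resp.raise_for_status()

    # 3. Return the summary text.
    print(resp.json())  # exact response field per the API schema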

Private RAG over local docs

Runs entirely offline
Input:

"What did we decide about the Q3 roadmap?"

Steps:
  1. llm.embed (query text)
  2. vector.search (find relevant docs)
  3. llm.generate (answer question with context)
Output:

Answer with citations—all embeddings and generation happen locally
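
A compressed sketch of the same pipeline, using an in-memory cosine similarity as a stand-in for vector.search (the base URL, document contents, and the "embedding" response field are assumptions):

    import requests

    BASE_URL = "http://localhost:8000"  # assumed local address

    def embed(text):
        r = requests.post(f"{BASE_URL}/embed", json={"text": text}, timeout=30)
        r.raise_for_status()
        return r.json()["embedding"]  # field name is an assumption

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return dot / norm if norm else 0.0

    docs = {"roadmap.md": "...", "minutes.md": "..."}           # local document texts
    index = {name: embed(text) for name, text in docs.items()}  # llm.embed per document

    query = "What did we decide about the Q3 roadmap?"
    q_vec = embed(query)                                        # llm.embed (query text)

    # vector.search: pick the most similar local document.
    best = max(index, key=lambda name: cosine(q_vec, index[name]))

    # llm.generate: answer the question with the retrieved context.
    resp = requests.post(
        f"{BASE_URL}/generate",
        json={"prompt": f"Context ({best}):\n{docs[best]}\n\nQuestion: {query}"},
        timeout=120,
    )
    resp.raise_for_status()
    print(resp.json())

Every call here goes to the local endpoints, so the query, documents, and answer never leave the machine.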

Polite email draft

Approval before sending
Input:

"Draft a follow-up email to Alice about the project delay"

Steps:
  1. llm.generate (draft email with polite tone)
  2. comms.draft_message (format as email)
  3. Pause for approval before sending
Output:

Draft email ready for review—no send until approved

Technical details

Key endpoints

  • POST /generate
  • POST /embed
  • GET /models
  • POST /verify

View full API schema
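
As a quick illustration, the other two endpoints can be exercised the same way (the base URL and the empty /verify payload are assumptions; see the API schema for the real contracts):

    import requests

    BASE_URL = "http://localhost:8000"  # assumed local address

    # List the locally available models.
    print(requests.get(f"{BASE_URL}/models", timeout=10).json())

    # Re-run the non-echo verification check on demand.
    print(requests.post(f"{BASE_URL}/verify", timeout=120).json())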

Configuration

  • LLM_ENABLE_CLOUD — false (default)
  • MODEL_PATH — local model directory
  • MAX_TOKENS — default 512
  • TEMPERATURE — default 0.7
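
For illustration only, a local-first setup might use environment settings like these (the variable names are the documented ones; the values and the .env-style format are assumptions):

    LLM_ENABLE_CLOUD=false           # keep all generation and embeddings on-device (default)
    MODEL_PATH=/path/to/local/model  # directory containing the downloaded compact model
    MAX_TOKENS=512                   # default completion length
    TEMPERATURE=0.7                  # default sampling temperature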

Performance notes

  • CPU inference: 5-15 tokens/sec (depends on hardware)
  • GPU acceleration (planned): 50-100+ tokens/sec with CUDA
  • Embedding generation: ~100 ms per text chunk
  • Model size: 1-3 GB (compact models)
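
As a rough sizing example, a default 512-token completion takes roughly 35-100 seconds at CPU speeds of 5-15 tokens/sec, versus about 5-10 seconds at GPU speeds of 50-100+ tokens/sec.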

Observability

  • Token counts (prompt + completion)
  • Generation latency and throughput
  • Error rates and timeout counters
  • Model load time and memory usage

Security posture

Local execution by default

All generation and embeddings run on your device. No network calls unless cloud mode is explicitly enabled.

No data leakage

Prompts and completions never leave your machine in local mode. Cloud routing requires explicit configuration and API keys.

Verification tests

Setup includes a non-echo test that confirms the model generates genuine completions rather than simply repeating its inputs.

Audit logging

All prompts and completions are logged for compliance and debugging. Logs stay local by default.

Roadmap & status

Available

Current features

  • Local text generation and embeddings
  • Non-echo verification tests
  • Optional cloud model routing

Planned

Coming soon

  • Streaming generation for real-time output
  • GPU acceleration with CUDA/Metal support
  • Model catalog UI for easy switching

View full roadmap


Ready to run models locally?

Install MCP and verify your local LLM in minutes—no API keys, no cloud dependencies