GemmaKit

Private AI
running inside your app.

GemmaKit turns Gemma 4 E2B into an Apple Silicon-optimised local text runtime for apps that need private inference without a cloud dependency. The 4-bit MLX repack is 2.63 GB on disk, 902.7 MiB smaller than the source checkpoint, and has passed a 500-generation structured-output validation run at 100% parseability. It sits behind an OpenAI-compatible Chat Completions API, and prompt content never leaves the device.

2.63 GB runtime · 902.7 MiB smaller · 500-run validated · Apple Silicon
[On-device demo: a notebook app, running offline, plans the week around two deep-work blocks ("Block 9–11 on Tuesday and Thursday mornings, tag them as Focus and decline meetings") and adds them to the calendar.]
Client (Swift · OpenAI-style) → Runtime (local Gemma server) → Network (licence channel only)
GemmaKit Pro · private beta · OpenAI-compatible subset · Chat Completions only · Apple Silicon

Point your existing OpenAI-compatible client at a local base URL.

No SDK rewrite. Same JSON shape on the supported subset. Streamed or buffered, with optional local bearer-token auth.

# Stream a chat completion against the local server
curl http://127.0.0.1:11436/v1/chat/completions \
  -H "Authorization: Bearer $GEMMAKIT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-e2b-it",
    "stream": true,
    "stream_options": { "include_usage": true },
    "messages": [
      { "role": "system", "content": "You are concise." },
      { "role": "user",   "content": "Summarise the changelog above." }
    ]
  }'
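With stream set to true, the response arrives as OpenAI-style server-sent events: a sequence of data: lines, each carrying a JSON chunk with a delta, terminated by a [DONE] sentinel. A minimal Python sketch of turning those lines back into text (the sample lines below follow the standard Chat Completions streaming shape and are illustrative, not captured output):

```python
import json

def stream_deltas(lines):
    """Yield content fragments from OpenAI-style SSE 'data:' lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if "content" in delta:
                yield delta["content"]

# Illustrative SSE lines in the Chat Completions streaming shape
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"},"index":0}]}',
    'data: {"choices":[{"delta":{"content":"Buffered waits; "},"index":0}]}',
    'data: {"choices":[{"delta":{"content":"streamed arrives token by token."},"index":0}]}',
    "data: [DONE]",
]
text = "".join(stream_deltas(sample))
```

The same parser works buffered or streamed: a buffered response is simply one JSON body with message instead of delta, so most clients branch once on the stream flag and share everything downstream.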

Gemma 4 power, repacked for the local footprint.

A 0.7B-parameter, 4-bit text runtime built for Apple Silicon apps: 2.63 GB on disk, 1,234 text tensors kept, 1,415 unused audio and vision tensors removed, and 100% parseability across a 500-generation structured-output validation run.

  Included

  • Gemma 4 E2B instruction-tuned text model in an Apple Silicon-optimised 4-bit MLX package
  • Text-only repack that saves 902.7 MiB against the source MLX checkpoint
  • Internal 500-generation validation run with 100% parseability and 100% determinism
  • Runtime shape built for quick local loading inside app-controlled storage
  • Local text chat completions on the customer device
  • OpenAI-compatible Chat Completions endpoint
  • Swift client helpers for app integration
  • Buffered and streamed responses

  Out of scope

  • Cloud inference of any kind
  • Full OpenAI API or Responses API parity
  • Tool / function calling, images, audio, embeddings
  • Hosted retrieval or stored completions
  • Automatic model download or remote registry
  • Unlimited offline entitlement
  • Replacement for legal review of model distribution

A request, a stream, a local response.

Click Run to preview the streaming shape. The response is illustrative; production tokens are generated by the local runtime on the device.

[Interactive demo — POST 127.0.0.1:11436/v1/chat/completions · user prompt: "Explain the difference between buffered and streamed responses in two sentences." · model gemma-4-e2b-it · stream true · latency (TTFT) and token counters start at 0.]

No prompts leave the device. Not in the demo. Not in production.

Prompts, completions, local documents, model artefacts, and embeddings are not sent to the licence service. The runtime binds to 127.0.0.1 by default, and only the licence channel reaches the network.

bind      127.0.0.1:11436
auth      local bearer · optional
cors      configurable
egress    licence channel only
content   not sent server-side
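When the optional bearer auth is on, the gate reduces to comparing the Authorization header against the locally configured token. A minimal sketch of that check in Python, under stated assumptions: the token value and header dict here are stand-ins, not GemmaKit's actual internals, and a constant-time comparison is used as good practice rather than a documented implementation detail:

```python
import hmac

LOCAL_TOKEN = "local-dev-token"  # stand-in for the locally configured bearer token

def authorized(headers):
    """Return True when the Authorization header carries the local bearer token."""
    value = headers.get("Authorization", "")
    if not value.startswith("Bearer "):
        return False
    presented = value[len("Bearer "):]
    # constant-time comparison avoids leaking the token via timing differences
    return hmac.compare_digest(presented, LOCAL_TOKEN)

ok = authorized({"Authorization": "Bearer local-dev-token"})
bad = authorized({"Authorization": "Bearer wrong"})
```

Because the server binds to 127.0.0.1, the token mainly guards against other local processes and misconfigured webviews, not remote callers.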

Every piece of the runtime is replaceable, inspectable, and on the device.

Six components. One binary. A Swift package. The rest is your app.

01 / runtime

Local Gemma server

A converted Gemma text model packaged behind a local HTTP server. Binds to 127.0.0.1 by default and never opens external ports.

02 / api

OpenAI-compatible subset

The Chat Completions endpoint, with the same JSON request and SSE response shape your existing client already speaks. No Responses API.

03 / client

Swift helpers

A small Swift surface for issuing chat completions, handling streamed deltas, and managing local bearer tokens from inside an Apple-platform app.

04 / auth

Bearer + CORS

Optional local bearer-token enforcement and configurable CORS for embedding the runtime in a webview-bearing app or a sibling local web tool.

05 / licence

Pro org licensing

Pro organisation keys, optional app-id binding, signed local licence certificates, and active-device reporting — without sending prompts or completions.

06 / scope

Just text

Text in, text out. No images, audio, embeddings, tool calls, retrieval, or stored completions — those are intentionally outside the boundary.

Monthly platform fee plus active-device usage.

A device counts once per billing period after activation, licence refresh, or a gated generation call. Repeated requests do not become per-token or per-request billing.
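The "counts once per billing period" rule is effectively a set union over (device, period) pairs: any qualifying event adds the pair, and repeats are no-ops. A small sketch of that counting rule in Python, with field names borrowed from the sample report (the real meter lives in the licence service; this only illustrates the arithmetic):

```python
def active_devices(events):
    """Count each device once per billing period, regardless of event volume."""
    seen = set()
    for e in events:
        # activation, licence refresh, or a gated generation call all qualify
        seen.add((e["device_id"], e["period_start"]))
    return len(seen)

events = [
    {"device_id": "dev_a", "period_start": "2026-05-01T00:00:00Z", "event": "activation"},
    {"device_id": "dev_a", "period_start": "2026-05-01T00:00:00Z", "event": "gated_generation"},
    {"device_id": "dev_b", "period_start": "2026-05-01T00:00:00Z", "event": "licence_refresh"},
]
count = active_devices(events)  # dev_a counted once despite two events
```

So a device that makes one gated call and a device that makes ten thousand cost the same in a given period.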

Pro base + device meter
Configured in Stripe · billed monthly
  • Pro organisation keys with optional app-id binding
  • Signed local licence certificates
  • Active-device reporting · no prompt content
  • Device and certificate revocation paths
sample
{
  "org_id":        "org_4e2fa1",
  "app_id":        "app.example.studio",
  "device_id":     "dev_opaque_installation_id",
  "event":         "gated_generation",
  "period_start":  "2026-05-01T00:00:00Z",
  "prompt_content": null,
  "completion_content": null,
  "certificate_id": "lic_9c1a..."
}
— what gets reported, in full

One package. Drop it next to your app.