GemmaKit

Private AI
running inside your app.

GemmaKit turns Gemma 4 E2B into an Apple Silicon-optimised local text runtime for apps that need private inference without a cloud dependency. The 4-bit MLX repack is 2.63 GB on disk, 902.7 MiB smaller than the source checkpoint, and has passed a 500-generation structured-output validation run at 100% parseability. It sits behind an OpenAI-compatible Chat Completions API, and prompt content never leaves the device.

2.63 GB runtime · 902.7 MiB smaller · 500-run validated · Apple Silicon
[On-device demo: a notebook app, running offline, plans the week around two deep-work blocks ("Block 9–11 on Tuesday and Thursday mornings, tag them as Focus and decline meetings") and adds them to the calendar.]
Client (Swift · OpenAI-style) → Runtime (local Gemma server) → Network (licence channel only)
GemmaKit Pro · private beta · OpenAI-compatible subset · Chat Completions only · Apple Silicon

Point your existing OpenAI-compatible client at a local base URL.

No SDK rewrite. Same JSON shape on the supported subset. Streamed or buffered, with optional local bearer-token auth.

# Stream a chat completion against the local server
curl http://127.0.0.1:11436/v1/chat/completions \
  -H "Authorization: Bearer $GEMMAKIT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-e2b-it",
    "stream": true,
    "stream_options": { "include_usage": true },
    "messages": [
      { "role": "system", "content": "You are concise." },
      { "role": "user",   "content": "Summarise the changelog above." }
    ]
  }'
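With stream set to true, the response arrives as OpenAI-style server-sent events: a sequence of data: lines, each carrying a JSON chunk with a delta, terminated by a [DONE] sentinel. A minimal Python sketch of turning those lines back into text (the sample lines below follow the standard Chat Completions streaming shape and are illustrative, not captured output):

```python
import json

def stream_deltas(lines):
    """Yield content fragments from OpenAI-style SSE 'data:' lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if "content" in delta:
                yield delta["content"]

# Illustrative SSE lines in the Chat Completions streaming shape
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"},"index":0}]}',
    'data: {"choices":[{"delta":{"content":"Buffered waits; "},"index":0}]}',
    'data: {"choices":[{"delta":{"content":"streamed arrives token by token."},"index":0}]}',
    "data: [DONE]",
]
text = "".join(stream_deltas(sample))
```

The same parser works buffered or streamed: a buffered response is simply one JSON body with message instead of delta, so most clients branch once on the stream flag and share everything downstream.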

Gemma 4 power, repacked for the local footprint.

A 0.7B-parameter, 4-bit text runtime built for Apple Silicon apps: 2.63 GB on disk, 1,234 text tensors kept, 1,415 unused audio and vision tensors removed, and 100% parseability across a 500-generation structured-output validation run.

  Included

  • Gemma 4 E2B instruction-tuned text model in an Apple Silicon-optimised 4-bit MLX package
  • Text-only repack that saves 902.7 MiB against the source MLX checkpoint
  • Internal 500-generation validation run with 100% parseability and 100% determinism
  • Runtime shape built for quick local loading inside app-controlled storage
  • Local text chat completions on the customer device
  • OpenAI-compatible Chat Completions endpoint
  • Swift client helpers for app integration
  • Buffered and streamed responses

  Out of scope

  • Cloud inference of any kind
  • Full OpenAI API or Responses API parity
  • Tool / function calling, images, audio, embeddings
  • Hosted retrieval or stored completions
  • Automatic model download or remote registry
  • Unlimited offline entitlement
  • Replacement for legal review of model distribution

A request, a stream, a local response.

Click Run to preview the streaming shape. The response is illustrative; production tokens are generated by the local runtime on the device.

[Interactive demo — POST 127.0.0.1:11436/v1/chat/completions · user prompt: "Explain the difference between buffered and streamed responses in two sentences." · model gemma-4-e2b-it · stream true · latency (TTFT) and token counters start at 0.]

No prompts leave the device. Not in the demo. Not in production.

Prompts, completions, local documents, model artefacts, and embeddings are not sent to the licence service. The runtime binds to 127.0.0.1 by default, and only the licence channel reaches the network.

bind      127.0.0.1:11436
auth      local bearer · optional
cors      configurable
egress    licence channel only
content   not sent server-side
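When the optional bearer auth is on, the gate reduces to comparing the Authorization header against the locally configured token. A minimal sketch of that check in Python, under stated assumptions: the token value and header dict here are stand-ins, not GemmaKit's actual internals, and a constant-time comparison is used as good practice rather than a documented implementation detail:

```python
import hmac

LOCAL_TOKEN = "local-dev-token"  # stand-in for the locally configured bearer token

def authorized(headers):
    """Return True when the Authorization header carries the local bearer token."""
    value = headers.get("Authorization", "")
    if not value.startswith("Bearer "):
        return False
    presented = value[len("Bearer "):]
    # constant-time comparison avoids leaking the token via timing differences
    return hmac.compare_digest(presented, LOCAL_TOKEN)

ok = authorized({"Authorization": "Bearer local-dev-token"})
bad = authorized({"Authorization": "Bearer wrong"})
```

Because the server binds to 127.0.0.1, the token mainly guards against other local processes and misconfigured webviews, not remote callers.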

Every piece of the runtime is replaceable, inspectable, and on the device.

Six components. One binary. A Swift package. The rest is your app.

01 / runtime

Local Gemma server

A converted Gemma text model packaged behind a local HTTP server. Binds to 127.0.0.1 by default and never opens external ports.

02 / api

OpenAI-compatible subset

The Chat Completions endpoint, with the same JSON request and SSE response shape your existing client already speaks. No Responses API.

03 / client

Swift helpers

A small Swift surface for issuing chat completions, handling streamed deltas, and managing local bearer tokens from inside an Apple-platform app.

04 / auth

Bearer + CORS

Optional local bearer-token enforcement and configurable CORS for embedding the runtime in a webview-bearing app or a sibling local web tool.

05 / licence

Pro org licensing

Pro organisation keys, optional app-id binding, signed local licence certificates, and active-device reporting — without sending prompts or completions.

06 / scope

Just text

Text in, text out. No images, audio, embeddings, tool calls, retrieval, or stored completions — those are intentionally outside the boundary.

Monthly platform fee plus active-device usage.

A device counts once per billing period after activation, licence refresh, or a gated generation call. Repeated requests do not become per-token or per-request billing.
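The "counts once per billing period" rule is effectively a set union over (device, period) pairs: any qualifying event adds the pair, and repeats are no-ops. A small sketch of that counting rule in Python, with field names borrowed from the sample report (the real meter lives in the licence service; this only illustrates the arithmetic):

```python
def active_devices(events):
    """Count each device once per billing period, regardless of event volume."""
    seen = set()
    for e in events:
        # activation, licence refresh, or a gated generation call all qualify
        seen.add((e["device_id"], e["period_start"]))
    return len(seen)

events = [
    {"device_id": "dev_a", "period_start": "2026-05-01T00:00:00Z", "event": "activation"},
    {"device_id": "dev_a", "period_start": "2026-05-01T00:00:00Z", "event": "gated_generation"},
    {"device_id": "dev_b", "period_start": "2026-05-01T00:00:00Z", "event": "licence_refresh"},
]
count = active_devices(events)  # dev_a counted once despite two events
```

So a device that makes one gated call and a device that makes ten thousand cost the same in a given period.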

Pro base + device meter
Configured in Stripe · billed monthly
  • Pro organisation keys with optional app-id binding
  • Signed local licence certificates
  • Active-device reporting · no prompt content
  • Device and certificate revocation paths
sample
{
  "org_id":        "org_4e2fa1",
  "app_id":        "app.example.studio",
  "device_id":     "dev_opaque_installation_id",
  "event":         "gated_generation",
  "period_start":  "2026-05-01T00:00:00Z",
  "prompt_content": null,
  "completion_content": null,
  "certificate_id": "lic_9c1a..."
}
— what gets reported, in full

One package. Drop it next to your app.