
SignalVision: Browser ML via ONNX int8 Quantization

2026.07.04 · ONNX · quantization · PyTorch · computer vision · edge AI

SignalVision takes an existing PyTorch sign-language recognition model and adds deployment engineering: ONNX export, int8 quantization, and in-browser inference via onnxruntime-web. The pitch is not "I trained a model" — it's "I take existing models and make them run efficiently in constrained environments."

At a glance

  • Problem: A trained CNN is not a deployable product. Browser inference demands small models, predictable latency, and no server round-trip.
  • Approach: Export to ONNX fp32, compare dynamic vs static int8 quantization, ship static int8 with calibrated activation ranges, serve via onnxruntime-web with WASM caching.
  • Stack: PyTorch · ONNX · onnxruntime-web · Next.js 14 · TypeScript.
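  • Headline result: The static int8 model is 3.9× smaller (1.51 MB → 0.39 MB) and has 3.4× lower p50 latency (0.287 ms → 0.085 ms) than the PyTorch fp32 baseline, at a cost of 0.34 pp accuracy. Against the fp32 ONNX export the latency gain is smaller, about 1.2× (0.105 ms → 0.085 ms).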

Architecture & logic

The deployment path matters more than the architecture of the CNN itself — a small conv network trained on sign-language frames. The engineering work is the export graph, quantization strategy, and browser runtime.

Why ONNX: Single graph representation that onnxruntime can optimize across CPU, WASM, and (eventually) WebGPU backends without tying deployment to PyTorch's Python runtime.

Why static int8 over dynamic: Dynamic int8 only quantizes MatMul/Gemm ops; Conv layers stay fp32 with quant/dequant overhead at every inference. On this conv-heavy model, dynamic int8 was slower than fp32 ONNX (0.490 ms vs 0.105 ms p50). Static int8 (QDQ format, calibrated on 512 training samples) lets Conv layers run in int8, which is where both the speed and size wins come from, at the cost of a small but measurable accuracy drop (−0.34 pp).
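To make "calibrated activation ranges" concrete, here is a minimal pure-Python sketch of min/max calibration followed by affine int8 quantization. It illustrates the general technique only; it is not the project's code and not onnxruntime's internal implementation. The names `QuantParams`, `calibrate_minmax`, `quantize`, and `dequantize` are hypothetical, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class QuantParams:
    """Fixed parameters baked into a statically quantized tensor (hypothetical type)."""
    scale: float
    zero_point: int


def calibrate_minmax(batches: Iterable[List[float]],
                     qmin: int = -128, qmax: int = 127) -> QuantParams:
    """Derive scale and zero-point from activations observed on calibration data."""
    # Start the range at 0 so that 0.0 is always exactly representable.
    lo, hi = 0.0, 0.0
    for batch in batches:
        lo = min(lo, min(batch))
        hi = max(hi, max(batch))
    # Fall back to 1.0 if every observed value was zero.
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    return QuantParams(scale, max(qmin, min(qmax, zero_point)))


def quantize(x: float, p: QuantParams, qmin: int = -128, qmax: int = 127) -> int:
    """Map a float to int8, saturating anything outside the calibrated range."""
    q = round(x / p.scale) + p.zero_point
    return max(qmin, min(qmax, q))


def dequantize(q: int, p: QuantParams) -> float:
    """Map an int8 value back to its approximate float."""
    return (q - p.zero_point) * p.scale


if __name__ == "__main__":
    # Made-up activations standing in for the calibration samples.
    params = calibrate_minmax([[0.0, 1.5, 3.0], [-1.0, 2.0, 4.1]])
    for value in (2.0, 10.0):
        q = quantize(value, params)
        print(value, "->", q, "->", dequantize(q, params))
```

Values inside the calibrated range round-trip with an error of at most half a quantization step; values outside it saturate at the int8 limits. Both effects are plausible sources of the accuracy drop, which is why the calibration set should resemble real inputs.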

Benchmarks & results

All numbers measured on CPU (x86_64, 4 cores), batch size 1, 200 timed runs after warmup. Accuracy on the original test split (7172 samples):

| Model | Size | Latency p50 | Accuracy |
| --- | ---: | ---: | ---: |
| PyTorch baseline (fp32) | 1.51 MB | 0.287 ms | 95.59% |
| ONNX export (fp32) | 1.51 MB | 0.105 ms | 95.59% |
| ONNX dynamic int8 | 0.39 MB | 0.490 ms | 95.64% |
| ONNX static int8 (shipped) | 0.39 MB | 0.085 ms | 95.25% |

The accuracy cost is not zero, and that's expected: 95.59% → 95.25% (−0.34 pp, 25 additional misclassifications out of 7172). A quantized model showing zero drop would be a signal to re-check the eval, not a win.
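The headline figures can be re-derived from the table with a few lines of arithmetic. This is a sanity-check sketch: it assumes the reported accuracies are rounded to two decimals, so the correct-prediction counts are recovered by rounding, and the variable names are illustrative.

```python
TEST_SAMPLES = 7172  # size of the test split quoted above


def correct_count(accuracy_pct: float, n: int = TEST_SAMPLES) -> int:
    """Recover the number of correct predictions from a rounded accuracy."""
    return round(accuracy_pct / 100 * n)


fp32_correct = correct_count(95.59)   # PyTorch / ONNX fp32
int8_correct = correct_count(95.25)   # ONNX static int8
extra_errors = fp32_correct - int8_correct

size_ratio = 1.51 / 0.39        # fp32 size / static int8 size, in MB
latency_ratio = 0.287 / 0.085   # PyTorch p50 / static int8 p50, in ms

print(f"additional misclassifications: {extra_errors}")
print(f"size reduction: {size_ratio:.1f}x, p50 speedup: {latency_ratio:.1f}x")
```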

Caveat on absolute latency: this model is small, so per-frame latency is sub-millisecond on a server-class CPU. The relative deltas are the transferable result; the browser demo reports live per-frame latency on whatever device is viewing it.
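The measurement protocol described in this section (warmup, then 200 timed single-sample runs, report p50) can be sketched with the standard library alone. This is an assumed harness, not the project's benchmark script: `benchmark` and `run_once` are hypothetical names, the warmup count of 20 is a placeholder since the original does not state one, and p95 is an extra statistic the table does not report.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(run_once: Callable[[], object],
              warmup: int = 20, runs: int = 200) -> Dict[str, float]:
    """Time a zero-argument inference callable and return latency stats in ms."""
    # Warmup runs are discarded so one-time costs (allocation, caches) do not skew results.
    for _ in range(warmup):
        run_once()

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    return {
        "p50_ms": statistics.median(samples_ms),
        # Nearest-rank style p95 on the sorted samples.
        "p95_ms": samples_ms[min(runs - 1, int(0.95 * runs))],
    }


if __name__ == "__main__":
    # Stand-in workload; in practice run_once would wrap one batch-size-1 inference call.
    stats = benchmark(lambda: sum(i * i for i in range(1000)))
    print(stats)
```

Passing the same harness a closure per model variant keeps the comparison fair: identical warmup, run count, and timer for every row.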

What I'd do differently

WebGPU backend. WASM is the portable default; WebGPU would be the next step for devices that support it, with WASM fallback.

Per-device calibration. Static int8 ranges were calibrated on a server CPU. Browser WASM may behave differently — I would capture a small calibration set from target devices.

Model versioning in the client. Production would need cache-busted model URLs, semver on the ONNX artifact, and graceful degradation when a new model fails to load.

Stack & links

Related portfolio work: LeadSignal (rules + Claude enrichment for RevOps), DocSignal (hybrid RAG retrieval).