Continuous Development

Recent Updates

We ship improvements weekly. Here's what we've been building.

Mar 31

Unified Activity Feed

Console activity is now a single workload + inference stream instead of a billing-only feed, so operators can see request flow, infra events, and model activity in one timeline.

Mar 30

Org Residency + Multi-Provider Routing

Routing now combines org residency controls with provider-aware selection and managed-first fallback. This landed alongside expanded proprietary route support and safer adapter error handling in streaming paths.

Mar 30

37-Model Self-Hosted Catalogue

The self-hosted catalogue expanded to 37 models with cleaner canonical naming and undeployed model filtering, giving teams a larger deployable set without exposing unfinished inventory.

Mar 30

Preview/Staging Infrastructure Hardening

Preview environments now use wildcard TLS, improved metadata-sidecar startup ordering, and automated cluster bootstrap. Staging control plane migration and cost optimization updates are also in place.

Mar 29

APAC Coverage Expansion

Proprietary routing now includes Thailand and Malaysia regions, with geo-nearest fallback-chain fixes so requests remain residency-aware while still failing over safely across APAC.
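The geo-nearest chain described above can be sketched as a simple ordering step: filter candidate regions by residency policy, then sort the survivors by distance from the client. This is a minimal illustration; the flat distance map and function name are assumptions, and the real chain logic is more involved.

```python
def fallback_chain(distances: dict[str, float], allowed: set[str]) -> list[str]:
    """Order candidate regions nearest-first, keeping only
    residency-allowed ones. `distances` maps region -> distance
    from the client (illustrative shape, not the actual schema)."""
    candidates = [r for r in distances if r in allowed]
    return sorted(candidates, key=lambda r: distances[r])
```

A request from Bangkok with all three APAC regions permitted would try Thailand first, then the next-nearest region, and so on down the chain.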

Mar 29

Model UX Refresh

Model catalogue and detail pages were overhauled with improved sorting, copyable IDs, better quickstart examples, clearer availability indicators, and stronger mobile responsiveness.

Mar 28

Data-Driven Provider Routing

Provider routing moved to Redis-backed data configuration with per-region adapter seeding, enabling faster model/provider updates without code deploys and cleaner operations runbooks.
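The data-driven shape of this change can be sketched as follows, with a plain dict standing in for Redis. The key naming convention and config schema here are assumptions for illustration; the point is that updating a provider list is a data write, not a code deploy.

```python
import json

# In-memory stand-in for Redis; a real deployment would use a
# Redis client with the same get/set pattern.
store: dict[str, str] = {}

def seed_region(region: str, providers: list[dict]) -> None:
    """Seed per-region adapter config as a deploy-free data write."""
    store[f"routing:{region}"] = json.dumps(providers)

def providers_for(region: str) -> list[dict]:
    """Read the routing config the router consults per request."""
    return json.loads(store.get(f"routing:{region}", "[]"))

seed_region("ap-southeast-1", [
    {"provider": "managed", "priority": 0},
    {"provider": "anthropic", "priority": 1},
])
```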

Mar 25

Admin Console

Operator dashboard now live: real-time health, GPU agent visibility, cross-org deployment tracking, and live inference metrics. Role-gated to admin users.

Mar 24

Auth Hardening

Fixed a session endpoint that was inadvertently skipping JWT validation in non-local environments. User roles now resolved from the database, not JWT claims.

Mar 23

Phase 2 Go-Live

Cleared final production blockers: context-length formatting fixed for ≥1M token models, image tags pinned across all Helm charts, orchestrator cross-cluster config wired up.

Mar 19

18-Model Managed Catalogue

Router catalogue finalised at 18 models: Llama 4 Scout, Qwen 3.5 32B, BGE-M3 embeddings, Llama 3.2 Vision, and more. Every model has a dedicated GPU tier, pricing spec, and latency grade.

Mar 19

Router Fee Transparency

The 5.5% routing fee is now a first-class field in the pricing API. Two-debit reconciliation separates user charges from internal margin, giving a clean audit trail from day one.
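The two-debit split can be illustrated with a small sketch. The assumption here (that the 5.5% fee is applied on top of provider cost, and that the margin is recorded as a separate debit) is illustrative; only the fee percentage and the two-debit model come from the changelog.

```python
ROUTING_FEE = 0.055  # the 5.5% routing fee surfaced in the pricing API

def reconcile(provider_cost: float) -> dict:
    """Split one request's cost into the user-facing charge and
    the internal-margin debit, so the two ledger entries always
    reconcile against each other."""
    fee = round(provider_cost * ROUTING_FEE, 6)
    return {
        "user_debit": round(provider_cost + fee, 6),
        "margin_debit": fee,
    }
```

A $100.00 provider cost yields a $105.50 user debit and a $5.50 margin debit; summing margin debits over any period gives the audit trail directly.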

Mar 18

vLLM v0.17.1 + Prefix Caching

Upgraded from v0.6.6 to v0.17.1 with KV-cache prefix caching enabled by default. Agents and chat workloads with repeated system prompts see meaningful TTFT improvements.

Mar 18

Gemini on Router

Google Gemini is now routable via the APAC Router alongside Claude and open-weight models. Provider normalisation is handled in the gateway: same API, no client changes needed.
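"No client changes" means the request shape stays OpenAI-compatible regardless of provider; only the model ID changes. A minimal sketch, where the model IDs are placeholders rather than the catalogue's actual identifiers:

```python
def chat_payload(model: str, prompt: str, stream: bool = False) -> dict:
    """Build the one OpenAI-compatible request body used for every
    provider; the gateway remaps model IDs and normalises tool use
    and streaming behind it."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

gemini_req = chat_payload("gemini-pro", "Hello")       # proprietary route
llama_req = chat_payload("llama-3.2-vision", "Hello")  # open-weight route
```

Only the `model` field differs between the two requests; everything else, including streaming and tool use, is the same client-side.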

Mar 18

HTTP/2 Router Pooling

Go router now maintains persistent HTTP/2 connection pools to inference backends. Eliminates per-request TLS handshakes; p50 router overhead drops from ~300ms to ~12ms.

Mar 17

Cross-Cluster Orchestration

Orchestrator can now manage vLLM pods on a separate data-plane cluster from a central control plane. Groundwork for active-active multi-region without per-region control planes.

Mar 16

Phase 2 Platform Launch

The APAC Model Router and unified inference API are live. One OpenAI-compatible endpoint for proprietary and open models, 18 managed open models, GPU workloads, and embeddings. Streaming, tool use, and vision supported.

Mar 16

TTFB Metering & Geo Attribution

Time-to-first-byte and client geo-attribution now captured on every inference request. Latency percentiles visible in the console per model and per city, graded from Excellent (<50ms) to Poor (>300ms).
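The grading logic reduces to a threshold map. Only the two published endpoints of the scale (<50ms Excellent, >300ms Poor) come from the changelog; the intermediate band names and cutoffs below are illustrative assumptions.

```python
def grade_latency(ttfb_ms: float) -> str:
    """Map a TTFB measurement to a console grade."""
    if ttfb_ms < 50:
        return "Excellent"
    if ttfb_ms < 150:
        return "Good"   # assumed intermediate band
    if ttfb_ms <= 300:
        return "Fair"   # assumed intermediate band
    return "Poor"
```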

Mar 15

Team Accounts

Organisations are now a first-class concept. Invite teammates, share API keys, and manage billing under a single account. New members are auto-assigned to your org on first login.

Mar 13

v1.0.0 Production Baseline

Clean-slate production baseline cut and deployed. Backend, router, orchestrator, and console all on stable release tags. Zero-downtime rolling updates enabled; no more maintenance windows for deploys.

Mar 10

Model Weight Loader

vLLM pods now pull model weights from GCS on startup via a dedicated init container; Workload Identity means no credentials in the pod spec. 8B models cold-start in under 3 minutes; 70B in ~9.

Mar 8

Kubernetes Actuator

Orchestrator now creates and destroys vLLM Deployment + Service pairs directly via the K8s API. Pod lifecycle is fully managed: startup → health probing → scale-to-zero → wake-on-request.

Mar 6

Circuit Breaker per Backend

Go router now tracks per-backend failure rates and opens a circuit on sustained errors. Unhealthy backends are excluded from routing decisions without manual intervention.
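The breaker pattern described above can be sketched in a few lines. This is a minimal in-memory version; the threshold, cooldown, and consecutive-failure policy below are illustrative assumptions, not the router's actual configuration (the production implementation is in Go and tracks failure rates).

```python
import time

class CircuitBreaker:
    """Per-backend breaker: opens after `threshold` consecutive
    failures, then permits a probe once `cooldown` seconds pass."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: float | None = None

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()  # open the circuit

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow one probe after the cooldown elapses.
        return time.monotonic() - self.opened_at >= self.cooldown
```

The router consults `allow_request()` during backend selection, so an open circuit simply drops that backend out of routing decisions until it recovers.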

Mar 3

Go Router Scaffolding

Inference router rebuilt in Go for lower overhead and tighter connection pooling. Model registry lives in Redis with a gRPC + HTTP/JSON bridge. Stateless, horizontally scalable from day one.

Feb 28

Latency Probe Network

Distributed probe workers now measure p50/p95/p99 TTFT from multiple APAC cities and write results back via a service-token-authenticated ingest endpoint. Console dashboards pull live data.

Feb 26

API Key Management

Full API key lifecycle in the console: create, label, set rate limits, and revoke programmatic keys. Per-key sliding-window rate limiting enforced at the edge before any inference work starts.

Feb 24

Async Metering Pipeline

Billing events now flow through NATS before landing in Postgres. Decouples inference latency from billing writes; high-throughput bursts no longer slow requests.
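The decoupling shape is the classic producer/consumer split: the request handler enqueues a metering event and returns immediately, while a background worker drains the queue into storage. A minimal stdlib sketch, with a local queue standing in for NATS and a list standing in for Postgres:

```python
import queue
import threading

events: queue.Queue = queue.Queue()
written: list[dict] = []

def record_usage(event: dict) -> None:
    """Called on the hot path: non-blocking from the handler's view."""
    events.put(event)

def billing_worker() -> None:
    """Drains events off the request path; a None sentinel stops it."""
    while True:
        ev = events.get()
        if ev is None:
            break
        written.append(ev)  # stands in for the Postgres insert
```

Because the handler only ever pays the cost of an enqueue, a burst of inference traffic backs up the queue rather than the request latency.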

Feb 21

Provider Gateway Alpha

Anthropic and OpenAI adapters live in a new provider gateway service. Normalises tool use, streaming, and model-ID remapping across providers: one integration pattern for all of them.

Feb 19

Wallet Pre-flight Checks

Inference requests now check wallet balance before routing. Zero-balance requests fail fast with a clear 402 rather than completing and leaving a negative balance.
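The pre-flight check itself is simple. The changelog only states that zero-balance requests fail fast with a 402; the estimated-cost comparison below is an added assumption for illustration.

```python
class InsufficientBalance(Exception):
    """Raised before routing; maps to HTTP 402 Payment Required."""
    status_code = 402

def preflight(balance: float, estimated_cost: float) -> None:
    """Reject a request before any routing or inference work if the
    wallet can't cover it, so balances never go negative."""
    if balance <= 0 or balance < estimated_cost:
        raise InsufficientBalance(
            f"wallet balance {balance:.4f} below estimated cost {estimated_cost:.4f}"
        )
```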

Feb 17

Tensor Parallel on 70B Models

Building on our vLLM production milestone earlier this month, tensor parallel inference is now stable across multi-GPU nodes. Llama 70B and Qwen 32B deploy cleanly across 2× A100s.

Feb 14

Embedding Models

BGE Large, BGE-M3, and multilingual E5 now available as managed embedding models. Same OpenAI-compatible `/v1/embeddings` endpoint, drop-in replacement for teams moving off US-based providers.

Feb 12

Usage Analytics

Token counts, cost breakdown, and latency percentiles now in the console. Filter by model, date range, or API key. First step toward per-project budgets.

Feb 10

Rate Limiting Hardened

Per-API-key token windows tightened and now enforced at the edge. Sliding-window algorithm replaces fixed buckets, burst-friendly but protects against runaway loops.
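The difference from fixed buckets is that a sliding window counts requests over a trailing interval rather than resetting on a boundary, so a burst at the end of one bucket can't double up with a burst at the start of the next. A minimal in-memory sketch; the real enforcement is presumably distributed at the edge, which this ignores.

```python
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per key in any trailing
    `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.hits: dict[str, deque] = {}

    def allow(self, key: str, now: float) -> bool:
        q = self.hits.setdefault(key, deque())
        # Drop hits that have aged out of the trailing window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False  # over limit: reject without recording
        q.append(now)
        return True
```

With a limit of 2 per 10 seconds, a third request inside the window is rejected, but it is allowed again once the oldest hit ages out.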