March 2025

Serverless Inference

Our fully managed, pay-per-request runtime puts a pool of GPUs behind a single OpenAI-compatible endpoint. Instead of capacity planning, container images and infra dashboards, you call https://inference.api.nscale.com/v1/* and get deterministic, low-latency responses from today’s best open-source models, all billed per token and delivered from data-sovereign, 100% renewable data centres.

Features

  • OpenAI-compatible endpoints. Drop-in support for Llama, Qwen, DeepSeek and other leading models makes migration a copy-paste job (see the sketch after this list)
  • Pay-as-you-go billing. Pricing is per 1 million tokens (input plus output) for Chat, Multimodal, Language and Code models; Image model pricing is based on image size and step count
  • 80% lower cost & 100% renewable. Our vertically-integrated stack slashes TCO versus hyperscalers while guaranteeing data privacy: requests are never logged or reused
  • $5 free credits to get started. Every new account includes starter credits so you can ship to production in minutes
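
Because the endpoints are OpenAI-compatible, the official openai Python client works once you change the base URL and key. A minimal sketch, assuming the openai package (v1+) and the NSCALE_KEY environment variable used in the quick start below:

import os
from openai import OpenAI

# Point the stock OpenAI client at the Nscale endpoint.
client = OpenAI(
    base_url="https://inference.api.nscale.com/v1",
    api_key=os.environ["NSCALE_KEY"],
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-Instruct",
    messages=[{"role": "user", "content": "Hello world"}],
)
print(response.choices[0].message.content)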

Under the hood

  • API surface. Drop-in equivalents for GET /models, POST /chat/completions and POST /images, with optional stream: true for SSE (text/event-stream). Why it matters: migrate from OpenAI by changing only the base URL and key.
  • Model library. The launch set covers Meta Llama-4 Scout 17B, Qwen-3 235B, Mixtral-8×22B, DeepSeek-R1 distills, SD-XL 1.0 and more (text, code, vision). Why it matters: teams can A/B models or mix modalities without provisioning extra infra.
  • Elastic runtime. “Zero rate limits, no cold starts.” Traffic is sharded over thousands of MI300X, MI250X and H100 GPUs, spun up on demand by our orchestration layer. Why it matters: bursty workloads stay under 200 ms tail latency without over-allocating GPUs.
  • Cost model. Tokens in, tokens out, billed per 1M tokens (see the sketch after this list); images billed per megapixel. Every account starts with $5 of free credit. Why it matters: fine-grained, deterministic spend that is easy to embed in metered SaaS.
  • Security / privacy. End-to-end TLS, org-scoped API keys and full tenant isolation; we never log or train on user prompts or outputs. Why it matters: meets GDPR, HIPAA and most vendor-assessment checklists out of the box.
  • Sustainability. All compute runs in hydro-powered facilities, and the vertically-integrated stack is 80% cheaper per token than hyperscalers. Why it matters: lower carbon (and budget) emissions per request.
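
Per-1M-token billing makes spend a pure function of token counts. A back-of-envelope sketch of that arithmetic; the prices below are hypothetical placeholders, not published rates:

def estimate_cost(input_tokens: int, output_tokens: int,
                  in_price_per_1m: float, out_price_per_1m: float) -> float:
    """Cost in dollars for one request under per-1M-token billing."""
    return (input_tokens * in_price_per_1m +
            output_tokens * out_price_per_1m) / 1_000_000

# Hypothetical example: 1,200 prompt tokens and 350 completion tokens
# at $0.10 in / $0.30 out per 1M tokens.
print(estimate_cost(1_200, 350, 0.10, 0.30))  # 0.000225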

Quick start

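# Streamed chat completion; -N keeps curl from buffering the SSE stream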
curl -N -X POST \
  https://inference.api.nscale.com/v1/chat/completions \
  -H "Authorization: Bearer $NSCALE_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-4-Scout-17B-Instruct",
        "messages": [{"role":"user","content":"Hello world"}],
        "stream": true
      }'
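
The same streamed call from Python, again assuming the official openai client: with stream=True the SDK yields the SSE chunks as they arrive, so you can print tokens incrementally.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.api.nscale.com/v1",
    api_key=os.environ["NSCALE_KEY"],
)

stream = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-Instruct",
    messages=[{"role": "user", "content": "Hello world"}],
    stream=True,
)
# Each chunk carries an incremental delta; print tokens as they stream in.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()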