July 2025
Serverless Fine-tuning

Features
- Two-step workflow. Pick any supported base model (Llama 3, Mistral 7B, DeepSeek, Qwen and more) and launch a job with your dataset – no cluster sizing, no Dockerfiles
- LoRA-powered efficiency. Low-Rank Adaptation (LoRA) is the default fine-tuning method, cutting GPU hours and cost compared with full fine-tuning
- Live metrics & easy monitoring. Poll one endpoint to track train_loss, eval_loss, and perplexity
- Export or deploy instantly. One-click push to Hugging Face or direct download of a ready-to-serve artefact
- Serverless pricing. $2 minimum per job, billed by processed tokens; every new account still gets $5 free credit to experiment
Quick start
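The sketch below walks the two-step workflow end to end in Python: launch a job, then poll one endpoint for live metrics. The base URL, the /fine-tuning/jobs paths, the payload fields, and the response shape are illustrative assumptions, not the documented API; only the metric names (train_loss, eval_loss, perplexity) come from the feature list above. Check the API reference for the real schema.

```python
# Minimal quick-start sketch. Endpoint paths, payload fields, and response
# shapes are assumptions for illustration, not the documented API.
import os
import time

import requests

BASE_URL = "https://api.nscale.com/v1"          # assumed base URL
API_KEY = os.environ["NSCALE_API_KEY"]          # org-scoped API key
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Step 1: launch a fine-tuning job against any supported base model.
# LoRA is the default, so no extra configuration is needed for it.
job = requests.post(
    f"{BASE_URL}/fine-tuning/jobs",
    headers=HEADERS,
    json={
        "base_model": "meta-llama/Llama-3-8B",  # illustrative model slug
        "dataset_id": "ds_123",                 # a previously uploaded dataset
    },
    timeout=30,
).json()

# Step 2: poll one endpoint until the job finishes, reading live metrics.
while True:
    status = requests.get(
        f"{BASE_URL}/fine-tuning/jobs/{job['id']}",
        headers=HEADERS,
        timeout=30,
    ).json()
    metrics = status.get("metrics", {})
    print(
        status["status"],
        metrics.get("train_loss"),
        metrics.get("eval_loss"),
        metrics.get("perplexity"),
    )
    if status["status"] in ("succeeded", "failed"):
        break
    time.sleep(30)
```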
March 2025
Serverless Inference

Point your existing OpenAI-compatible client at https://inference.api.nscale.com/v1/* and get deterministic, low-latency responses from today's best open-source models, all billed per token and delivered from data-sovereign, 100% renewable data-centres.

Features
- OpenAI-compatible endpoints. Drop-in support for Llama, Qwen, DeepSeek and other leading models makes migration a copy-paste job (see the sketch after this list)
- Pay-as-you-go billing. Chat, Multimodal, Language and Code models are priced per 1 million tokens, counting both input and output tokens; image model pricing is based on image size and step count
- 80% lower cost & 100% renewable. Our vertically-integrated stack slashes TCO versus hyperscalers while guaranteeing data privacy: requests are never logged or reused
- $5 free credits to get started. Every new account includes starter credits so you can ship to production in minutes
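Because the endpoints are OpenAI-compatible, the copy-paste migration looks like this with the official openai Python SDK: change only the base URL and key. The model slug below is illustrative; list the real ones via the models endpoint.

```python
from openai import OpenAI

# Same SDK you already use; only the base URL and key change.
client = OpenAI(
    base_url="https://inference.api.nscale.com/v1",
    api_key="YOUR_NSCALE_API_KEY",
)

# Model slug is illustrative; fetch available models with client.models.list().
resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B",
    messages=[{"role": "user", "content": "Say hello in three languages."}],
)
print(resp.choices[0].message.content)
```

Here `client.models.list()` maps onto the GET /models endpoint described under the hood below.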
Under the hood
| Area | What it looks like | Why it matters |
|---|---|---|
| API surface | Drop-in equivalents for `GET /models`, `POST /chat/completions`, `POST /images`, with optional `stream: true` for SSE (`text/event-stream`); see the streaming sketch below the table. | Migrate from OpenAI by changing only the base URL and key. |
| Model library | Launch set covers Meta Llama-4 Scout 17B, Qwen-3 235B, Mixtral-8×22B, DeepSeek-R1 distills, SD-XL 1.0 and more (text, code, vision). | Lets teams A/B models or mix modalities without provisioning extra infra. |
| Elastic runtime | "Zero rate limits, no cold starts." Traffic is sharded over thousands of MI300X/MI250X/H100 GPUs, spun up on-demand by our orchestration layer. | Bursty workloads stay under 200 ms tail latency without you over-allocating GPUs. |
| Cost model | Tokens in, tokens out: billed per 1M tokens; images billed per megapixel. Every account starts with $5 free credit. | Fine-grained, deterministic spend; easy to embed in metered SaaS. |
| Security / privacy | End-to-end TLS, org-scoped API keys, full tenant isolation; we never log or train on user prompts or outputs. | Meets GDPR, HIPAA and most vendor-assessment checklists out of the box. |
| Sustainability | All compute runs in hydro-powered facilities; the vertical stack is 80% cheaper per token than hyperscalers. | Lower carbon (and budget) footprint per request. |
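The `stream: true` option in the API surface row maps directly onto the SDK's streaming mode, which consumes the SSE (`text/event-stream`) response and yields incremental deltas. A minimal sketch, reusing the illustrative model slug from the earlier example:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.api.nscale.com/v1",
    api_key="YOUR_NSCALE_API_KEY",
)

# stream=True sends stream: true, so tokens arrive as server-sent events
# and the SDK yields them as incremental chunks.
stream = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B",  # illustrative slug; see GET /models
    messages=[{"role": "user", "content": "Stream a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    # Guard against keep-alive or metadata chunks with no text delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```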