This guide will walk you through integrating a chat model into your application using Nscale’s API. With our serverless architecture, you can focus on building your application without worrying about infrastructure management.
Let’s walk through an example where we summarise a blog post into 100 words.

Request structure

Each request to the Nscale Chat Completions API endpoint should include the following:
Headers:
"Authorization": "Bearer <SERVICE-TOKEN>"
"Content-Type": "application/json"
Payload:
"model": "<model id e.g., meta-llama/Llama-3.1-8B-Instruct>"
"messages": "<array of messages to send to the model>"
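Putting the headers and payload together, a request body might look like the following sketch. The service token, model id, and message contents are placeholders, and the JSON is built with the standard library only:

```python
import json

# Placeholder service token -- in practice, read this from a secure source
# such as an environment variable.
headers = {
    "Authorization": "Bearer <SERVICE-TOKEN>",
    "Content-Type": "application/json",
}

# Request body: a model id plus an array of chat messages.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"},
    ],
}

# Serialise the payload exactly as it would be sent over the wire.
print(json.dumps(payload, indent=2))
```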
Example use case: Summarise a blog post
```python
import os
import openai

nscale_service_token = os.getenv("NSCALE_SERVICE_TOKEN")
nscale_base_url = "https://inference.api.nscale.com/v1"

client = openai.OpenAI(
    api_key=nscale_service_token,
    base_url=nscale_base_url
)

blog_text = "Serverless inference simplifies access to AI models..."

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "Provide a summary of the blog post in 100 words."},
        {"role": "user", "content": blog_text}
    ]
)

print(response.choices[0].message.content)
```
Here are some examples of using the CLI for chat inference:
```shell
# Generate a single response
nscale chat "What is machine learning?" -t $NSCALE_SERVICE_TOKEN -m meta-llama/Llama-3.1-8B-Instruct

# Start an interactive chat session
nscale chat -i -t $NSCALE_SERVICE_TOKEN -m meta-llama/Llama-3.1-8B-Instruct

# Use a custom system message
nscale chat --message "system:You are a creative storyteller" -t $NSCALE_SERVICE_TOKEN -m meta-llama/Llama-3.1-8B-Instruct

# Get usage statistics in JSON format
nscale chat "Explain quantum computing" --stats -t $NSCALE_SERVICE_TOKEN -m meta-llama/Llama-3.1-8B-Instruct

# Limit the response length
nscale chat "Write a story" --max-tokens 100 -t $NSCALE_SERVICE_TOKEN -m meta-llama/Llama-3.1-8B-Instruct

# Supply chat history
nscale chat -t $NSCALE_SERVICE_TOKEN -m meta-llama/Llama-3.1-8B-Instruct \
  --message "system:You are a helpful assistant" \
  --message "user:What is your name?"

# Start interactive mode with chat history
nscale chat -i -t $NSCALE_SERVICE_TOKEN -m meta-llama/Llama-3.1-8B-Instruct \
  --message "system:You are a helpful assistant" \
  --message "user:Hello" \
  --message "assistant:Hello! How can I help?"

# Use service token from environment variable
export NSCALE_SERVICE_TOKEN=your_service_token
nscale chat "What is machine learning?" -m meta-llama/Llama-3.1-8B-Instruct
```
Error responses follow a structured format. For example, a request made without sufficient credit returns:

```json
{
  "error": {
    "code": "TOO_MANY_REQUESTS",
    "message": "You have insufficient credit to run this request",
    "param": null,
    "error_type": "INSUFFICIENT_CREDIT"
  }
}
```
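When calling the API over raw HTTP, you can branch on the `error_type` field of the error body. A minimal sketch; the helper name and the recovery advice strings are illustrative, not part of the API:

```python
import json

def describe_error(response_body: str) -> str:
    """Return a human-readable description of an error payload
    shaped like the example above."""
    error = json.loads(response_body)["error"]
    if error.get("error_type") == "INSUFFICIENT_CREDIT":
        return "Top up your account credit before retrying."
    # Fall back to the message supplied by the API.
    return error.get("message", "Unknown error")

# Example error body, matching the response shown above.
body = (
    '{"error": {"code": "TOO_MANY_REQUESTS", '
    '"message": "You have insufficient credit to run this request", '
    '"param": null, "error_type": "INSUFFICIENT_CREDIT"}}'
)
print(describe_error(body))  # Top up your account credit before retrying.
```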
For the full list of error codes and handling guidance, see the error code page.

By following this guide, you’ll be able to integrate chat models into your application using Nscale’s serverless inference service.