Prerequisites

  1. API Key: Sign up on the Nscale platform to get your API key.

  2. Model Selection: Choose a chat model from Nscale’s library.

    • Example: Llama 3.1 8B Instruct (meta-llama/Llama-3.1-8B-Instruct)

Step 1: Set up your environment

Before making requests, ensure you have the necessary tools installed for your language of choice:

For Python: Install the openai library

pip install openai

For TypeScript: Install the openai library

npm install openai

For cURL: Ensure cURL is installed on your system (it’s usually pre-installed on most Unix-based systems).
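
Once your tools are installed, a quick sanity check in Python confirms the library is importable and your key is set (assuming you export it as NSCALE_API_KEY, as the later examples do):

import os
import openai

# Fail fast if the key is missing from the environment
if not os.getenv("NSCALE_API_KEY"):
    raise SystemExit("Set NSCALE_API_KEY before running the examples below.")
print("openai", openai.__version__, "installed; environment looks good")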

Step 2: Sending an inference request

Let’s walk through an example in which we summarise a blog post in 100 words.

Request structure

Each request to the Nscale Chat Completions API endpoint should include the following (a raw-request sketch follows the list):

  1. Headers:

    • "Authorization": "Bearer <API-KEY>"

    • "Content-Type": "application/json"

  2. Payload:

    • "model": "<model id e.g., meta-llama/Llama-3.1-8B-Instruct>"

    • "messages": "<array of messages to send to the model>"

Example use case: Summarise a blog post

import os
import openai

# Read the API key from the environment rather than hard-coding it
nscale_api_key = os.getenv("NSCALE_API_KEY")
nscale_base_url = "https://inference.api.nscale.com/v1"

# The OpenAI client works with Nscale by pointing it at the Nscale base URL
client = openai.OpenAI(
    api_key=nscale_api_key,
    base_url=nscale_base_url
)

blog_text = "Serverless inference simplifies access to AI models..."

# The system message sets the task; the user message carries the blog text
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "Provide a summary of the blog post in 100 words."},
        {"role": "user", "content": blog_text}
    ]
)

# Print the first (and here only) completion
print(response.choices[0].message.content)
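
Standard OpenAI request parameters should also be accepted here. For example, max_tokens caps the length of the reply, mirroring the CLI’s --max-tokens flag shown in Step 4; that the endpoint honours this parameter is an assumption based on its OpenAI compatibility.

# Same request as above, but with the output capped at roughly 150 tokens
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "Provide a summary of the blog post in 100 words."},
        {"role": "user", "content": blog_text}
    ],
    max_tokens=150  # assumption: standard OpenAI parameter is honoured
)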

Step 3: Understanding the response

The API will return a JSON object containing the model’s output and token usage:

Example Response:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "In this article, the author discusses the challenges of deploying Artificial Intelligence (AI) models in real-world applications..."
      }
    }
  ],
  "usage": {
    "completion_tokens": 175,
    "prompt_tokens": 1172,
    "total_tokens": 1347
  }
}

Key Fields:

  • choices: An array of choice objects, each containing a message with the model’s output.

  • usage: An object containing the input (prompt_tokens), output (completion_tokens), and total number of tokens used.
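
With the Python client from Step 2, these fields are exposed as attributes on the response object rather than raw JSON:

# Reusing the response object from the Step 2 example
print(response.choices[0].message.content)  # the model's output
print(response.usage.prompt_tokens)         # input tokens
print(response.usage.completion_tokens)     # output tokens
print(response.usage.total_tokens)          # total tokens used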

Step 4: Using the CLI for chat inference

You can also use the Nscale CLI to interact with chat models. This is a convenient way to test models or build command-line applications.

Prerequisites

To follow these examples, install the Nscale CLI and have your API key available (see the CLI documentation for installation details).

Examples

Here are some examples of using the CLI for chat inference:

# Generate a single response
nscale chat "What is machine learning?" -a $NSCALE_API_KEY -m meta-llama/Llama-3.1-8B-Instruct

# Start an interactive chat session
nscale chat -i -a $NSCALE_API_KEY -m meta-llama/Llama-3.1-8B-Instruct

# Use a custom system message
nscale chat --message "system:You are a creative storyteller" -a $NSCALE_API_KEY -m meta-llama/Llama-3.1-8B-Instruct

# Get usage statistics in JSON format
nscale chat "Explain quantum computing" --stats -a $NSCALE_API_KEY -m meta-llama/Llama-3.1-8B-Instruct

# Limit the response length
nscale chat "Write a story" --max-tokens 100 -a $NSCALE_API_KEY -m meta-llama/Llama-3.1-8B-Instruct

# Supply chat history
nscale chat -a $NSCALE_API_KEY -m meta-llama/Llama-3.1-8B-Instruct \
  --message "system:You are a helpful assistant" \
  --message "user:What is your name?"

# Start interactive mode with chat history
nscale chat -i -a $NSCALE_API_KEY -m meta-llama/Llama-3.1-8B-Instruct \
  --message "system:You are a helpful assistant" \
  --message "user:Hello" \
  --message "assistant:Hello! How can I help?"

# Use API key from environment variable
export NSCALE_API_KEY=your_api_key
nscale chat "What is machine learning?" -m meta-llama/Llama-3.1-8B-Instruct

For more details on CLI usage, refer to the CLI documentation.

Step 5: Monitoring and scaling

Nscale handles scaling automatically based on traffic patterns—no manual intervention needed! Use the Nscale Console to monitor:

  • API usage by model

  • Spend breakdowns

For custom models or high-throughput applications on dedicated endpoints, contact Nscale Support.

Troubleshooting

Common status codes and their meanings:

Status  Description                       Response Format
200     Success (synchronous)             application/json response with completion
201     Success (streaming)               text/event-stream with delta updates
401     Invalid API key or unauthorized   Error object
404     Model not found or unavailable    Error object
429     Insufficient credit               Error object
500     Internal server error             Error object
503     Service temporarily unavailable   Error object
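
The 201/streaming row refers to server-sent events. With the OpenAI-compatible Python client from Step 2, a stream of delta updates can be requested with stream=True; this is a sketch that assumes Nscale honours the standard OpenAI streaming parameter, as the text/event-stream format above suggests.

import os
import openai

client = openai.OpenAI(
    api_key=os.getenv("NSCALE_API_KEY"),
    base_url="https://inference.api.nscale.com/v1"
)

# stream=True yields chunks carrying incremental "delta" updates
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks carry no text (e.g. role or finish markers)
        print(delta, end="", flush=True)
print()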

Success Response Format (200)

{
  "id": "cmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "..."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 56,
    "completion_tokens": 31,
    "total_tokens": 87
  }
}
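
One field worth checking is finish_reason: in OpenAI-compatible APIs, "stop" means the model ended naturally, while "length" means the reply was cut off at the token limit. For example, with the response object from the Step 2 example:

choice = response.choices[0]
if choice.finish_reason == "length":
    # The reply hit the token limit; raise max_tokens or shorten the prompt
    print("Warning: output was truncated")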

Error Response Format

{
  "error": {
    "code": "TOO_MANY_REQUESTS",
    "message": "You have insufficient credit to run this request",
    "param": null,
    "error_type": "INSUFFICIENT_CREDIT"
  }
}
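
In the Python client, these error responses surface as exceptions rather than return values. A minimal handling sketch, assuming the openai library’s standard exception classes map onto the status codes above:

import os
import openai

client = openai.OpenAI(
    api_key=os.getenv("NSCALE_API_KEY"),
    base_url="https://inference.api.nscale.com/v1"
)

try:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Hello!"}]
    )
    print(response.choices[0].message.content)
except openai.AuthenticationError as e:  # 401: invalid API key
    print("Check your NSCALE_API_KEY:", e)
except openai.RateLimitError as e:       # 429: e.g. insufficient credit
    print("Insufficient credit or rate limited:", e)
except openai.APIStatusError as e:       # any other non-2xx status
    print(f"Request failed with status {e.status_code}: {e.message}")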

For the full list of error codes and how to handle them, see the error code page.

By following this guide, you’ll be able to easily integrate chat models into your application using Nscale’s serverless inference service.

Contact Support

Need assistance? Get help from our support team.