This guide will walk you through integrating a chat model into your application using Nscale’s API. With our serverless architecture, you can focus on building your application without worrying about infrastructure management.
Let’s walk through an example where we summarise a blog post into 100 words.

Request structure

Each request to the Nscale Chat Completions API endpoint should include the following:
Headers:
"Authorization": "Bearer <SERVICE-TOKEN>"
"Content-Type": "application/json"
Payload:
"model": "<model id e.g., meta-llama/Llama-3.1-8B-Instruct>"
"messages": "<array of messages to send to the model>"
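Putting the headers and payload together, a request body might look like the following sketch. The service token, model id, and message contents are placeholders, and the JSON is built with the standard library only:

```python
import json

# Placeholder service token -- in practice, read this from a secure source
# such as an environment variable.
headers = {
    "Authorization": "Bearer <SERVICE-TOKEN>",
    "Content-Type": "application/json",
}

# Request body: a model id plus an array of chat messages.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"},
    ],
}

# Serialise the payload exactly as it would be sent over the wire.
print(json.dumps(payload, indent=2))
```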
Example use case: Summarise a blog post
```python
import os
import openai

nscale_service_token = os.getenv("NSCALE_SERVICE_TOKEN")
nscale_base_url = "https://inference.api.nscale.com/v1"

client = openai.OpenAI(
    api_key=nscale_service_token,
    base_url=nscale_base_url
)

blog_text = "Serverless inference simplifies access to AI models..."

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "Provide a summary of the blog post in 100 words."},
        {"role": "user", "content": blog_text}
    ]
)

print(response.choices[0].message.content)
```
Here are some examples of using the CLI for chat inference:
```shell
# Generate a single response
nscale chat "What is machine learning?" -t $NSCALE_SERVICE_TOKEN -m meta-llama/Llama-3.1-8B-Instruct

# Start an interactive chat session
nscale chat -i -t $NSCALE_SERVICE_TOKEN -m meta-llama/Llama-3.1-8B-Instruct

# Use a custom system message
nscale chat --message "system:You are a creative storyteller" -t $NSCALE_SERVICE_TOKEN -m meta-llama/Llama-3.1-8B-Instruct

# Get usage statistics in JSON format
nscale chat "Explain quantum computing" --stats -t $NSCALE_SERVICE_TOKEN -m meta-llama/Llama-3.1-8B-Instruct

# Limit the response length
nscale chat "Write a story" --max-tokens 100 -t $NSCALE_SERVICE_TOKEN -m meta-llama/Llama-3.1-8B-Instruct

# Supply chat history
nscale chat -t $NSCALE_SERVICE_TOKEN -m meta-llama/Llama-3.1-8B-Instruct \
  --message "system:You are a helpful assistant" \
  --message "user:What is your name?"

# Start interactive mode with chat history
nscale chat -i -t $NSCALE_SERVICE_TOKEN -m meta-llama/Llama-3.1-8B-Instruct \
  --message "system:You are a helpful assistant" \
  --message "user:Hello" \
  --message "assistant:Hello! How can I help?"

# Use service token from environment variable
export NSCALE_SERVICE_TOKEN=your_service_token
nscale chat "What is machine learning?" -m meta-llama/Llama-3.1-8B-Instruct
```
Error responses follow a structured format. For example, a request made without sufficient credit returns:

```json
{
  "error": {
    "code": "TOO_MANY_REQUESTS",
    "message": "You have insufficient credit to run this request",
    "param": null,
    "error_type": "INSUFFICIENT_CREDIT"
  }
}
```
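When calling the API over raw HTTP, you can branch on the `error_type` field of the error body. A minimal sketch; the helper name and the recovery advice strings are illustrative, not part of the API:

```python
import json

def describe_error(response_body: str) -> str:
    """Return a human-readable description of an error payload
    shaped like the example above."""
    error = json.loads(response_body)["error"]
    if error.get("error_type") == "INSUFFICIENT_CREDIT":
        return "Top up your account credit before retrying."
    # Fall back to the message supplied by the API.
    return error.get("message", "Unknown error")

# Example error body, matching the response shown above.
body = (
    '{"error": {"code": "TOO_MANY_REQUESTS", '
    '"message": "You have insufficient credit to run this request", '
    '"param": null, "error_type": "INSUFFICIENT_CREDIT"}}'
)
print(describe_error(body))  # Top up your account credit before retrying.
```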
For the full list of error codes and handling guidance, see the error code page.

By following this guide, you’ll be able to integrate chat models into your application using Nscale’s serverless inference service.