Rate Limits
Rate limits define the maximum number of requests a user can make to Nscale’s serverless inference service within a given time frame.
Rate limits are applied to ensure efficient use of resources, maintain system stability, and provide fair access to all users. These limits may vary based on the type of model, your subscription plan, or specific API endpoints.
Purpose of rate limits
The implementation of rate limits serves several critical purposes:
- Protecting resources: Rate limits prevent resource exhaustion by ensuring that no single user or process monopolises system resources. This is especially important in serverless environments where scaling is automatic but not free.
- Ensuring fair access: By capping the number of requests per user or API key, rate limits ensure equitable access to services for all users.
- Preventing abuse: They act as a safeguard against malicious activities such as Distributed Denial of Service (DDoS) attacks or brute force attempts.
- Cost management: Rate limits help control operational costs by preventing runaway resource consumption due to bugs or heavy traffic spikes.
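To make the mechanics concrete, here is a minimal sketch of a token bucket, one common way request caps like those above are implemented. This is a generic illustration, not Nscale's implementation: each request spends one token, and tokens refill at a fixed rate up to a burst capacity.

```python
import time

class TokenBucket:
    """Generic token-bucket limiter: refills `rate` tokens/second, holds at most `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = time.monotonic()

    def allow(self) -> bool:
        """Refill tokens for the elapsed time, then try to spend one."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A bucket allowing a sustained 5 requests/second with a burst capacity of 10:
bucket = TokenBucket(rate=5, capacity=10)
allowed = sum(bucket.allow() for _ in range(20))  # roughly the burst capacity passes at once
```

A burst of 20 back-to-back calls drains the bucket after about the first 10; later requests succeed again as tokens refill over time.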
How we enforce rate limits
Nscale does not enforce rate limits for serverless inference, so you can scale dynamically without artificial constraints. Your workload is limited only by your allocated resources, which keeps performance consistent even under high demand.
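Because the service does not throttle you, cost control becomes a client-side concern. One simple pattern is to cap the number of in-flight requests with a semaphore. This is a sketch under assumptions: `call_inference` is a hypothetical stand-in for your actual client call, and the limit of 8 is arbitrary.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def call_inference(prompt: str) -> str:
    # Hypothetical placeholder; substitute your real inference client call here.
    return f"completion for: {prompt}"

# Cap concurrent requests at 8 so a bug or traffic spike cannot run up costs.
max_in_flight = threading.Semaphore(8)

def guarded_call(prompt: str) -> str:
    with max_in_flight:  # blocks until a slot is free
        return call_inference(prompt)

with ThreadPoolExecutor(max_workers=32) as pool:
    results = list(pool.map(guarded_call, [f"req-{i}" for i in range(100)]))
```

All 100 requests complete, but never more than 8 run against the service at once, giving you a predictable upper bound on concurrent spend.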