Serverless GPU Pricing - Pay per second, for exactly what you use

Get started with 10 hours of free credit, no credit card required.


Designed for small teams and independent developers looking to deploy their models in minutes without worrying about the cost.


Built for fast-growing startups and larger organizations looking to scale quickly at an affordable cost with the latency they need.

Pay for exactly what you use

Kickstart your compute journey with $30 free credit


[Pricing table: per-second and per-hour rates for shared and dedicated instances of Nvidia A100 (80GB / 200GB), Nvidia A10, and Nvidia T4 GPUs]

Join Private Beta

Built for fast-growing startups

Min 10,000 inference requests per month
Unlimited deployed webhook endpoints
GPU concurrency of 5
15 days of log retention
Support via private Slack Connect within 48 working hours
Included credits: $30

Built for Enterprises

Min 100,000 inference requests per month
Unlimited deployed webhook endpoints
GPU concurrency of 50
365 days of log retention
Support via private Slack Connect and a dedicated support engineer
Included credits: custom


How does your billing work?

With Inferless, you only pay for the compute resources used to run your models. Our pricing is based on two factors:

Duration - You are charged for the total number of seconds your models are running in a healthy state, rounded up to the nearest second. The duration is calculated from when your model starts loading until it finishes processing a request.
Machine type - We offer different machine types like A100, A10 and T4 GPUs. The price per second varies based on the machine type you choose for your model. More powerful machines cost more per second.

For example, let's say you have autoscaling enabled with a maximum concurrency of 2 machines and use a dedicated A100 80GB as the machine type.

It runs on 1 machine for 14,400 sec (4 hrs) and on 2 machines for 10,800 sec (3 hrs).
With the A100 costing $0.0014/sec, total usage is 14,400 + 2 × 10,800 = 36,000 machine-seconds.
Your monthly bill is 36,000 * $0.0014 = $50.40.
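The billing model above can be sketched in a few lines of Python. This is an illustrative sketch, not an official Inferless API: the rate constant and helper names are assumptions, and the only figure taken from the example is the $0.0014/sec dedicated A100 rate.

```python
import math

A100_RATE_PER_SEC = 0.0014  # dedicated A100 80GB, $/sec (from the example above)

def billed_seconds(duration_sec: float) -> int:
    """Each machine's running time is rounded up to the nearest second."""
    return math.ceil(duration_sec)

def monthly_bill(machine_runs_sec: list[float], rate_per_sec: float) -> float:
    """Sum billed seconds across every machine, then apply the per-second rate."""
    total = sum(billed_seconds(d) for d in machine_runs_sec)
    return round(total * rate_per_sec, 2)

# Machine 1 is up the whole 7 hours (25,200 s); machine 2 joins for 3 hours (10,800 s).
print(monthly_bill([25200, 10800], A100_RATE_PER_SEC))  # → 50.4
```

Note that each machine is billed independently, so scaling to two machines for three hours adds a full 10,800 billed seconds on top of the first machine's running time.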

What kind of applications can I deploy using Inferless?

You can deploy any machine learning model that runs on GPUs, including compute-intensive deep learning workloads across computer vision, NLP, recommendations, and scientific computing. Some of the most popular models our users have deployed are Llama 2 13B, Stable Diffusion ControlNet, and Vicuna 7B.

What is the difference between Shared and Dedicated Instance?

The distinction between shared and dedicated instances revolves around resource allocation and performance. Shared instances allocate GPU resources among several users, offering variable performance at a cost-effective rate, making them suitable for smaller or infrequent tasks. In contrast, dedicated instances grant users exclusive access to an entire GPU, delivering consistent high performance but at a higher cost. This setup is optimal for large-scale tasks or when data isolation is a priority. Your choice should hinge on workload demands, desired performance, and budget.

Do you offer discounts for startups?

We provide a $30 free credit to help you kickstart. Since we are currently in private beta, your use case and stage need to match our criteria. You can read more about it here.

How secure is Inferless?

Customer data and privacy is our top priority. Inferless execution environments are completely isolated from each other using Docker containerization. This prevents any interaction between individual customer environments.
All log streams are separated securely using AWS CloudWatch Logs access controls. Logs are retained only for 30 days and then deleted as per our strict data retention policies.
The storage used for model hosting is encrypted using AES-256 encryption. Models and data are not shared across customers.

If I don't run any inference, will I still be charged?

With Inferless, you only pay for what you use. When the minimum replica count is set to zero, no machines are spun up, so you are not charged when there are no inference requests.

What GPUs are available?

We run our workloads on Nvidia A100, A10, and T4 GPUs, so you get blazing-fast inference.

How does your pricing work?

Pay for what you use: per-second billing with no upfront costs, and typically up to 80% cost savings.

What is the latency?

For first-time calls you may see a cold start of 10-20 seconds, but successive calls depend only on inference time.

Does it support large custom models?

Yes, we support models up to 16GB in size. For larger models, feel free to speak with us and we will help you out.


Is my data secure?

Yes. Your models are deployed in a completely isolated environment, and stored models are encrypted at rest.

Can I change/cancel my plan anytime?

Yes. You can set an upper spending limit for each project/workspace, and you can disable a model at any time to stop billing.