
How to Deploy Hugging Face Models on Nvidia Triton Inference Server at Scale

July 17, 2023
5 mins read
Nilesh Agarwal
Cofounder & CTO

Introduction

Nvidia Triton Inference Server has experienced significant growth as a model deployment tool, thanks to its robust features, scalability, and flexibility. Combined with the expansive library of Hugging Face, which offers state-of-the-art natural language processing capabilities, it opens up immense possibilities for AI deployments. In this tutorial, we'll explain how to efficiently package and deploy Hugging Face models using Nvidia Triton Inference Server, making them production-ready in no time.

By harnessing the power of these tools, you can efficiently and effortlessly manage machine learning models at scale. We'll also show you how to deploy Triton Inference containers in GKE and make efficient use of GPUs. If you're interested in deploying multiple machine learning models at scale with infrastructure specifically designed around Hugging Face, this information can be useful for you.

Step 1 - How to use Hugging Face Pipelines

To use a pipeline, you simply instantiate the pipeline object with the name of the task you want to perform and the name of the pre-trained model you want to use. For example, to perform text generation on some text using the GPT-Neo model, you would do the following:
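
A minimal sketch, using the EleutherAI/gpt-neo-125M checkpoint as an illustrative example (any GPT-Neo variant works the same way):

from transformers import pipeline

# Instantiate a text-generation pipeline backed by a GPT-Neo checkpoint.
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125M")

# The pipeline tokenizes the prompt, runs the model, and decodes the output.
result = generator("Hello, I am a language model,", max_length=50, do_sample=True)
print(result[0]["generated_text"])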

One of the key advantages of using pipelines is that they handle all of the data pre-processing and post-processing necessary for the task, such as tokenization and formatting of input and output data, making it very easy to get started with using pre-trained models for NLP tasks. Additionally, pipelines can be easily customized and extended to support new tasks or models, providing a flexible and modular approach to NLP.

Step 2 - Deploying a Hugging Face model on Nvidia Triton

NVIDIA Triton (previously known as TensorRT Inference Server) is open-source inference-serving software that simplifies the deployment of AI models at scale. To deploy a Hugging Face model on NVIDIA Triton, you can take one of two approaches:

  1. Convert the model to ONNX and push the files to the Triton model repository
  2. Use the Hugging Face Pipeline with the template method to deploy the model (recommended)

We found the Hugging Face Pipeline approach to be much faster at generating tokens, and it needs less code to get started.

Here is how we did it: you just need to break the above pipeline code into two parts.

Part 1: Wrapping the code in model.py
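
A minimal sketch of a model.py for Triton's Python backend that wraps the pipeline; the tensor names TEXT and OUTPUT_TEXT, the checkpoint, and the generation parameters are illustrative and must match the config file in the next part:

import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import pipeline

class TritonPythonModel:
    def initialize(self, args):
        # Load the Hugging Face pipeline once, when Triton loads the model.
        # device=0 places the pipeline on the first GPU.
        self.generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125M", device=0)

    def execute(self, requests):
        responses = []
        for request in requests:
            # Read the input string tensor named "TEXT" (must match config.pbtxt).
            text = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()[0].decode("utf-8")
            generated = self.generator(text, max_length=50, do_sample=True)[0]["generated_text"]
            # Wrap the generated text in the output tensor named "OUTPUT_TEXT".
            out_tensor = pb_utils.Tensor("OUTPUT_TEXT", np.array([generated], dtype=object))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        # Called when Triton unloads the model; release the pipeline.
        self.generator = None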

Part 2: Create a config file

You also need to add a config.pbtxt file so that Triton understands how to process the model.
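
An illustrative config.pbtxt that matches the tensor names used in model.py above (adjust the name, dims, and instance_group for your setup):

name: "gpt-neo"
backend: "python"
max_batch_size: 0

input [
  {
    name: "TEXT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

output [
  {
    name: "OUTPUT_TEXT"
    data_type: TYPE_STRING
    dims: [ 1 ]
  }
]

instance_group [
  {
    kind: KIND_GPU
  }
]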

Once you have these files, make sure they are laid out in the following folder structure.
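
For example, assuming a model named gpt-neo with a single version 1:

model-repo/
└── gpt-neo/
    ├── config.pbtxt
    └── 1/
        └── model.py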

Step 3 - Deploying Triton Inference containers in GKE

Now that the model is ready, the next step is to deploy Nvidia Triton and pass it the model repository link. First, go to Amazon S3 (you can also use Azure or GCP buckets) and create a bucket for your models.

s3://model-bucket/model-repo/<Model Folder>

In the bucket, create a folder such as model-repo and push the model files into it under a folder named gpt-neo.

You should also set up AWS credentials with access to your S3 bucket and keep them ready for the next step.
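
One way to do this is to store the credentials in a Kubernetes secret and expose them to the Triton container as the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_DEFAULT_REGION environment variables Triton reads for S3 access (the secret name aws-credentials below is illustrative):

kubectl create secret generic aws-credentials \
  --from-literal=AWS_ACCESS_KEY_ID=<your-access-key-id> \
  --from-literal=AWS_SECRET_ACCESS_KEY=<your-secret-access-key> \
  --from-literal=AWS_DEFAULT_REGION=<your-region>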

triton-deploy.yaml
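
A minimal sketch of the deployment, assuming the aws-credentials secret from the previous step and one GPU per pod; the image tag, model repository path, and resource values are illustrative:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton
  labels:
    app: triton
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:23.06-py3
          args:
            - tritonserver
            - --model-repository=s3://model-bucket/model-repo
            - --model-control-mode=explicit
            - --load-model=gpt-neo
          envFrom:
            - secretRef:
                name: aws-credentials
          ports:
            - containerPort: 8000   # HTTP
            - containerPort: 8001   # gRPC
            - containerPort: 8002   # metrics
          resources:
            limits:
              nvidia.com/gpu: 1

Note that the stock Triton image does not include the transformers library, so for the Python-backend model above you may need to build a custom image on top of it (or otherwise install the dependency).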

Next, create a Kubernetes cluster and apply the files above with kubectl, starting with the deployment:

kubectl apply -f triton-deploy.yaml

Once you have deployed the container, deploy the service.

triton-service.yaml
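
An illustrative service exposing Triton's HTTP, gRPC, and metrics ports through an external load balancer:

apiVersion: v1
kind: Service
metadata:
  name: triton
spec:
  type: LoadBalancer
  selector:
    app: triton
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: grpc
      port: 8001
      targetPort: 8001
    - name: metrics
      port: 8002
      targetPort: 8002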

Apply the file to create a service

kubectl apply -f triton-service.yaml

You can go to the Cluster → Services tab to find the External IP for the service.

Once this is deployed and you have the external IP, use the command below to call inference.
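
A sketch of the call using Triton's HTTP inference API, assuming the gpt-neo model and the TEXT / OUTPUT_TEXT tensors defined earlier (replace <EXTERNAL_IP> with the service's external IP):

curl -X POST http://<EXTERNAL_IP>:8000/v2/models/gpt-neo/infer \
  -H "Content-Type: application/json" \
  -d '{
        "inputs": [
          {
            "name": "TEXT",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Hello, I am a language model,"]
          }
        ]
      }'

The generated text comes back in the OUTPUT_TEXT tensor of the JSON response.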

Step 4 - Efficient utilization of GPUs

To make sure the GPU is used efficiently, Triton provides APIs to load and unload models on demand. You can use these controls:

You can call the POST API to load the model:
/v2/repository/models/<Model Name>/load

You can call the POST API to unload the model:
/v2/repository/models/<Model Name>/unload
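
For example, assuming the deployment above (the server must be started with --model-control-mode=explicit for these endpoints to accept load and unload requests):

curl -X POST http://<EXTERNAL_IP>:8000/v2/repository/models/gpt-neo/load
curl -X POST http://<EXTERNAL_IP>:8000/v2/repository/models/gpt-neo/unload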

This allows multiple models to share the GPU memory, which can help optimize memory usage and improve performance.

When running machine learning models on GPUs, memory usage can be a limiting factor. GPUs typically have a limited amount of memory, and running multiple models in parallel can quickly exhaust it. This is where explicit loading and unloading comes in.

By keeping only the actively used models in memory, you allow multiple models to share the GPU: when a model is not being used, its memory can be freed up for other models. This makes more efficient use of GPU resources and improves performance.

Conclusion

In this tutorial, we explained how to deploy Hugging Face models on Nvidia Triton Inference Server for easy serving of models in production environments. We covered using Hugging Face Pipelines, deploying a Hugging Face model on Nvidia Triton, deploying Triton Inference containers in GKE, and efficient utilization of GPUs to optimize memory usage and improve performance.
