
Alok, Tark & Saurav
This blog explains how to take models from Hugging Face, a popular machine learning library, and package them with NVIDIA Triton, an open-source inference serving software, so they can be easily deployed and served in production environments.
It also covers deploying Triton Inference Server containers on GKE and using Triton's model control APIs for efficient utilization of GPUs. If you're interested in deploying machine learning models, this walkthrough should be useful for you.
Step 1 - Getting started with Hugging Face
a) Understanding Hugging Face Transformers
Hugging Face Transformers is an open-source library that provides state-of-the-art natural language processing (NLP) capabilities through pre-trained models built on the transformer architecture.
The library provides a high-level API for easily downloading and using pre-trained models for tasks such as text classification, question-answering, summarization, and more. The library also provides functionality for fine-tuning these pre-trained models on specific tasks using transfer learning.
Hugging Face Transformers supports a wide variety of pre-trained models, including BERT, GPT-2, RoBERTa, T5, and many others. The library also provides utilities for tokenization, data preprocessing, and evaluation of models.
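As a minimal sketch of that high-level API (using the same GPT-Neo checkpoint we deploy later in this post), downloading a pre-trained model and its tokenizer and generating text directly looks roughly like this:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Download (and cache) the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B")

# Tokenize a prompt, generate a continuation, and decode it back to text
inputs = tokenizer("I really enjoyed this movie", return_tensors="pt")
outputs = model.generate(**inputs, do_sample=True, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))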
b) Understanding Hugging Face Pipelines
Hugging Face pipelines are a convenient and user-friendly way to quickly apply pre-trained models from the Hugging Face Transformers library to a wide range of natural language processing (NLP) tasks.
Pipelines provide a simple API for running NLP tasks such as text classification, named entity recognition, text generation, and more, without requiring the user to have a deep understanding of the underlying models or their architecture.
To use a pipeline, you simply instantiate the pipeline object with the name of the task you want to perform and the name of the pre-trained model you want to use. For example, to perform text generation on some text using the GPT-Neo model, you would do the following:
from transformers import pipeline
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-1.3B')
result = generator('I really enjoyed this movie', do_sample=True, min_length=50)
print(result)
# I really enjoyed this movie. I’m glad this movie was so big it won Best
# Picture at the Oscars. I really was not expecting the movie to do anything
# special, and it totally surprised me by doing so. The movie was awesome!
One of the key advantages of using pipelines is that they handle all of the data pre-processing and post-processing necessary for the task, such as tokenization and formatting of input and output data, making it very easy to get started with using pre-trained models for NLP tasks. Additionally, pipelines can be easily customized and extended to support new tasks or models, providing a flexible and modular approach to NLP.
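As another small illustration, a text-classification pipeline takes raw strings in and returns labels and scores out, with all tokenization handled internally. This sketch uses the library's default sentiment-analysis model, so treat the exact model and scores as assumptions:
from transformers import pipeline

# The pipeline downloads a default sentiment model if none is specified
classifier = pipeline("sentiment-analysis")
print(classifier(["I really enjoyed this movie", "The plot made no sense"]))
# Expected output is a list of dicts along the lines of:
# [{'label': 'POSITIVE', 'score': 0.99...}, {'label': 'NEGATIVE', 'score': 0.99...}]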
Step 2 - Deploying a Hugging Face model on NVIDIA Triton
NVIDIA Triton (previously known as TensorRT Inference Server) is an open-source inference-serving software that simplifies the deployment of AI models at scale. To deploy a Hugging Face model on NVIDIA Triton, you can take one of two approaches:
- Convert the model to ONNX and push the files to the Triton model repository
- Use the Hugging Face pipeline with Triton's Python backend template (recommended)
Although you might expect the ONNX model to be faster, when we actually generated tokens we found the pipeline approach to be much faster, and it needs less code to get started.
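For reference, here is a sketch of the first approach. One common way to export a Hugging Face model to ONNX is the optimum CLI (this assumes the optimum package is installed; the output directory name is just an example). The resulting files would then go into the Triton model repository under an ONNX Runtime backend model:
pip install optimum[exporters]
optimum-cli export onnx --model EleutherAI/gpt-neo-1.3B gpt_neo_onnx/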
Coming back to the pipeline approach, here is how we did it. You just need to break the pipeline code above into two parts, which become the initialize and execute methods of a Triton Python backend model:
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import pipeline

class TritonPythonModel:
    def initialize(self, args):
        # Load the text-generation pipeline once, when Triton loads the model
        self.generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

    def execute(self, requests):
        responses = []
        for request in requests:
            # Decode the byte tensor into text
            input_tensor = pb_utils.get_input_tensor_by_name(request, "prompt")
            prompt = input_tensor.as_numpy()[0].decode()

            # Call the model pipeline
            pipeline_output = self.generator(prompt, do_sample=True, min_length=50)
            generated_text = pipeline_output[0]["generated_text"]

            # Encode the text back to a byte tensor to send in the response
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[
                    pb_utils.Tensor(
                        "generated_text",
                        np.array([generated_text.encode()]),
                    )
                ]
            )
            responses.append(inference_response)
        return responses

    def finalize(self, args):
        # Release the pipeline so the GPU memory can be reclaimed on unload
        self.generator = None
You also need to add a config.pbtxt file so that Triton understands how to process the model's inputs and outputs:
name: "gpt-neo"
backend: "python"
input [
{
name: "prompt"
data_type: TYPE_STRING
dims: [-1]
}
]
output [
{
name: "generated_text"
data_type: TYPE_STRING
dims: [-1]
}
]
instance_group [
{
kind: KIND_GPU
}
]
Once you have these files, make sure they are arranged in the following folder structure.
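This is the standard Triton model repository layout for a Python backend model: the directory name matches the model name in config.pbtxt, and model.py contains the TritonPythonModel class from above.
model_repo/
└── gpt-neo/
    ├── config.pbtxt
    └── 1/
        └── model.py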
Step 3 - Deploying Triton Inference Server containers on GKE
Now that you have the model ready, the next step is to deploy NVIDIA Triton and point it at the model repository. First, go to Amazon S3 (you can also use Azure or GCP buckets) and create a bucket for your models:
s3://model-bucket
In the bucket, create a folder such as model_repo and upload the model files in the structure shown above.
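For example, assuming the bucket and folder names above and that the AWS CLI is configured locally, you can upload the repository like this:
aws s3 cp model_repo s3://model-bucket/model_repo --recursive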
Create a Kubernetes cluster, then apply the following files with kubectl. You should add your AWS credentials so that Triton can access your bucket.
kubectl apply -f triton-deploy.yaml
triton-deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-deployment
  labels:
    app: triton-server
spec:
  selector:
    matchLabels:
      app: triton-server
  replicas: 1
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
        - name: serving
          image: nvcr.io/nvidia/tritonserver:22.08-py3
          env:
            - name: AWS_ACCESS_KEY_ID
              value: <Sample ID>
            - name: AWS_SECRET_ACCESS_KEY
              value: <Sample Key>
            - name: AWS_DEFAULT_REGION
              value: us-east-1
          ports:
            - name: grpc
              containerPort: 8001
            - name: http
              containerPort: 8000
            - name: metrics
              containerPort: 8002
          resources:
            limits:
              nvidia.com/gpu: 1
          command: [ "tritonserver", "--model-store=s3://<bucketname>/model_repo", "--model-control-mode=explicit", "--exit-on-error=false" ]
Once you have deployed the container, deploy the service:
kubectl apply -f triton-service.yaml
triton-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: triton-server
  labels:
    app: triton-server
spec:
  selector:
    app: triton-server
  ports:
    - protocol: TCP
      port: 80
      name: http
      targetPort: 8000
    - protocol: TCP
      port: 443
      name: https
      targetPort: 8000
    - protocol: TCP
      port: 8001
      name: grpc
      targetPort: 8001
    - protocol: TCP
      port: 8002
      name: metrics
      targetPort: 8002
  type: LoadBalancer
Once this is deployed, you can find the external IP of the service from the cluster and call the inference endpoint. Note that because the server was started with --model-control-mode=explicit, the model is not loaded at startup; load it first using the repository API described in Step 4.
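You can look up the external IP with kubectl, for example:
kubectl get service triton-server
The EXTERNAL-IP column gives the address to use in the inference request below.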
curl --location --request POST 'http://<<IP-Address>>/v2/models/gpt-neo/infer' \
--header 'Content-Type: application/json' \
--data-raw '{
  "inputs": [
    {
      "name": "prompt",
      "shape": [1],
      "datatype": "BYTES",
      "data": ["I really enjoyed this"]
    }
  ]
}'
Step 4 - Efficient utilization of GPUs
To make sure the GPU is used efficiently, Triton gives you APIs to load and unload models on demand. You can use these controls:
You can call the POST API to load the model
/v2/repository/models/<Model Name>/load
You can call the POST API to unload the model
/v2/repository/models/<Model Name>/unload
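For example, with the service deployed above and the gpt-neo model name from config.pbtxt, these calls look roughly like this:
curl --request POST 'http://<<IP-Address>>/v2/repository/models/gpt-neo/load'
curl --request POST 'http://<<IP-Address>>/v2/repository/models/gpt-neo/unload'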
This allows multiple models to share the GPU memory, which can help optimize memory usage and improve performance.
When running machine learning models on GPUs, memory is often the limiting factor: GPUs have a fixed amount of memory, and keeping several large models resident at once can quickly exhaust it. By keeping only the active model in memory and unloading the rest, you free up memory for other models and make more efficient use of GPU resources.
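As a sketch of what this looks like from client code (assuming the tritonclient package is installed; the URL is a placeholder for the service IP from Step 3), you can load the model just before inference and unload it afterwards:
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="<<IP-Address>>:80")

# Load the model into GPU memory only when it is needed
client.load_model("gpt-neo")

# Build the BYTES input tensor described in config.pbtxt above
prompt = httpclient.InferInput("prompt", [1], "BYTES")
prompt.set_data_from_numpy(np.array(["I really enjoyed this".encode("utf-8")], dtype=np.object_))

result = client.infer("gpt-neo", [prompt])
print(result.as_numpy("generated_text")[0].decode())

# Unload the model so its GPU memory is available to other models
client.unload_model("gpt-neo")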
We hope this walkthrough is helpful. Let us know if you have any questions or feedback!