Hugging Face blog


Alok, Tark & Saurav



This blog explains how to take models from Hugging Face, a hub and set of libraries for machine learning models, and package them for NVIDIA Triton, an open-source inference server. This makes it straightforward to deploy and serve the models in production environments.

It also covers deploying Triton Inference containers in GKE and using GPUs efficiently.

Step 1 - Getting started with Hugging Face

a) Understanding Hugging Face Transformers

Hugging Face Transformers is an open-source library that provides state-of-the-art natural language processing (NLP) capabilities using pre-trained models based on deep learning techniques such as transformer models.

The library provides a high-level API for easily downloading and using pre-trained models for tasks such as text classification, question-answering, summarization, and more. The library also provides functionality for fine-tuning these pre-trained models on specific tasks using transfer learning.

Hugging Face Transformers supports a wide variety of pre-trained models, including BERT, GPT-2, RoBERTa, T5, and many others. The library also provides utilities for tokenization, data preprocessing, and evaluation of models.

b) Understanding Hugging Face Pipelines

Hugging Face pipelines are a convenient and user-friendly way to quickly apply pre-trained models from the Hugging Face Transformers library to a wide range of natural language processing (NLP) tasks.

Pipelines provide a simple API for running NLP tasks such as text classification, named entity recognition, text generation, and more, without requiring the user to have a deep understanding of the underlying models or their architecture.

To use a pipeline, you simply instantiate the pipeline object with the name of the task you want to perform and the name of the pre-trained model you want to use. For example, to perform text generation on some text using the GPT-Neo model, you would do the following:

from transformers import pipeline
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-1.3B')
result = generator('I really enjoyed this movie', do_sample=True, min_length=50)
print(result[0]['generated_text'])
# Example output (sampling is random, so your text will differ):
# I really enjoyed this movie. I’m glad this movie was so big it won Best
# Picture at the Oscars. I really was not expecting the movie to do anything
# special, and it totally surprised me by doing so. The movie was awesome!

One of the key advantages of using pipelines is that they handle all of the data pre-processing and post-processing necessary for the task, such as tokenization and formatting of input and output data, making it very easy to get started with using pre-trained models for NLP tasks. Additionally, pipelines can be easily customized and extended to support new tasks or models, providing a flexible and modular approach to NLP.
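To make the pre-processing and post-processing concrete, here is a rough sketch of the three stages a text-generation pipeline wraps: tokenize, generate, decode. It is an illustration, not the library's internal implementation. The helper names `load_tokenizer_and_model` and `generate_text` are hypothetical; the Transformers calls they use (`AutoTokenizer`, `AutoModelForCausalLM`, `generate`, `decode`) are part of the public API.

```python
def load_tokenizer_and_model(model_name="EleutherAI/gpt-neo-1.3B"):
    """Download (or load from cache) the tokenizer and causal language model."""
    # Imported lazily so generate_text can be reused or tested without
    # Transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    return tokenizer, model


def generate_text(tokenizer, model, prompt, **generate_kwargs):
    """Run the three stages that a text-generation pipeline hides."""
    # 1. Pre-processing: text -> token IDs (PyTorch tensors)
    inputs = tokenizer(prompt, return_tensors="pt")
    # 2. Model call: extend the prompt with newly generated token IDs
    output_ids = model.generate(**inputs, **generate_kwargs)
    # 3. Post-processing: token IDs -> text
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage (downloads the model on first run):
#   tokenizer, model = load_tokenizer_and_model()
#   print(generate_text(tokenizer, model, "I really enjoyed this movie",
#                       do_sample=True, min_length=50))
```

The pipeline call replaces all of this with a single line, which is why it is the shorter way to get started.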

Step 2 - Deploying a Hugging Face model on NVIDIA Triton

NVIDIA Triton (previously known as TensorRT Inference Server) is open-source inference-serving software that simplifies the deployment of AI models at scale. There are two ways to deploy a Hugging Face model on NVIDIA Triton:

  1. Convert the model to ONNX and push the files to the Triton Model Repository
  2. Use the Hugging Face Pipeline with the Template Method to deploy the model (Recommended)

Although you might expect the ONNX model to be faster, in our tests the pipeline generated tokens much faster and needed less code to get started.

Here is how we did it. You just need to break the pipeline code above into two parts: loading the pipeline in initialize, and calling it in execute.

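import app
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import pipeline

# `app` is a local module (not part of Triton or Transformers);
# the object is created here but not used further in this snippet
inferless_model = app.InferlessPythonModel()


class TritonPythonModel:
    def initialize(self, args):
        # Part 1: load the pipeline once, when Triton loads the model
        self.generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

    def execute(self, requests):
        responses = []
        for request in requests:
            # Decode the byte tensor into text
            input_tensor = pb_utils.get_input_tensor_by_name(request, "prompt")
            input_string = input_tensor.as_numpy()[0].decode()

            # Part 2: call the model pipeline
            pipeline_output = self.generator(input_string, do_sample=True, min_length=50)
            generated_txt = pipeline_output[0]["generated_text"]

            # Encode the text to a byte tensor to send back
            output_tensor = pb_utils.Tensor(
                "generated_text",
                np.array([generated_txt.encode()], dtype=np.object_),
            )
            inference_response = pb_utils.InferenceResponse(output_tensors=[output_tensor])
            responses.append(inference_response)
        # Triton expects exactly one response per request, in order
        return responses

    def finalize(self):
        # Drop the reference so the model can be freed on unload
        self.generator = None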

You also need to add a config.pbtxt file so that Triton understands how to process the model:

name: "gpt-neo"
backend: "python"
input [
  {
    name: "prompt"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
output [
  {
    name: "generated_text"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
instance_group [
  {
    kind: KIND_GPU
  }
]

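Once you have these files, arrange them in the folder structure Triton expects: one directory per model holding config.pbtxt, plus a numbered version directory holding the Python file (model.py is the file name the Python backend looks for by default):

model_repo/
  gpt-neo/
    config.pbtxt
    1/
      model.py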
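Before uploading, it can help to check the layout locally. The sketch below assumes Triton's standard model repository conventions: a `config.pbtxt` at the top of the model directory, at least one numerically named version directory, and a `model.py` inside it for the Python backend. The function name `check_model_dir` is hypothetical.

```python
from pathlib import Path


def check_model_dir(model_dir):
    """Return a list of layout problems for one model directory (empty if none)."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        return ["model directory does not exist"]

    problems = []
    if not (model_dir / "config.pbtxt").is_file():
        problems.append("missing config.pbtxt")

    # Version directories are named with integers (1, 2, ...); names that
    # start with zero are skipped here.
    versions = sorted(
        (d for d in model_dir.iterdir()
         if d.is_dir() and d.name.isdigit() and not d.name.startswith("0")),
        key=lambda d: int(d.name),
    )
    if not versions:
        problems.append("no numeric version directory (for example 1/)")
    for version in versions:
        if not (version / "model.py").is_file():
            problems.append(f"missing {version.name}/model.py")
    return problems
```

For example, `check_model_dir("model_repo/gpt-neo")` returns an empty list when the directory is ready to upload.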

Step 3 - Deploying Triton Inference containers in GKE

Once the model is ready, the next step is to deploy NVIDIA Triton and point it at the model repository. First, go to Amazon S3 (you can also use Azure or GCP buckets) and create a bucket for your models.


In the bucket, create a folder such as model_repo (the name used in the tritonserver command below) and upload the model folder from Step 2 into it.
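If you prefer to script the upload rather than use the S3 console, a minimal sketch with boto3 could look like the following. It assumes boto3 is installed and AWS credentials are already configured. The function names are hypothetical, the bucket name is a placeholder you supply, and the `model_repo` prefix is taken from the tritonserver command in this post.

```python
from pathlib import Path


def plan_uploads(local_repo, prefix="model_repo"):
    """Map every file under local_repo to the S3 key it should be stored at."""
    local_repo = Path(local_repo)
    return {
        str(path): f"{prefix}/{path.relative_to(local_repo).as_posix()}"
        for path in sorted(local_repo.rglob("*"))
        if path.is_file()
    }


def upload_repo(local_repo, bucket, prefix="model_repo"):
    """Upload the local model repository to s3://<bucket>/<prefix>/."""
    import boto3  # third-party; imported lazily so plan_uploads works without it

    s3 = boto3.client("s3")
    for filename, key in plan_uploads(local_repo, prefix).items():
        s3.upload_file(filename, bucket, key)
```

Calling `plan_uploads` on its own is a cheap way to preview the keys before anything is sent to S3.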

Create a Kubernetes cluster, then apply the following files with kubectl. You should pass the AWS credentials that Triton needs to access your bucket.

kubectl apply -f triton-deploy.yaml


apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-deployment
  labels:
    app: triton-server
spec:
  selector:
    matchLabels:
      app: triton-server
  replicas: 1
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
      - name: serving
        # Placeholder: set this to the Triton server image you deploy
        image: <Triton server image>
        env:
        - name: AWS_ACCESS_KEY_ID
          value: <Sample ID>
        - name: AWS_SECRET_ACCESS_KEY
          value: <Sample Key>
        - name: AWS_DEFAULT_REGION
          value: us-east-1
        ports:
        - name: grpc
          containerPort: 8001
        - name: http
          containerPort: 8000
        - name: metrics
          containerPort: 8002
        command: [ "tritonserver", "--model-store=s3://<bucketname>/model_repo", "--model-control-mode=explicit", "--exit-on-error=false" ]

Once you have deployed the container, deploy the service:

kubectl apply -f triton-service.yaml


apiVersion: v1
kind: Service
metadata:
  name: triton-server
  labels:
    app: triton-server
spec:
  selector:
    app: triton-server
  ports:
    - protocol: TCP
      port: 80
      name: http
      targetPort: 8000
    - protocol: TCP
      port: 443
      name: https
      targetPort: 8000
    - protocol: TCP
      port: 8001
      name: grpc
      targetPort: 8001
    - protocol: TCP
      port: 8002
      name: metrics
      targetPort: 8002
  type: LoadBalancer

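Once this is deployed, you can find the external IP of the service from the cluster and call the inference endpoint. Because Triton was started with --model-control-mode=explicit, the model must be loaded first (see Step 4).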

curl --location --request POST 'http://<<IP-Address>>/v2/models/gpt-neo/infer' \
--header 'Content-Type: application/json' \
--data-raw '{
    "inputs": [
        {
            "name": "prompt",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["I really enjoyed this"]
        }
    ]
}'

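The same request can be sent from Python. This is a minimal sketch using only the standard library; it assumes the model name gpt-neo and the tensor names prompt and generated_text from the configuration in this post, and the function names are hypothetical. The request and response shapes follow the v2 HTTP inference protocol that Triton implements.

```python
import json
import urllib.request


def build_infer_request(prompt):
    """Build the JSON body for a single-prompt inference request."""
    return {
        "inputs": [
            {"name": "prompt", "shape": [1], "datatype": "BYTES", "data": [prompt]}
        ]
    }


def extract_generated_text(response):
    """Pull the generated string out of a parsed inference response."""
    for output in response["outputs"]:
        if output["name"] == "generated_text":
            return output["data"][0]
    raise KeyError("generated_text")


def infer(base_url, prompt, model_name="gpt-neo", timeout=120):
    """POST the prompt to the model's infer endpoint and return the text."""
    url = f"{base_url.rstrip('/')}/v2/models/{model_name}/infer"
    body = json.dumps(build_infer_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return extract_generated_text(json.load(reply))
```

Usage would be along the lines of `infer("http://<<IP-Address>>", "I really enjoyed this")`, with the service IP filled in.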
Step 4 - Efficient utilization of GPUs

To make sure that the GPU is used efficiently, Triton provides APIs to load and unload models. These are available because the deployment above starts Triton with --model-control-mode=explicit. You can use these controls:

Call this POST API to load the model:

/v2/repository/models/<Model Name>/load

Call this POST API to unload the model:

/v2/repository/models/<Model Name>/unload
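As an illustration, the two endpoints can be called from Python with the standard library. This is a sketch under the assumption that the HTTP port of the service is reachable at `base_url`; the function names are hypothetical.

```python
import urllib.request


def repository_url(base_url, model_name, action):
    """Build the load or unload URL for a model."""
    if action not in ("load", "unload"):
        raise ValueError(f"unsupported action: {action}")
    return f"{base_url.rstrip('/')}/v2/repository/models/{model_name}/{action}"


def set_model_state(base_url, model_name, action, timeout=300):
    """POST to the load or unload endpoint and return the HTTP status code."""
    request = urllib.request.Request(
        repository_url(base_url, model_name, action),
        data=b"",  # an empty body is enough for a plain load or unload
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError on a non-success status
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return reply.status
```

For example, `set_model_state("http://<<IP-Address>>", "gpt-neo", "load")` would load the model before the first inference call. The generous default timeout is a guess, since loading a large model can take a while.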

Loading and unloading models on demand lets multiple models share the same GPU memory, which helps optimize memory usage and improve performance.

When running machine learning models on GPUs, memory is often the limiting factor. A GPU has a fixed amount of memory, and running multiple models in parallel can quickly exhaust it. This is where on-demand loading and unloading helps.

By keeping only the active model in memory, multiple models can take turns on the same GPU. When a model is not being used, it is unloaded and its memory is freed for other models, which makes more efficient use of GPU resources.
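A minimal sketch of the "keep only the active model loaded" idea is shown below. It is not taken from Triton; the class name `ActiveModelSwapper` is hypothetical, and the `load` and `unload` arguments are assumed to be functions that POST to the load and unload endpoints listed above.

```python
class ActiveModelSwapper:
    """Keep at most one model loaded, swapping when a different one is requested."""

    def __init__(self, load, unload):
        self._load = load      # callable taking a model name
        self._unload = unload  # callable taking a model name
        self.active = None

    def ensure_active(self, model_name):
        """Make model_name the loaded model, unloading the previous one first."""
        if model_name == self.active:
            return  # already loaded, nothing to do
        if self.active is not None:
            # Free the GPU memory held by the previous model
            self._unload(self.active)
            self.active = None
        self._load(model_name)
        self.active = model_name
```

Calling `ensure_active(name)` before each inference request guarantees the target model is loaded, at the cost of a reload whenever traffic switches between models. If several models fit in memory at once, a variant that keeps the most recently used few loaded would reduce that swapping.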
