
Building Real-Time Streaming Apps with NVIDIA Triton Inference and SSE over HTTP

May 30, 2024
Nilesh Agarwal
Cofounder & CTO

Introduction

In this blog, I'll show you how to build real-time streaming AI applications by integrating SSE with NVIDIA Triton Inference Server, using a Python backend and the Zephyr model.

In the fast-paced world of machine learning and AI, the ability to process and infer data in real-time can significantly enhance the user experience and overall effectiveness of your applications. Imagine your app providing instant feedback, live updates, and seamless interaction, all powered by efficient real-time data streaming.

This blog can help you if you are looking to build an inference solution that handles real-time data processing and inference without compromising on performance. Here is a demo of how it works.

Let’s dive in and build a similar application yourself.

Basics

What is SSE?

Server-Sent Events (SSE) is a standard that describes how servers can initiate data transmission towards browser clients once an initial client connection has been established. It’s particularly useful for creating a one-way communication channel from the server to the client, such as for real-time notifications, live updates, and streaming data.
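To make this concrete, here is a minimal sketch of what an SSE stream looks like on the wire, using curl against a hypothetical endpoint (the URL and payloads are illustrative):

curl -N https://example.com/events
# The server responds with Content-Type: text/event-stream, keeps the
# connection open, and pushes each event as a "data:" line when it is ready:
#
# data: {"message": "first update"}
#
# data: {"message": "second update"}

The -N flag disables curl’s output buffering so events are printed as they arrive.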

What is NVIDIA Triton?

NVIDIA Triton Inference Server provides a robust, scalable serving system for deploying machine learning models from any framework (TensorFlow, PyTorch, ONNX Runtime, etc.) over GPUs and CPUs in production environments. Triton simplifies deploying multiple models from different frameworks, optimizing GPU utilization and integrating with Kubernetes.

Requirements

To get started, you'll need:

1. NVIDIA Triton Inference Server Docker image, version 23.11 or later

2. pip, to install autoawq==0.1.8 and torch==2.1.2

3. The model.py and config.pbtxt code for Zephyr inference on Triton
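For reference, Triton discovers models through a model repository with a fixed directory layout. Here is a minimal sketch for a Python-backend model; the zephyr folder name and version directory are illustrative, and the actual files come from the repo you clone:

model_repo/
└── zephyr/
    ├── config.pbtxt        # Triton model configuration
    └── 1/                  # version directory
        └── model.py        # Python backend inference code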

Getting Started

You can find the code for this integration on GitHub here.

How to Use the Model

1. Create a folder named model_repo
2. Clone the repo inside model_repo
3. Run the Docker command below (make sure you have Docker and the NVIDIA Container Toolkit set up)
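Here is a sketch of that command, assuming the model repository sits at ./model_repo and you are using the 23.11 image; the container name triton is an assumption referenced in the later steps:

docker run --gpus all -d --name triton \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $(pwd)/model_repo:/models \
  nvcr.io/nvidia/tritonserver:23.11-py3 \
  tritonserver --model-repository=/models --model-control-mode=explicit

The --model-control-mode=explicit flag starts the server without loading any models, which is why step 5 loads the model explicitly.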

If you don’t have GPUs, run the same command without the --gpus flag:
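docker run -d --name triton \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $(pwd)/model_repo:/models \
  nvcr.io/nvidia/tritonserver:23.11-py3 \
  tritonserver --model-repository=/models --model-control-mode=explicit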


4. Install the dependencies inside the container with docker exec
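A sketch, using the assumed container name from the Docker command above and the packages listed in the requirements:

docker exec triton pip install autoawq==0.1.8 torch==2.1.2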

5. Load the model explicitly
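Because the server was started with explicit model control, you load the model through Triton's model repository API. The model name zephyr below is illustrative; use the name of the model folder in your repository:

curl -X POST localhost:8000/v2/repository/models/zephyr/load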

6. Run test.sh to call inference. You can also stream results directly over SSE, as shown below.
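If you want to call the server directly instead of using the script, Triton 23.11 exposes a generate_stream endpoint that returns results as SSE. A sketch, assuming the model is named zephyr; the exact request fields (text_input, max_tokens) are assumptions and depend on the inputs defined in the model's config.pbtxt:

curl -N -X POST localhost:8000/v2/models/zephyr/generate_stream \
  -d '{"text_input": "What is machine learning?", "max_tokens": 128}'
# Tokens stream back as SSE events, one data: line per chunk:
# data: {"model_name":"zephyr","model_version":"1","text_output":"Machine"}
# data: {"model_name":"zephyr","model_version":"1","text_output":" learning"}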

Key Advantages of Using SSE

1. Simplicity: SSE is straightforward to implement both on the server and the client side. Unlike WebSockets, which require a special protocol and server setup, SSE works over standard HTTP and can be handled by traditional web servers without any special configuration.

2. Efficient Real-time Communication: SSE is designed for scenarios where the server needs to push data to the client. It's very efficient for use cases like live notifications, feeds, and real-time analytics dashboards where updates are frequent and originate from the server.

3. Built-in Reconnection: SSE has automatic reconnection support. If the connection between the client and server is lost, the client will automatically attempt to reestablish the connection after a timeout. This makes it resilient and ensures continuous data flow without manual intervention.

4. Low Overhead: Compared to WebSockets, SSE sends data over a traditional HTTP connection and does not require a handshake after the initial connection setup. This reduces overhead and complexity, particularly in scenarios where only unidirectional communication is needed.

5. HTTP Standards Compliance: SSE operates over standard HTTP, making it compatible with existing web infrastructure like proxies, firewalls, and load balancers. This compliance also simplifies debugging and integration with other web technologies.

6. Native Browser Support: Most modern web browsers natively support SSE, making it easy to implement without requiring additional libraries or plugins for the client side.

7. Scalability: While SSE connections maintain an open line from the server to each client, modern web servers can handle many such connections simultaneously. This allows SSE to scale well for a large number of clients, especially when used in conjunction with a robust backend and efficient message distribution systems.

SSE is particularly advantageous in scenarios where the communication is predominantly from server to client.


Conclusion

Integrating Server-Sent Events (SSE) with NVIDIA Triton empowers your AI applications with real-time data streaming and efficient inference capabilities. Embrace this approach to deliver responsive, scalable solutions that keep your users engaged and your systems performing at their best.
