What Is RIVA Speech Skills Container In NVIDIA GPU Cloud?

April 2, 2025

Introduction:

NVIDIA Graphics Processing Units (GPUs) are computing platforms that transform big data into intelligent information. They are available on demand in the cloud, where NVIDIA GPU Cloud (NGC) provides GPU-optimised containers such as TensorRT and the RIVA Speech Skills Container. This article focuses on the RIVA Speech Skills Container in NVIDIA GPU Cloud.

What is RIVA?

A RIVA deployment is composed of one or more NVIDIA TAO Toolkit models, together with pre-processing and post-processing components, so that full pipelines can be deployed. To run on the RIVA Server, TAO Toolkit models must be exported to an efficient inference engine. RIVA model repository generation refers to the process of generating these inference engines and gathering the required artifacts, such as models, configurations, files, and user settings.

RIVA model repository generation has three phases:

The development phase covers creating and developing models using NeMo or TAO Toolkit.
The build phase gathers the artifacts needed for deployment into an intermediate file called the RIVA Model Intermediate Representation (RMIR).
The deployment phase converts the RMIR file into a RIVA model repository, and converts the neural networks in NeMo or TAO Toolkit format into TensorRT engines or other formats compatible with Triton Inference Server.
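The build and deployment phases can be sketched as two command lines. The snippet below is a minimal illustration that only assembles the commands and does not execute them. It assumes the ServiceMaker tools riva-build and riva-deploy, with the argument order as best understood from NVIDIA's documentation, which should be checked against the release in use. Every file path, the encryption key, and the helper function names are hypothetical placeholders.

```python
from typing import List


def build_command(service: str, rmir_path: str, model_path: str, key: str) -> List[str]:
    """Build phase: gather an exported model and its settings into an RMIR file.

    Assumed layout: riva-build <service> <output.rmir>:<key> <input model>:<key>
    """
    return ["riva-build", service, f"{rmir_path}:{key}", f"{model_path}:{key}"]


def deploy_command(rmir_path: str, key: str, repository_dir: str) -> List[str]:
    """Deployment phase: convert the RMIR file into a model repository.

    Assumed layout: riva-deploy <input.rmir>:<key> <repository directory>
    """
    return ["riva-deploy", f"{rmir_path}:{key}", repository_dir]


if __name__ == "__main__":
    # Hypothetical paths and key, for illustration only.
    build = build_command(
        service="speech_recognition",
        rmir_path="/servicemaker-dev/asr.rmir",
        model_path="/servicemaker-dev/asr.riva",
        key="example_key",
    )
    deploy = deploy_command("/servicemaker-dev/asr.rmir", "example_key", "/data/models")
    print(" ".join(build))
    print(" ".join(deploy))
```

Keeping the commands as argument lists makes them easy to pass to subprocess.run once the paths and key have been replaced with real values.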

NVIDIA RIVA:
It is a GPU-accelerated Software Development Kit (SDK) to build speech Artificial Intelligence (AI) applications customised for clients’ use cases and deliver real-time performance.

NVIDIA RIVA applications:
Automatic speech recognition: RIVA has pre-trained models in NVIDIA NGC that can be fine-tuned with TAO Toolkit on a custom data set, which accelerates the development of domain-specific models by 10X. These TAO models can be optimized, exported, and deployed as a speech service in the cloud using a single command. RIVA services are offered as gRPC-based microservices for low-latency streaming and also support high-throughput offline use cases.

RIVA is containerised and offers world-class automatic speech recognition for any domain or deployment platform. It handles thousands of audio streams as input and returns streaming transcripts with low latency. RIVA pipelines can be tuned for various languages, accents, domains, vocabulary, and context. It has a GPU-optimised end-to-end pipeline that includes feature extraction, acoustic models, language models, a customizable decoder, and punctuation.

Key features:

Automatic punctuation
Multiple model architectures
Inverse text normalisation
Word-level timestamps
Optimized for NVIDIA T4, A100, and V100 GPUs
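Low-latency streaming recognition generally works by sending audio in small pieces as it arrives, rather than as one complete file. The sketch below illustrates only that client-side chunking step on raw mono PCM bytes. It is not the RIVA client API; the function name, the 100 ms chunk length, and the 16 kHz, 16-bit audio format are assumptions chosen for illustration.

```python
from typing import Iterator


def pcm_chunks(pcm: bytes, sample_rate: int, sample_width: int, chunk_ms: int) -> Iterator[bytes]:
    """Yield consecutive chunks of mono PCM audio, each covering chunk_ms milliseconds.

    The final chunk may be shorter when the audio length is not a multiple
    of the chunk size.
    """
    samples_per_chunk = sample_rate * chunk_ms // 1000
    bytes_per_chunk = samples_per_chunk * sample_width
    if bytes_per_chunk <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), bytes_per_chunk):
        yield pcm[start:start + bytes_per_chunk]


if __name__ == "__main__":
    # One second of 16 kHz, 16-bit silence (assumed format).
    audio = b"\x00\x00" * 16000
    chunks = list(pcm_chunks(audio, sample_rate=16000, sample_width=2, chunk_ms=100))
    print(len(chunks), "chunks of", len(chunks[0]), "bytes")
```

In a real streaming client, each chunk would be sent to the service as soon as it is available, and partial transcripts would be read back as they are produced.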

Text to Speech: RIVA provides human-like neural text-to-speech voices built on spectrogram generation and vocoder models. It takes raw text as input and returns audio either as chunks in streaming mode or as a whole at the end of the sequence in batch mode.

Key features:

Expressive neural voices through SOTA models
Fine-grained control over voice pitch and expressivity
6X higher inference performance
Support for NVIDIA T4, A100, and V100 GPUs

NVIDIA GPU Cloud RIVA Speech Skills Container:
The NVIDIA RIVA Speech Skills container is a Docker image that contains a toolset for production-grade conversational AI inference. Its API server exposes a simple API for speech recognition, speech synthesis, and a range of natural language processing inferences.

Procedure to run NVIDIA GPU Cloud RIVA Speech Skills Container:

Select the Tags tab and find the container image release that you want to run.
In the Pull Tag column, select the icon to copy the docker pull command.
Open a command prompt and paste the pull command to start pulling the container image. Make sure that the pull completes successfully.
Finally, run the container image.
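The pull and run steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the image path nvcr.io/nvidia/riva/riva-speech is assumed rather than taken from this article, the tag is whatever was copied from the Tags tab, the helper names are hypothetical, and the docker run options shown are a generic GPU-enabled invocation, not the officially documented start-up procedure.

```python
import subprocess
from typing import List

# Assumed image path; confirm it against the pull command copied from NGC.
ASSUMED_IMAGE = "nvcr.io/nvidia/riva/riva-speech"


def pull_command(tag: str, image: str = ASSUMED_IMAGE) -> List[str]:
    """Return the docker pull command for the chosen release tag."""
    return ["docker", "pull", f"{image}:{tag}"]


def run_command(tag: str, image: str = ASSUMED_IMAGE) -> List[str]:
    """Return a generic docker run command that exposes all GPUs to the container."""
    return ["docker", "run", "--rm", "-it", "--gpus", "all", f"{image}:{tag}"]


def pull_then_run(tag: str) -> None:
    """Pull the image, then run it only if the pull completed successfully."""
    # check=True raises CalledProcessError when docker exits with a non-zero status,
    # so the run step is never reached after a failed pull.
    subprocess.run(pull_command(tag), check=True)
    subprocess.run(run_command(tag), check=True)


if __name__ == "__main__":
    # "<tag>" stands for the release tag chosen on the Tags tab.
    print(" ".join(pull_command("<tag>")))
    print(" ".join(run_command("<tag>")))
```

Running the script only prints the two commands; calling pull_then_run requires Docker with NVIDIA GPU support and a valid NGC login.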

NVIDIA RIVA Speech Skills highlights:

Simple fine-tuning with NVIDIA TAO Toolkit
Streaming as well as batch speech recognition
Pretrained models
Helm-managed cloud deployment
Natural Language Processing models

Benefits:

It offers state-of-the-art AI innovations spanning model architectures, training techniques, inference optimisations, and deployment solutions.
It is flexible and fully customisable at every step, from modifying model architectures and fine-tuning models to customising pipelines and deploying on any platform.
Performance optimisations across the entire stack of models.

The NVIDIA RIVA Automatic Speech Recognition service provides precise real-time transcriptions. It accepts input from a microphone or from a .wav file on the device. The sample duration is limited to 30 seconds. RIVA skills can be used once the following prerequisites are met:

Access and log in to NVIDIA NGC
Access to an NVIDIA Volta, NVIDIA Turing, or NVIDIA Ampere architecture-based A100 GPU
Docker installed with support for NVIDIA GPUs
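Because the sample duration mentioned above is limited to 30 seconds, it can help to check a .wav file before submitting it. The following sketch uses only Python's standard wave module, which reads uncompressed PCM files. The function names and the helper that writes a silent test file are hypothetical and exist only to make the example self-contained.

```python
import os
import tempfile
import wave

MAX_SECONDS = 30.0  # sample duration limit stated in this article


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def within_limit(path: str, limit: float = MAX_SECONDS) -> bool:
    """Return True when the file is no longer than the limit."""
    return wav_duration_seconds(path) <= limit


def write_silence(path: str, seconds: int, sample_rate: int = 16000) -> None:
    """Write a mono, 16-bit silent .wav file (illustration helper only)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * (seconds * sample_rate))


if __name__ == "__main__":
    sample_path = os.path.join(tempfile.mkdtemp(), "sample.wav")
    write_silence(sample_path, seconds=5)
    print(wav_duration_seconds(sample_path), within_limit(sample_path))
```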

RIVA Skills models available for deployment are: