How AI4Bharat Is Building an Open-Source Language AI for Indian Languages

August 4, 2023

AI4Bharat is a center at IIT Madras, India, whose mission is to bring AI technologies for Indian languages to parity with English through open-source work.

AI technology has been dominated by the towering presence of English, leaving many native languages in the shadows. AI4Bharat’s mission encompasses building state-of-the-art, open, foundational AI models across various tasks for all 22 scheduled Indian languages.

You might have questioned whether AI could truly comprehend the complexities of Indian languages. The answer is a big yes. From language understanding to translation, from speech recognition to text-to-speech models, AI4Bharat leaves no stone unturned in ensuring that Indian languages are well understood and articulated by the very technology that empowers them.

AI4Bharat collaborates with partners to design and deploy reference applications, showcasing the immense potential of open AI models. Whether it’s video subtitling for educational content or aiding the recognition of sign languages from around the world, they enable an innovation ecosystem that empowers researchers, startups, and the government to unlock the true magic of Indian language AI.

The Language Models

AI4Bharat is at the forefront of driving multilingual excellence with its revolutionary language models. These models are designed for user-friendliness, empowering users to harness AI’s potential for multilingual applications. They cater to a diverse range of tasks and languages, enabling seamless communication and understanding.

The Indic Speech-to-Text Conformer

A 30M-parameter ASR model with a Conformer-based architecture, built to support real-time transcription for Indian languages. Trained on the ULCA, Kathbath, Shrutilipi, and MUCS datasets, it can be deployed on Android devices through a WebSocket interface.
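Real-time transcription over a streaming connection usually means sending audio in small fixed-duration pieces rather than as one file. The sketch below shows only that chunking step. The function name `pcm_chunks`, the 16 kHz sample rate, the 16-bit mono format, and the 100 ms chunk size are illustrative assumptions, not details of AI4Bharat's deployment.

```python
from typing import Iterator


def pcm_chunks(
    pcm: bytes,
    sample_rate: int = 16_000,  # assumed sample rate in Hz
    sample_width: int = 2,      # assumed 16-bit mono samples (2 bytes each)
    chunk_ms: int = 100,        # assumed chunk duration in milliseconds
) -> Iterator[bytes]:
    """Yield consecutive fixed-duration slices of raw PCM audio."""
    bytes_per_chunk = sample_rate * sample_width * chunk_ms // 1000
    for start in range(0, len(pcm), bytes_per_chunk):
        # The final slice may be shorter than the others.
        yield pcm[start:start + bytes_per_chunk]


# One second of silence at the assumed format is 32,000 bytes.
one_second = bytes(32_000)
chunks = list(pcm_chunks(one_second))
print(len(chunks), len(chunks[0]))
```

Each yielded chunk could then be passed to whatever streaming client the application uses; that part is left out here because it depends on the server's protocol.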

The Indic Transliterate

Simplifies script conversion for 21 Indic languages. It is a transformer-based multilingual transliteration model with approximately 11M parameters that converts between Roman script and native scripts. It is trained on the extensive Aksharantar dataset, which features 26 million word pairs across 20 Indic languages.

The Indic Natural Language Generation

Empowers narrative creation across 11 Indian languages and English. It is a multilingual, sequence-to-sequence pre-trained model, built upon the mBART architecture. The model's versatility allows you to develop natural language generation applications for Indian languages through fine-tuning with supervised training data, encompassing tasks such as machine translation, summarization, and question generation.

The Indic Text-to-Speech

Synthesizes expressive voices, enhancing the TTS experience. It is focused on developing multi-speaker text-to-speech models for Indic languages. It involves two models: an acoustic model that generates mel-spectrograms from text, and a vocoder that synthesizes the audio waveform from those spectrograms. It improves interaction with Indic languages, enhancing accessibility and user experience in voice assistants, audiobooks, language learning, and assistive technologies, fostering inclusivity and a richer digital ecosystem.

The Indic BERT

Defies complexity myths, delivering top-notch performance. It is a multilingual ALBERT model trained on large corpora covering 12 major Indian languages: Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. Despite having fewer parameters than models like mBERT and XLM-R, Indic BERT achieves strong performance across various tasks, making it a powerful tool for natural language processing in Indian languages and advancing language-based applications.
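One reason an ALBERT-style model can be smaller is its factorized embedding parameterization: instead of one vocabulary-by-hidden matrix, it uses a small vocabulary-by-embedding matrix followed by an embedding-by-hidden projection (ALBERT also shares parameters across layers, which is not shown here). The arithmetic below is a sketch of that saving. The vocabulary size, hidden size, and embedding size are illustrative assumptions, not Indic BERT's published configuration.

```python
from typing import Optional


def embedding_params(
    vocab_size: int, hidden_size: int, embedding_size: Optional[int] = None
) -> int:
    """Count embedding parameters, with or without ALBERT-style factorization."""
    if embedding_size is None:
        # BERT-style: one V x H embedding matrix.
        return vocab_size * hidden_size
    # ALBERT-style: a V x E matrix plus an E x H projection.
    return vocab_size * embedding_size + embedding_size * hidden_size


# Illustrative sizes only (assumptions, not a real model's config).
VOCAB, HIDDEN, EMBED = 200_000, 768, 128

full = embedding_params(VOCAB, HIDDEN)
factorized = embedding_params(VOCAB, HIDDEN, EMBED)
print(f"unfactorized: {full:,}")        # 153,600,000
print(f"factorized:   {factorized:,}")  # 25,698,304
```

With these assumed sizes the embedding table shrinks roughly sixfold, which matters most for multilingual models because their vocabularies are large.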

The Indic Named Entity Recognition

Identifies entities in sentences for 11 languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. It is fine-tuned on millions of sentences and evaluated against human-annotated test sets and publicly available Indian NER datasets, with the aim of dependable and precise entity recognition.
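Token-level NER models commonly emit one tag per token in the BIO scheme (B- begins an entity, I- continues it, O is outside), and a small decoding step groups those tags into entity spans. The sketch below shows that decoding step only. The function name `decode_bio`, the example sentence, and the PER/ORG tag names are assumptions for illustration, not the output format of AI4Bharat's model.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Close the open entity, if any.
        if current and current_type is not None:
            entities.append((" ".join(current), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            flush()
            current, current_type = [], None
    flush()
    return entities


tokens = ["Mitesh", "Khapra", "teaches", "at", "IIT", "Madras"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"]
print(decode_bio(tokens, tags))
# [('Mitesh Khapra', 'PER'), ('IIT Madras', 'ORG')]
```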

The Indic Speech2Speech (Experimental)

Bridges language gaps by translating speech in one language into speech in another. Its interface chains automatic speech recognition (ASR), neural machine translation (NMT), and text-to-speech (TTS) to perform speech-to-speech conversion.
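A cascade of ASR, NMT, and TTS can be sketched as three functions composed in order. The code below is a minimal illustration of that structure, not AI4Bharat's implementation: the `SpeechToSpeech` class and the three `fake_*` stages are hypothetical stand-ins so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechToSpeech:
    """Cascade: source audio -> transcript -> translation -> target audio."""

    asr: Callable[[bytes], str]  # speech recognition stage
    nmt: Callable[[str], str]    # machine translation stage
    tts: Callable[[str], bytes]  # speech synthesis stage

    def __call__(self, audio: bytes) -> bytes:
        transcript = self.asr(audio)
        translation = self.nmt(transcript)
        return self.tts(translation)


# Hypothetical stub stages; real ones would wrap trained models.
def fake_asr(audio: bytes) -> str:
    return "hello"


def fake_nmt(text: str) -> str:
    return {"hello": "नमस्ते"}.get(text, text)


def fake_tts(text: str) -> bytes:
    # Stand-in for synthesized audio: just the UTF-8 bytes of the text.
    return text.encode("utf-8")


pipeline = SpeechToSpeech(asr=fake_asr, nmt=fake_nmt, tts=fake_tts)
result = pipeline(b"\x00\x01")
print(result.decode("utf-8"))
```

Keeping each stage behind a plain callable means any one of them can be swapped without touching the others, though errors from an early stage still propagate through the later ones.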

The Indic Translation v2

Enables translation between English and 12 major Indian languages, using a Transformer-based architecture.

The Indic Speech-to-Text with Numbers

Recognizes and parses Indic speech accurately, even when it contains numbers, using Conformer-based ASR models.

The Indic Speech-to-Text Whisperer

Provides precise transcriptions of spoken Indian languages for diverse applications, using an ASR model based on the Whisper architecture.

AI4Bharat’s language models mark a significant leap in Indian language AI technology, enabling users to embrace linguistic diversity and explore limitless possibilities.

Areas of Impact

AI4Bharat’s strategic targets encompass four crucial aspects - data curation and creation for diverse tasks and 22 scheduled Indian languages; state-of-the-art AI model development across all 22 regional languages; design and deployment of reference applications in collaboration with partners; and fostering innovation through educational support for researchers, startups, and the government in Indian-language AI technology.

In their pursuit of advancing Indian language AI technology, they focus on several crucial areas, each geared towards empowering linguistic diversity and inclusivity. Through open-source initiatives, they curate and create datasets and models for neural machine translation, supporting translation between English and 12 Indic languages.

Additionally, they address transliteration challenges with benchmarks, applications, and models bridging Roman and native scripts for over 20 Indic languages. With a strong emphasis on accessibility, they offer open-source models for speech recognition in 9 Indian languages and text-to-speech synthesis for 13 languages, supporting both female and male speakers.

Their language understanding initiatives provide open-source language models, benchmarks, and entity recognizers for 10 Indian languages. They are working on sign languages as well, offering datasets and models for sign recognition in various sign languages worldwide.

Through tools like Shoonya and Chitralekha, AI4Bharat provides AI-assisted language work and video subtitling, prioritizing educational and media content. Lastly, Anuvaad, their open-source tool, facilitates document-level translation with NMT and transliteration support. These areas of focus align with their core mission of fostering innovation and enabling an inclusive innovation ecosystem for Indian languages.

AI4Bharat Researchers Raising Seed Funding

India's AI4Bharat secures $12 million in seed funding, backed by venture capital firms Peak XV and Lightspeed Venture Partners. This substantial investment reflects the soaring interest in generative AI, spurred by the success of OpenAI's ChatGPT at human-like conversation.

Their recent mobile assistant launch breaks language barriers, offering government scheme information in multiple languages, and promoting inclusivity. Peak XV's inaugural investment post-rebranding reinforces AI4Bharat's ambitious pursuit of a brighter, more inclusive AI future, shaping the landscape of innovation.

Pioneering Open-Source AI for India’s Future

AI4Bharat stands at the forefront of open-source innovation, wielding the power of AI to conquer India's pressing socio-economic and environmental challenges. Guided by visionaries Prof. Pratyush Kumar and Prof. Mitesh Khapra, AI4Bharat delves deep into language technology, empowering machines to understand and engage with human texts and speech. A trailblazer in the field, they have released the largest corpus of Indian language texts, amplifying the potential impact of AI in the lives of billions who communicate predominantly in their native languages.

Work at this scale requires high-end GPUs, and AI4Bharat has partnered with E2E Networks. E2E offers recent NVIDIA GPUs such as the H100 and A100, which makes it a good fit.

