India, home to 1.4 billion people who speak 22 constitutionally recognized languages and numerous regional dialects, has one of the most diverse linguistic landscapes in the world. Its digital footprint is expanding rapidly, with 700 million internet users and 450 million Indians accessing online platforms daily through smartphones. As connectivity becomes more affordable and reliable, adoption is expected to keep growing quickly.
This post revisits the topic through the perspectives Kesava Reddy shared in his article for Express Computer.
The Language Barrier: English Dominance in Online Content
However, most online content and services are in English, which prevents many Indians from fully benefiting from the digital world.
‘It will significantly expand the opportunities available to individuals who do not speak English. In addition to improving the accessibility of services in sectors including finance, governance, healthcare, and agriculture, it will facilitate revenue generation.’ - Kesava Reddy
Large Language Models (LLMs) emerge as a potential solution, representing a new frontier in artificial intelligence with remarkable capabilities across various domains.
Unlocking Linguistic Potential: The Rise of Open-Source LLMs in 2023
In 2023, a notable development in the field of LLMs was the emergence of open-source models. Released by various research groups and organizations, some of these models have been reported to match or surpass proprietary LLMs on certain tasks, and innovators can fine-tune them to their specific needs.
‘Due to their adaptability, these LLMs offer a great starting point for building India-specific LLMs. Many of these LLMs have already been trained on Indian languages.’ - Kesava Reddy
According to the article, some of these models, such as BLOOM and Falcon 180B, include Indian languages such as Hindi, Bengali, Tamil, Telugu, and Urdu in their training data.
Navigating Complexity: LLM Challenges in Indian Languages
However, these LLMs still perform better in English than in Indian languages, a gap the article attributes to the complex sentence structures and contextual subtleties of the latter.
‘Unlike English, Indian languages are layered with complex sentence structures and contextual subtleties, requiring LLM architectures that are not only technically robust but also culturally aware.’ - Kesava Reddy
To truly harness LLM capabilities, there's a need for models specifically designed for Indian languages.
Innovative Initiatives: Overcoming Challenges in LLM Development
Several initiatives are underway to address this gap. Bhashini, a program of the Ministry of Electronics and Information Technology (MeitY), has introduced Bhasha Daan, a crowdsourcing platform that is building an open repository of Indian-language data.
‘Building these models requires addressing some key challenges, such as creating Indian language datasets and training models in a way that they work with nuances of Indian languages.’ - Kesava Reddy
Bhasha Daan's initiatives, Bolo India, Suno India, Likho India, and Dekho India, encourage users to contribute sentences, validate transcriptions, and type out what they hear in audio clips.
Startup Initiatives: Adapting AI Models for Indian Language Nuances
Startups are also adapting AI model architectures and training processes to handle the intricacies of Indian languages.
For instance, the OpenHathi model by Sarvam AI, built on Meta AI's Llama2-7B architecture, focuses on Hindi and Hinglish.
‘The model was trained on Hindi, English, and Hinglish, and the dataset used was a subsample of 100K documents from the Sangraha corpus.’ - Kesava Reddy
Krutrim LLM by Krutrim Si Designs, meanwhile, is said to understand and respond in 20 Indian languages.
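To make the idea of a 100K-document subsample concrete, here is a minimal Python sketch of one common way to draw a fixed-size random sample from a corpus too large to hold in memory: reservoir sampling. This is an illustration only, not a description of how the OpenHathi team built its subsample. The function names, the stand-in corpus, and the scaled-down sizes are all assumptions made for the example.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream.

    The stream is read once and only k items are kept in memory,
    so the full corpus never has to be loaded.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # both endpoints are inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


def toy_corpus(n: int) -> Iterator[str]:
    """Hypothetical stand-in for a stream of documents."""
    for i in range(n):
        yield f"doc-{i}"


# Scaled-down demo: sample 1,000 documents out of 10,000.
sample = reservoir_sample(toy_corpus(10_000), k=1_000)
print(len(sample))
```

For a real corpus, the same function could be pointed at any iterator of documents, with k set to 100,000. A fixed seed makes the subsample repeatable across runs.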
Overcoming Hurdles: Advanced GPUs and Future Prospects
Beyond limited datasets, access to compute has been another hurdle for startups innovating in AI architecture and training processes.
‘With persistent efforts from cloud service providers, even cutting-edge GPU clusters like HGX H100 are now available on instant access.’ - Kesava Reddy
The availability of advanced GPUs, which are critical for accelerating LLM training, has improved since the supply shortages that were prevalent in mid-2023.
Future Landscape: India-Specific LLMs Shaping Digital Inclusivity
Looking ahead to 2024, these efforts by startups and innovators are expected to yield India-specific LLMs, addressing local challenges. These models are set to enhance inclusivity, reduce the digital divide, and benefit regional economies. Startups can leverage these LLMs to create applications with deep vernacular language support in education, health, finance, and governance, contributing to the vision of a digital and inclusive society powered by AI in India.