India is a land of linguistic diversity, with 22 official languages and hundreds of regional and local languages spoken across the country. However, the Indian education system has not been able to fully leverage this rich resource of multilingualism, as most of the curriculum and instruction is delivered in either English (or, sometimes, Hindi), the dominant languages in the country.
Having grown up, educated and then worked from Delhi / NCR, I have been able to avail of the benefits of both languages. Today, as we step into 2024, and after having traveled across the country, I feel that there’s a barrier for many students who are not proficient particularly in English, especially those from rural or marginalized communities, which hinders their learning outcomes and opportunities.
Fortunately, there is a new wave of innovation in the field of artificial intelligence (AI) that can potentially transform the Indian education landscape: multilingual large language models - a subset of LLMs that can handle multiple languages, either by training on a single model with data from different languages, or by using multiple models that can communicate and transfer knowledge across languages. Multilingual LLMs can enable cross-lingual understanding and generation, which can be very useful for multilingual societies like India.
But even before we discuss the multilingual LLMs from India, let’s dissect the various facets that contribute to the complexities of developing LLMs in India.
The Challenges with the Architecture of Indian Languages
Complexities of Regional Languages
The development of Large Language Models (LLMs) for Indian languages presents a distinct challenge due to their complex grammar and cultural nuances, including idioms. Unlike English, which has a more straightforward syntax and grammar, Indian languages such as Hindi, Bengali, Telugu, Marathi, Malayalam, Gujarati, Kannada, Punjabi and Tamil have complicated sentence constructions and contextual nuances. This requires an LLM design that is not just technically advanced but also culturally sensitive.
Script Diversity and Phonetics
India's vast array of languages is also reflected in its variety of scripts, ranging from the Devanagari script of Hindi to the Gurmukhi script of Punjabi. Each of these scripts has distinct features and phonetic complexities. For a multilingual Large Language Model to be effective, it needs to be proficient in handling these diverse scripts, grasping their phonetic details.
Cultural Context and References
In India, language is intricately intertwined with cultural contexts. Idioms, regional colloquialisms, and terms laden with cultural significance play a crucial role in communication. A prime example is 'Hinglish,' a widely used blend of Hindi and English, which highlights the dynamic and intricate nature of language usage in India. It's essential for an LLM to be capable of smoothly navigating between languages, effectively capturing the nuances of such blended linguistic forms.
Other than the complexities of Indian language, the other question is of the expenses involved and the resources required.
High Training Costs and Data Scarcity
A major obstacle in creating multilingual Large Language Models for Indian languages is the lack of extensive digital data. There is a limited availability of online resources for many Indian languages, which makes gathering and organizing data both expensive and labor-intensive. Furthermore, the computational power needed to train these intricate models is considerable, resulting in higher total costs for development.
Infrastructure Questions
The scarcity of comprehensive digital data requires a strong computational framework for the training and development of models. This demands not only high-performance hardware but also advanced software tools designed to manage substantial datasets and execute complex tasks in language processing.
The good news is that a lot of regional language models are getting developed as we speak.
Examples of Multilingual LLMs for Education in India
There are several open-source multilingual LLMs that have been developed or are being developed for the Indian context, either by academic institutions, research organizations, or private companies.
OpenHathi
OpenHathi, developed by Sarvam AI, stands as the inaugural publicly accessible Hindi Language Model (LLM). Boasting 7 billion parameters, it adeptly handles both Hindi and English languages through a distinctive tokenizer and a three-phase training methodology. Its applications span text translation, information retrieval, and content generation, catering to the Hindi market’s diverse needs.
Tamil-LLAMA
Tamil-LLAMA, an advanced LLM tailored for the Tamil language, is the brainchild of Abhinand Balachandran. An evolution of the LLaMA model, it significantly amplifies its capability to handle Tamil text. By integrating an additional 16,000 Tamil-specific tokens into its vocabulary, Tamil-LLAMA excels in tasks like text summarization, sentiment analysis, and text classification specific to the Tamil language.
Bhashini
Bhashini is a multilingual LLM that can handle 10 Indian languages, developed by Reverie Language Technologies. It has 1.5 billion parameters and can perform tasks such as text summarization, translation, and question answering, across languages. Bhashini is based on the mBERT model, but uses a novel data cleaning and filtering technique to improve its quality.
These are just some of the examples of multilingual LLMs that are built in India or for the Indian market. There are many more models that are being developed or are in the pipeline.
So, what are the benefits of these models?
Benefits of Multilingual LLMs for Education in India
Multilingual LLMs can have several benefits, such as:
Enhancing the Quality and Accessibility of Educational Content
Multilingual LLMs can help create, translate, and summarize educational content in various languages, making it more accessible and engaging for students from different linguistic backgrounds.
For example, multilingual LLMs can help generate textbooks, lectures, quizzes, and assignments in multiple languages, or translate existing content from one language to another, or summarize long and complex content into concise and simple summaries.
This can help students learn better and faster, as they can access content in their preferred or native language, and avoid the cognitive load of learning in a foreign language.
Enabling Personalized and Adaptive Learning
Multilingual LLMs can help provide personalized and adaptive learning experiences for students, by tailoring the content and feedback according to their individual needs, preferences, and progress.
For instance, multilingual LLMs can help create personalized learning paths, recommend suitable learning resources, provide instant feedback, and monitor and assess the learning outcomes of students, in multiple languages. As a result, students learn at their own pace and level, and improve their motivation and retention.
Multilingual Literacy and Awareness
Multilingual LLMs can enhance multilingual literacy and awareness among students, by exposing them to diverse languages and cultures, and by encouraging them to learn and appreciate the richness and beauty of their own and other languages.
As an example, a multilingual LLM can generate stories, poems, songs, or jokes in different languages, and explain the meaning, context, and nuances of the words and expressions used. It can also provide information and insights on the history, geography, literature, and art of different regions and communities, and inspire curiosity and creativity among students.
Empowering Teachers and Educators
Multilingual LLMs can help empower teachers and educators, by augmenting their capabilities and reducing their workload. To illustrate, multilingual LLMs can help teachers create and deliver high-quality and diverse content, provide effective and timely feedback, and evaluate and track the performance of students, in multiple languages.
Streamlining tasks will allow teachers to concentrate on the pedagogical and emotional aspects of teaching. They can invest time in designing engaging activities, fostering discussions, and offering guidance and support without being encumbered by excessive demands.
Other than the above, multilingual LLMs can have several broader advantages that might be used for educational purposes.
Broader Opportunities in Society
Education in Rural Areas
Multilingual Large Language Models have the potential to transform education, especially in rural regions. By delivering educational materials in a range of local languages, these models can help close the gap in educational resources and enrich the learning experience.
Translation Services
Large Language Models can be crucial in offering accurate translation services, which are vital in a multilingual nation like India. They enable smooth communication between different language groups.
Upskilling and Government Digital Initiatives
In line with government efforts towards digital inclusion, Large Language Models can provide avenues for skill enhancement and expanding access to government services among speakers of various languages.
Final Words
Overall, the opportunities look promising. By harnessing the potential of multilingual LLMs, I’m hoping that we can create a more inclusive and effective education system in India, which will cater to the needs of students from diverse backgrounds. Very soon, we may see that students from every background are competing with each other on an equal footing.