LLMs have been all the rage in 2023. Our CTO, M. Imran, writes about how Indic language data can be secured in the cloud in this blog in ‘Express Computer’: https://www.expresscomputer.in/guest-blogs/security-in-the-cloud-safeguarding-indic-language-data-with-large-language-models/106669/
In this blog, we summarize the principal points of the article for ease of understanding.
Introduction
Large Language Models (LLMs) have been making waves in 2023. These AI models are capable of creating human-like text and understanding human language, intriguing users and opening doors for potential applications in areas such as customer support and education. Their skills in language translation, answering questions, summarizing text, and enabling conversational AI mark a potential shift in human-machine interaction design.
Building Large Language Models (LLMs)
What makes Large Language Models (LLMs) captivating is their construction process. They are built using sophisticated machine learning (ML) algorithms, notably transformers. These models are trained on wide-ranging datasets that encompass various subjects, languages, and styles, using advanced cloud GPUs. Through this training, the models learn to anticipate the subsequent word in a sentence by considering the context of preceding words. This skill, honed by training and refining the model with countless data points from these datasets, equips the model to understand intricate elements of language, such as syntax, semantics, and general knowledge.
Indic-Language LLMs for India's Diversity
The behavior of Large Language Models (LLMs) is significantly influenced by the nature and composition of their training datasets. For example, a model trained on a dataset curated for a particular cultural context will yield responses that are most relevant to that context, capturing the specific linguistic styles, knowledge, and semantic subtleties of that dataset.
This aspect is especially important for a country as culturally and linguistically diverse as India. Indian languages like Hindi, Bengali, Gujarati, Telugu, Marathi, Tamil, and Urdu are not only spoken by millions in India but also by a growing global diaspora. Each of these languages possesses its own unique script, literary background, and regional distinctions. To effectively utilize the potential of LLMs for India's extensive user community, it's essential to create foundational models for Indic languages, each trained on a varied dataset tailored to the specific language. Although the fundamental technology for developing LLMs remains the same, the datasets need to be customized according to the respective language.
Addressing Risks
Maintaining the integrity of datasets is essential, particularly regarding their security during pre-training and training phases. Given that these datasets may contain millions or billions of data points sourced from various Indian public domains, adhering to Indian legal standards is a necessity. The training phase utilizes cutting-edge cloud GPUs, like the AI supercomputer HGX 8xH100 introduced by E2E Cloud in India. This platform, equipped with H100 GPUs, is adept at managing AI models with trillions of parameters, making it ideal for constructing foundational language models. The duration of training these models can span from several days to months, during which the protection of both the model and its dataset is of utmost importance. The hyperscale cloud platforms used for training must be robust against external interference and comply fully with Indian IT regulations, making the selection of a cloud provider a critical decision.
‘LLMs are also susceptible to prompt poisoning, a technique where attackers manipulate the training process by introducing adversarial prompts with toxic or biased content. If these prompts are included in the model’s training data, they can drastically affect the LLM’s output. For example, attackers could insert prompts that cause the LLM to ignore certain user inputs or generate offensive text, posing significant risks once the LLM is deployed. To mitigate such risks, the dataset and training process must be conducted in a highly secure and protected environment. Secure hyperscale cloud GPU platforms, specifically built in India with India-centric security and privacy compliances, therefore become indispensable,’ notes M. Imran.
Key Decision Factors for Indic-Language LLMs
‘Indian languages, including Hindi, Bengali, Gujarati, Telugu, Marathi, Tamil, and Urdu, are spoken by millions across the country and an ever-expanding diaspora. Each language has its distinct script, literary heritage, and regional variations. To harness LLM capabilities for India’s vast user base, we need to develop Indic-language foundational models, trained on diverse datasets specific to each language. While the core technology of building LLMs remains constant, the datasets must vary based on the language,’ says M. Imran.
When developing Indic-language Large Language Models (LLMs), several crucial factors come into play. Firstly, selecting Cloud Service Providers (CSPs) that provide immediate access to sophisticated GPU platforms, such as the HGX 8xH100 AI Supercomputer, is vital as it reduces training time and offers state-of-the-art technology. Secondly, it's essential that these CSPs are fully compliant with Indian IT regulations. Lastly, the datasets employed in the training process must be diverse, unbiased, and accurately represent the cultural intricacies of each language.
Opportunity for India
‘Since dataset purity is key, security of the dataset before and during training becomes paramount in this context,’ says M. Imran.
Creating and applying Indic-language Large Language Models (LLMs) offers a significant advantage for India. These AI models, designed to cater to the varied languages and rich cultural heritage of the Indian subcontinent, could bring unmatched benefits to India's large and expanding population of users. By training these models with a range of datasets specific to different regions, we can make sure that they not only comprehend but also align with the various dialects, languages, and cultural subtleties of India. This targeted strategy for LLMs has the potential to revolutionize sectors like education, customer service, and technology.
‘The AI training platform, powered by H100 GPUs, is capable of handling trillion parameter AI models and is designed for building foundational language models. Training time for foundational language models can range from days to months, and throughout this process, the security of the model and dataset is crucial. The hyperscale cloud platform used for training must be impervious to foreign intrusion. Furthermore, they should be fully compliant with Indian IT laws,’ says M. Imran.
Conclusion
Unlocking the complete capabilities of these Large Language Models (LLMs) hinges on their ethical and secure creation and implementation, adhering to legal requirements. Concentrating on these elements allows us to provide the people of India with AI solutions that are not only technologically sophisticated but also culturally sensitive and morally responsible. This approach will enable us to fully exploit the transformative potential of LLMs to meet the distinctive and varied demands of India.