Introduction
Most breakthroughs in AI applications have focused primarily on the English language. What about other languages? How can we ensure that AI serves a global audience? This is where the CulturaX dataset comes into play. In this blog, we’ll take you through what CulturaX is and why it was created.
Current Limitations in AI and NLP
There has been comparatively little development of AI models and applications in languages other than English, largely because of the scarcity of training datasets in those languages. There's a plethora of high-quality datasets for English, but when you step into the world of non-English languages, the options dwindle considerably. Before CulturaX, datasets in non-English languages were often limited in size and coverage, which severely hindered the development of AI models for those languages. For many under-represented languages, such as Polish and Finnish, accurate translation models are still hard to come by.
CulturaX for AI Democratization
CulturaX is a new multilingual dataset designed to address this challenge. It was developed by Adobe Research in collaboration with researchers at the University of Oregon. It is one of the largest and most comprehensive multilingual datasets openly released to date, with over 6.3 trillion tokens spanning 167 languages. CulturaX is also the first multilingual dataset at this scale to be thoroughly cleaned and deduplicated, making it well suited for training high-quality AI and NLP models.
By providing a massive and diverse multilingual dataset, CulturaX democratizes access to quality training data in multiple languages. This means AI developers across the world can now work on creating models that understand and communicate in languages that are underrepresented.
Creation of CulturaX: Merging & Processing mC4 and OSCAR
CulturaX was created by merging and processing two other large datasets: mC4 and OSCAR. Both are built from web-crawled data, since large curated corpora simply don't exist for most non-English languages; combining them yields an efficient data collection pipeline across many languages and a much larger pool of training data. mC4 is the multilingual version of the C4 corpus derived from Common Crawl, while OSCAR is a multilingual corpus of web documents, also extracted from Common Crawl. The pie chart below illustrates how the initial CulturaX dataset was assembled from multiple versions of mC4 and OSCAR.
Source: https://arxiv.org/pdf/2309.09400.pdf
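To make the merging step concrete, here's a minimal sketch using the Hugging Face datasets library, with tiny in-memory toy documents standing in for mC4 and OSCAR shards. The actual pipeline, of course, operates on the full per-language dumps of both corpora.

```python
# pip install datasets
from datasets import Dataset, concatenate_datasets

# Toy stand-ins for per-language shards of mC4 and OSCAR.
mc4_shard = Dataset.from_dict({
    "text": ["First document from an mC4 snapshot.", "Second mC4 document."],
    "source": ["mC4", "mC4"],
})
oscar_shard = Dataset.from_dict({
    "text": ["A document taken from an OSCAR release."],
    "source": ["OSCAR"],
})

# Merge the two corpora into a single dataset before cleaning and deduplication.
merged = concatenate_datasets([mc4_shard, oscar_shard])
print(merged.num_rows)        # 3
print(merged[2]["source"])    # 'OSCAR'
```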
Once the two datasets were merged, the researchers cleaned and deduplicated the data using a rigorous pipeline of multiple stages. Let’s take a look at a few important steps.
- Data Cleaning from Documents
Data extracted from web pages contains a lot of noise: HTML tags, URLs, and other non-textual elements. These are stripped out and each document is refined so that only meaningful text remains. Irrelevant content such as leftover code snippets is also removed.
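Here is a minimal sketch of this kind of document refinement using only Python's standard library; the regular expressions and the minimum-length threshold are illustrative assumptions, not the exact rules used for CulturaX.

```python
import re

URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")  # leftover HTML tags

def refine_document(text: str, min_words: int = 5) -> str | None:
    """Strip URLs and stray HTML tags, then drop documents that are too short."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text if len(text.split()) >= min_words else None

doc = "<p>Visit https://example.com for more info about the CulturaX dataset release.</p>"
print(refine_document(doc))  # "Visit for more info about the CulturaX dataset release."
```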
- Increased Accuracy in Language Identification
One of the challenges of working with multilingual data is accurately identifying the language of each document. The FastText language detector was used for this dataset, as it has shown better accuracy on benchmark datasets than alternatives such as cld3. Documents whose language FastText could not detect with sufficient confidence were removed to maintain data quality.
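A minimal sketch of FastText-based language identification is shown below; it assumes you have downloaded the pretrained lid.176.bin model, and the confidence threshold is an illustrative value rather than the one used in the CulturaX pipeline.

```python
# pip install fasttext; the pretrained language-ID model (lid.176.bin) is available
# from https://fasttext.cc/docs/en/language-identification.html
import fasttext

model = fasttext.load_model("lid.176.bin")

def detect_language(text: str, min_confidence: float = 0.5):
    """Return (language_code, confidence), or None if the prediction is not confident enough."""
    labels, probs = model.predict(text.replace("\n", " "))
    lang = labels[0].replace("__label__", "")
    conf = float(probs[0])
    return (lang, conf) if conf >= min_confidence else None

print(detect_language("Tämä on suomenkielinen esimerkkilause."))  # e.g. ('fi', 0.99)
print(detect_language("asdf qwer zxcv"))                          # likely None: dropped from the corpus
```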
- Removal of Harmful Content and Noise
The researchers also took steps to remove harmful content and noise from the dataset. URL-based filtering, driven by a list of blacklisted sites, was used to eliminate content from domains associated with pornography, gambling, hacking, and similar categories. This step also removed documents containing hate speech, violence, or other inappropriate content, as well as documents that were spam or otherwise irrelevant.
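Here is a minimal sketch of URL-based filtering against a domain blocklist; the example domains are made up, and the real pipeline relies on much larger curated lists of sites grouped by category.

```python
from urllib.parse import urlparse

# Illustrative blocklist; real lists contain many thousands of domains per category.
BLACKLISTED_DOMAINS = {"badsite.example", "casino.example"}

def is_allowed(url: str) -> bool:
    """Reject documents whose source domain (or any parent domain) is blacklisted."""
    host = urlparse(url).netloc.lower().split(":")[0]
    parts = host.split(".")
    # Check "casino.example" as well as "www.casino.example", etc.
    return not any(".".join(parts[i:]) in BLACKLISTED_DOMAINS for i in range(len(parts)))

docs = [{"url": "https://news.example/article", "text": "..."},
        {"url": "https://www.casino.example/spin", "text": "..."}]
kept = [d for d in docs if is_allowed(d["url"])]
print([d["url"] for d in kept])  # only the news article survives
```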
- Removal of Duplicate Entries in Data
Because the data was collected from web pages, it contained a lot of repetitive content. If you train an LLM on a dataset full of duplicates, the model memorizes repeated examples instead of generalizing well to new ones, leading to skewed results; duplicates also inflate the dataset size and memory requirements. To avoid this, duplicate documents were identified and removed within each language of the dataset using MinHash deduplication. Even though this step was computationally expensive, it was essential.
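To illustrate the idea, here is a minimal near-duplicate detection sketch using the datasketch library; the shingle size and similarity threshold are illustrative assumptions, not the exact settings used for CulturaX.

```python
# pip install datasketch
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from word 3-grams (shingles)."""
    words = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(1, len(words) - 2)):
        m.update(" ".join(words[i:i + 3]).encode("utf-8"))
    return m

docs = {
    "doc1": "CulturaX is a large multilingual dataset for training language models.",
    "doc2": "CULTURAX IS A LARGE multilingual dataset for training language models.",  # same text, different casing
    "doc3": "Completely unrelated text about gardening tips and tomato plants.",
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = []
for key, text in docs.items():
    sig = signature(text)
    if lsh.query(sig):          # a near-duplicate is already indexed, so drop this one
        continue
    lsh.insert(key, sig)
    kept.append(key)

print(kept)  # ['doc1', 'doc3']: doc2 is detected as a duplicate of doc1
```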
Stats of CulturaX
Now, let's talk numbers. The CulturaX dataset is an absolute behemoth: over 6.3 trillion tokens spread across 167 languages, as broken down in the image below. That's a goldmine of information for AI researchers and developers, no matter where they are in the world.
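If you want to poke at the data yourself, here is a minimal sketch using the Hugging Face datasets library. It assumes the corpus is hosted on the Hub as uonlp/CulturaX with per-language configurations; check the dataset card for the exact ID and any access terms before running it.

```python
# pip install datasets
from datasets import load_dataset

# Stream a single language subset so nothing multi-terabyte lands on disk.
# The dataset ID and config name are assumptions; the dataset may also be gated,
# in which case you need to log in first (huggingface-cli login).
culturax_fi = load_dataset("uonlp/CulturaX", "fi", split="train", streaming=True)

for i, example in enumerate(culturax_fi):
    print(example["text"][:200])  # fields typically include the raw text and its source URL
    if i == 2:
        break
```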
Potential Applications of the Dataset
The versatility of CulturaX is exciting. Here are some areas where this dataset can shine:
- Multilingual chatbots - imagine chatbots that can converse fluently in multiple languages, not just English.
- Language translation - better training data means more accurate and nuanced translation services.
- Global sentiment analysis - performing sentiment analysis on social media data like tweets and posts published in multiple languages.
- Content moderation - safer online communities in multiple languages.
- Multilingual search engines
- Multilingual social media platforms
The Future of CulturaX
While CulturaX is a remarkable step forward, there's always room for growth and improvement. Here are some questions to ponder:
- How to involve under-resourced communities - for example, in identifying potential harms from AI systems built on their languages and data.
- How to fill remaining data gaps - for languages with limited text data, creative solutions can help generate training data.
- How to explore multimodal learning - using images, videos, and speech data alongside text may improve versatility.
- How to test rigorously for biases - across languages, cultures, demographics, and tasks before deployment.
- How to invest in two-way open research - so insights from global researchers continuously improve shared models and data.
Conclusion
CulturaX is a testament to the power of collaboration, careful data cleaning, and a dedication to democratizing AI access. With this gigantic, cleaned, multilingual dataset, the possibilities for AI research and development are boundless. As we look to the future, the importance of continually improving and expanding such datasets cannot be overstated. The world is multilingual, and our AI models should reflect that diversity.