A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the fields of natural language processing (NLP) and computer vision (CV).
Similar to recurrent neural networks (RNNs), transformers are designed to process sequential input data, such as natural language, with applications towards tasks such as translation and text summarization.
Why are Transformers preferred over RNNs?
However, unlike RNNs, transformers process the entire input all at once. The attention mechanism provides context for any position in the input sequence. For example, if the input data is a natural language sentence, the transformer does not have to process one word at a time. This allows for more parallelization than RNNs and therefore reduces training times.
Indeed, 70 percent of arXiv papers on AI posted in the last two years mention transformers. That’s a radical shift from a 2017 IEEE study that reported RNNs and CNNs were the most popular models for pattern recognition.
In a paper titled “A Comparative Study on Transformer vs RNN in Speech Applications”, the researchers undertook intensive studies in which they experimentally compared and analyzed Transformer and conventional recurrent neural networks (RNNs) across a total of 15 ASR (automatic speech recognition) benchmarks, one multilingual ASR benchmark, one ST (speech translation) benchmark, and two TTS (text-to-speech) benchmarks. The experiments revealed various training tips and significant performance benefits obtained with Transformer for each task, including the surprising superiority of Transformer in 13 of the 15 ASR benchmarks in comparison with RNN.
Evolution of Transformers:
It all started a few years ago, when the big advances in AI were happening in computer vision, driven by the great success of CNNs on the ImageNet challenge. The hype revolved around the success of Neural Style Transfer, Generative Adversarial Networks, and similar breakthroughs.
But AI researchers needed a neural network that learns context, and thus meaning, by tracking relationships in sequential data, like the words in this sentence. That is where transformers came into the picture.
First described in a 2017 paper from Google, transformers are among the newest and most powerful classes of models invented to date. They’re driving a wave of advances in machine learning that some have dubbed transformer AI.
RNN:
A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step. In traditional neural networks, all the inputs and outputs are independent of each other; but in cases such as predicting the next word of a sentence, the previous words are required, and hence there is a need to remember them. RNNs solve this issue with the help of a hidden layer. The main and most important feature of an RNN is its hidden state, which remembers some information about the sequence.
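To make the idea of a hidden state concrete, here is a minimal sketch of a single recurrent step in PyTorch. It is an illustration only; the dimensions and the parameter names (`W_xh`, `W_hh`, `b_h`) are assumptions, not code from any particular library.

```python
import torch

# Illustrative sizes (assumptions for the example)
input_size, hidden_size = 8, 16

# Parameters of a vanilla RNN cell
W_xh = torch.randn(input_size, hidden_size) * 0.1   # input-to-hidden weights
W_hh = torch.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden weights
b_h = torch.zeros(hidden_size)                       # hidden bias

def rnn_step(x_t, h_prev):
    """One recurrence: the new hidden state mixes the current input
    with the previous hidden state, so earlier words are 'remembered'."""
    return torch.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Process a toy sequence of 5 time steps, one step at a time
h = torch.zeros(hidden_size)
for x_t in torch.randn(5, input_size):
    h = rnn_step(x_t, h)  # the previous step's output feeds the current step
```

Notice that the loop is inherently sequential: step t cannot start before step t-1 has finished, which is exactly the bottleneck transformers remove.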
Shortcomings of RNN:
1. Vanishing Gradient Problem: The vanishing gradient problem is a major issue that affects the performance of RNNs. The gradients used to update the weights of the network tend to become smaller and smaller as they are propagated back through the network, leading to an inability of the network to learn complex patterns (a toy demonstration follows this list).
2. Limited Memory: Another limitation of RNNs is their limited memory. This means that they are not able to remember long-term patterns and cannot generalize well to new data.
3. Inability to Model Long-range Dependencies: RNNs are also unable to model long-range dependencies, which means that they are not able to capture relationships between data that are far apart in time. This can lead to a lack of accuracy when predicting future values.
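As a rough illustration of point 1, the sketch below backpropagates through a long vanilla RNN in PyTorch and compares the gradient reaching the earliest input with the gradient at the latest input. The sequence length and hidden size are arbitrary assumptions for the toy example, not a benchmark.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, hidden = 100, 32

rnn = nn.RNN(input_size=1, hidden_size=hidden, batch_first=True)
x = torch.randn(1, seq_len, 1, requires_grad=True)

out, _ = rnn(x)
# Make the loss depend only on the final time step's output
loss = out[:, -1].sum()
loss.backward()

# Gradient flowing back to the first vs. the last input element
print("grad magnitude at t=0:   ", x.grad[0, 0].abs().item())
print("grad magnitude at t=last:", x.grad[0, -1].abs().item())
# Typically the t=0 gradient is orders of magnitude smaller:
# the vanishing gradient problem in action.
```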
Why are Transformers replacing RNNs?
Before transformers arrived, users had to train neural networks with large, labeled datasets that were costly and time-consuming to produce. By finding patterns between elements mathematically, transformers eliminate that need, making available the trillions of images and petabytes of text data on the web and in corporate databases.
In addition, the math that transformers use lends itself to parallel processing, so these models can run fast. Transformers now dominate popular performance leaderboards like SuperGLUE, a benchmark developed in 2019 for language-processing systems.
Transformers are replacing RNNs in many applications because they are faster, more accurate, and more efficient than RNNs. Transformers are able to process sequences of data in parallel, which allows them to process longer sequences of data than RNNs can. Additionally, Transformers can capture long-term dependencies better than RNNs due to their attention mechanism. Finally, Transformers can be trained on larger datasets and can be deployed on GPUs for faster processing, making them ideal for many applications.
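The attention mechanism mentioned above can be written in a few lines. The following is a minimal sketch of scaled dot-product self-attention with illustrative, assumed dimensions; it shows why every position can attend to every other position in a single, fully parallel matrix computation.

```python
import math
import torch

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention over the whole sequence at once."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v                     # project inputs to queries, keys, values
    scores = Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1))  # all-pairs similarity
    weights = torch.softmax(scores, dim=-1)                  # how much each token attends to every other token
    return weights @ V                                       # weighted sum of values

# Toy example: a "sentence" of 6 tokens with 32-dimensional embeddings
d_model = 32
x = torch.randn(6, d_model)
W_q = torch.randn(d_model, d_model) * 0.1
W_k = torch.randn(d_model, d_model) * 0.1
W_v = torch.randn(d_model, d_model) * 0.1

out = self_attention(x, W_q, W_k, W_v)  # shape (6, 32): one context-aware vector per token
```

Because the whole sequence is handled with a few matrix multiplications, the computation maps naturally onto GPUs, which is where the speed advantage over the step-by-step RNN loop comes from.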
How does the Transformer Model work?
Neural networks have many architectures tailored to different sensory modalities such as vision, audio, and text, each traditionally processed by a different-looking network. Recently we have seen a convergence towards one architecture: the transformer. It can gobble up videos, images, audio, and text alike, and it behaves like a general-purpose computer that is trainable and very efficient to run on hardware.
Let’s take a glimpse of the steps in a transformer model pipeline (a minimal code sketch follows this list):
Step 1: Pre-processing: In the pre-processing step, the text is cleaned and formatted; elements such as stopwords, punctuation, and special symbols may be removed.
Step 2: Tokenization: In this step, the text is split into tokens, or words or phrases.
Step 3: Embedding: The tokens are then embedded, or represented as numeric vectors, which are then used as input to the transformer model.
Step 4: Encoding: The embedded tokens are then passed through a multi-layer encoder stack, where each layer combines multi-head self-attention with a position-wise feedforward network.
Step 5: Attention: Attention mechanisms are used to identify important words in the input, and to focus on them when generating the output.
Step 6: Decoding: The encoded representations are then passed to a decoder stack, which uses masked self-attention and encoder-decoder attention to generate the output token by token.
Step 7: Post-processing: In the post-processing step, the output is formatted and cleaned, and symbols, punctuation, and stop words are added back.
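To tie the steps together, here is a hedged sketch of steps 2–5 using PyTorch’s built-in modules. The whitespace tokenizer, toy vocabulary, and model sizes are illustrative assumptions, and positional encodings are omitted for brevity; it is not the pipeline of any particular production system.

```python
import torch
import torch.nn as nn

# Step 2: tokenization (a naive whitespace tokenizer and a toy vocabulary)
sentence = "transformers process the entire input at once"
vocab = {w: i for i, w in enumerate(sorted(set(sentence.split())))}
token_ids = torch.tensor([[vocab[w] for w in sentence.split()]])  # shape (1, seq_len)

# Step 3: embedding - token ids become dense vectors
d_model = 64
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)

# Steps 4-5: encoding with multi-head self-attention
# (a stack of transformer encoder layers; positional encodings omitted here)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

encoded = encoder(embedding(token_ids))  # shape (1, seq_len, d_model)
print(encoded.shape)
```

A decoder stack (step 6) would consume these encoded representations to generate the output sequence, and post-processing (step 7) would turn the predicted token ids back into readable text.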
Computational Requirements for Training Transformers:
Transformers are being used extensively across several sequence modeling tasks. Significant research effort has been devoted to experimentally probe the inner workings of Transformers. However, our conceptual and theoretical understanding of their power and inherent limitations is still nascent.
In particular, the roles of various components in Transformers, such as positional encodings, attention heads, residual connections, and feedforward networks, are not clear. Researchers have analyzed the model's computational power as captured by Turing completeness: there is an alternate and simpler proof that vanilla Transformers are Turing-complete, and it follows that Transformers with only positional masking and without any positional encoding are also Turing-complete. The analysis further examines whether each component is necessary for Turing completeness; interestingly, a particular type of residual connection turns out to be necessary.
New & Existing NVIDIA NGC containers:
- NVIDIA NeMo is an open source toolkit for conversational AI. It is built for data scientists and researchers to build new state-of-the-art speech and NLP networks easily through API-compatible building blocks that can be connected together.
Neural Modules are conceptual blocks that take typed inputs and produce typed outputs. Such modules represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations. NeMo makes it easy to combine and re-use these building blocks while providing a level of semantic correctness checking via its neural type system.
Conversational AI architectures are typically very large and require a lot of data and compute for training. Built for speed, NeMo can utilize NVIDIA's Tensor Cores and scale out training to multiple GPUs and multiple nodes. NeMo uses PyTorch Lightning for easy and performant multi-GPU/multi-node mixed-precision training. Every NeMo model is a LightningModule that comes equipped with all supporting infrastructure for training and reproducibility.
Several pretrained models for Automatic Speech Recognition (ASR), Natural Language Processing (NLP) and Text to Speech (TTS) are provided in the NGC Collection for NeMo.
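As a hedged illustration of how such a pretrained model might be loaded, the snippet below assumes a NeMo installation and uses the `QuartzNet15x5Base-En` checkpoint name from NeMo's documentation; exact class names, arguments, and install extras can differ between NeMo versions, so treat this as a sketch rather than the canonical API.

```python
# Assumes something like: pip install nemo_toolkit[asr]  (extras name may vary by version)
import nemo.collections.asr as nemo_asr

# Load a pretrained ASR model from NGC by name (example name from NeMo docs)
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

# Transcribe local audio files (the paths are placeholders)
transcripts = asr_model.transcribe(["sample_1.wav", "sample_2.wav"])
print(transcripts)
```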
- PyTorch is an optimized tensor library for deep learning using GPUs and CPUs. Automatic differentiation is done with a tape-based system at both a functional and neural network layer level. This functionality brings a high level of flexibility and speed as a deep learning framework and provides accelerated NumPy-like functionality. NGC Containers are the easiest way to get started with PyTorch. The PyTorch NGC Container comes with all dependencies included, providing an easy place to start developing common applications, such as conversational AI, natural language processing (NLP), recommenders, and computer vision.
The PyTorch NGC Container is optimized for GPU acceleration, and contains a validated set of libraries that enable and optimize GPU performance. This container also contains software for accelerating ETL (DALI, RAPIDS), Training (cuDNN, NCCL), and Inference (TensorRT) workloads.
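As a small sketch of the "NumPy-like tensor math plus tape-based automatic differentiation" described above, the snippet below can be run inside the PyTorch NGC container or any PyTorch install; the tensor sizes are arbitrary.

```python
import torch

# Use the GPU when one is available (as in the NGC container), otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# NumPy-like tensor math, optionally on the GPU
a = torch.randn(3, 3, device=device, requires_grad=True)
b = torch.randn(3, 3, device=device)
loss = ((a @ b) ** 2).mean()

# Tape-based automatic differentiation: gradients of loss w.r.t. a
loss.backward()
print(a.grad.shape)  # (3, 3)
```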
- Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs. Triton supports HTTP/REST and gRPC protocols that allow remote clients to request inferencing for any model being managed by the server. For edge deployments, Triton is available as a shared library with a C API that allows the full functionality of Triton to be included directly in an application.
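The sketch below shows what a remote inference request over HTTP could look like with the `tritonclient` Python package. The model name, input/output tensor names, and shape are placeholders that depend entirely on the model you have deployed, and the server is assumed to be running locally on the default port.

```python
import numpy as np
import tritonclient.http as httpclient  # pip install tritonclient[http]

# Connect to a Triton server assumed to be running on localhost:8000
client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder input: name, dtype, and shape must match the deployed model's config
data = np.random.rand(1, 16).astype(np.float32)
infer_input = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Ask the server to run inference on the named model and read back the output tensor
response = client.infer(model_name="my_model", inputs=[infer_input])
print(response.as_numpy("OUTPUT0"))
```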
- TensorFlow is an open source platform for machine learning. It provides comprehensive tools and libraries in a flexible architecture allowing easy deployment across a variety of platforms and devices. NGC Containers are the easiest way to get started with TensorFlow. The TensorFlow NGC Container comes with all dependencies included, providing an easy place to start developing common applications, such as conversational AI, natural language processing (NLP), recommenders, and computer vision.
The TensorFlow NGC Container is optimized for GPU acceleration, and contains a validated set of libraries that enable and optimize GPU performance. This container may also contain modifications to the TensorFlow source code in order to maximize performance and compatibility. This container also contains software for accelerating ETL (DALI, RAPIDS), Training (cuDNN, NCCL), and Inference (TensorRT) workloads.
You can also train your BERT model via Cloud GPUs. As we know, a GPU can accelerate transformer training by allowing networks to process more data in parallel, reducing the time required to train the model.
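For instance, here is a hedged sketch of moving a BERT model onto a cloud GPU and taking one training step with the Hugging Face `transformers` library; the library choice, toy data, and hyperparameters are assumptions for illustration and not part of the article above.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Train on the GPU when available (e.g. a cloud GPU instance)
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
).to(device)

# One toy training step; real training would loop over a DataLoader of batches
batch = tokenizer(["a toy example", "another toy example"],
                  padding=True, return_tensors="pt").to(device)
labels = torch.tensor([0, 1]).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```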
Check out Cloud GPUs on E2E Cloud: https://www.e2enetworks.com/products.
Now you can build and launch machine learning applications on the most high-value infrastructure cloud in the market.
If you have any queries, please connect with us at sales@e2enetworks.com.
References & Citations: “A Comparative Study on Transformer vs RNN in Speech Applications”, https://arxiv.org/abs/1909.06317