How to use Apache Spark and Horovod for scaling Deep Learning Models on E2E Cloud?

May 20, 2022

Apache Spark and Deep Learning

Any deep learning pipeline has two distinct phases: preparing the data, and training a model on that data. Spark handles the first phase well, and it offers many functions for pre-processing data. Model training follows this basic ETL (Extract-Transform-Load) step, but it is quite different from most of the problems we solve with Apache Spark. Spark's built-in functions are not designed to distribute the training computation itself, so they fall short at this step. However, there are approaches that fill the gap.

Distributed Deep Learning

One of the early approaches to distributed deep learning with Spark is built on parameter servers. It can be combined with two kinds of parallelism. The first is model parallelism: the layers of the model are split across different GPU machines, and each machine passes its outputs on to the machine that holds the next layers. The second is data parallelism: the data is split into subsets, and each GPU machine trains a copy of the model on its own subset. The parameter servers then aggregate the updates from all machines, and the resulting global weights are sent back to each GPU. There are two main advantages to using parameter servers for distributed training. One is fault tolerance. Because the model is replicated, backups exist, so training can survive the failure of one or more GPUs. The second is support for asynchronous SGD (Stochastic Gradient Descent), which lets each machine send its updates at its own pace instead of waiting for the others. But it is not all perfect. There are some serious limitations to this approach.
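The data-parallel loop described above can be sketched in a few lines. This is only a single-process simulation, not how any particular framework implements it: the "workers" are plain list shards, the "server" is a pair of Python floats, and the function names, the linear model, and the learning rate are all assumptions chosen for illustration.

```python
import random

def local_gradient(w, b, shard):
    """Mean-squared-error gradient of y ~ w*x + b on one worker's shard."""
    n = len(shard)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in shard) / n
    return grad_w, grad_b

def train_with_parameter_server(data, num_workers=4, lr=0.05, steps=500):
    # Data parallelism: every worker gets its own subset of the data.
    shards = [data[i::num_workers] for i in range(num_workers)]
    w, b = 0.0, 0.0  # global parameters held by the (simulated) server
    for _ in range(steps):
        # Each worker computes a gradient from the current global parameters.
        grads = [local_gradient(w, b, shard) for shard in shards]
        # The server averages the gradients and updates the global parameters,
        # which every worker reads again at the start of the next step.
        w -= lr * sum(g[0] for g in grads) / num_workers
        b -= lr * sum(g[1] for g in grads) / num_workers
    return w, b

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(200)]
data = [(x, 3.0 * x + 1.0) for x in xs]  # noiseless line y = 3x + 1
w, b = train_with_parameter_server(data)
print(round(w, 2), round(b, 2))
```

Because the synthetic data lies exactly on the line y = 3x + 1, the learned values should approach w = 3 and b = 1. Every step waits for all workers, so this is the synchronous variant; an asynchronous version would let each worker apply its update as soon as it is ready.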

Limitations with the approach

The first problem with parameter servers is that they are complex to implement. Usability has been an issue ever since the method was introduced: developers need to scaffold a lot of code to integrate parameter-server-based training reliably. The second problem is scaling efficiency. If the setup is not tuned correctly, there is little benefit in using multiple machines at all, because the parameter servers often become a major bottleneck. Finally, one of the advantages of parameter servers can also be a disadvantage. Asynchronous SGD is useful in many cases, but when machines update at different rates, some updates are computed from stale weights, which typically degrades model convergence. So in practice it is preferred to keep all machines in step with each other, even though asynchronous SGD allows them to run at different rates.
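A rough back-of-the-envelope calculation shows why the server tends to become the bottleneck. The sketch below assumes a single parameter server, one synchronous step, and that every worker uploads a full-size gradient and downloads a full-size copy of the weights; the 100 MiB model size and the function name are hypothetical.

```python
def parameter_server_traffic(model_bytes, num_workers):
    """Bytes passing through one server per synchronous step.

    Each worker uploads its gradients and downloads the new weights,
    so the server handles two model-sized transfers per worker.
    """
    return 2 * model_bytes * num_workers

model_bytes = 100 * 1024 * 1024  # hypothetical 100 MiB model
for n in (2, 8, 32):
    gib = parameter_server_traffic(model_bytes, n) / 1024**3
    print(f"{n:>2} workers -> {gib:.2f} GiB per step through the server")
```

Under these assumptions the load on the server grows linearly with the number of workers, so adding machines eventually adds more waiting than useful computation.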

Horovod

Looking at the limitations of the existing frameworks, which were mostly built around parameter servers, engineers at Uber developed an elegant solution: Horovod. Before we go into the details of how Horovod works, here are a few of the advantages of using this framework.

Minimal code: With Horovod, setting up a distributed deep learning pipeline can take as few as five lines of code.
High performance: It can take advantage of high-performance communication features such as NVIDIA's NCCL, GPUDirect, and RDMA.
Framework support: It supports TensorFlow, Keras, PyTorch, and Apache MXNet, and it can run on top of Spark.
Ease of installation: As it is all in one package, installation takes a single command: pip install horovod. That's it.

Now, what does Horovod do differently that makes it so efficient at scaling deep learning training? Horovod uses a decentralized technique known as AllReduce. At each step, every GPU computes its gradients locally and passes them on to the next machine, so that all machines end up with the same aggregated result without a central server. This makes it easier to set up and to update the global parameters at each iteration, and it makes training faster and more effective at scale. The communication structure can be a ring or a tree, depending on which one NCCL finds most efficient.
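To make the ring variant concrete, here is a small single-process simulation of a ring all-reduce. It is an illustrative sketch only: Horovod delegates the real work to libraries such as NCCL or MPI, and the function name, the use of plain Python lists, and the requirement that the vector splits evenly into one chunk per worker are assumptions of this example.

```python
def ring_allreduce(worker_vectors):
    """Sum equal-length vectors across workers using a simulated ring."""
    n = len(worker_vectors)
    length = len(worker_vectors[0])
    assert length % n == 0, "sketch assumes the vector splits evenly into n chunks"
    size = length // n
    # chunks[i][c] is worker i's current copy of chunk c.
    chunks = [[list(v[c * size:(c + 1) * size]) for c in range(n)]
              for v in worker_vectors]

    # Phase 1, scatter-reduce: each worker sends one chunk to its right-hand
    # neighbour, which adds it to its own copy. After n-1 steps worker i
    # holds the complete sum of chunk (i + 1) % n.
    for step in range(n - 1):
        sent = [(i, (i - step) % n, chunks[i][(i - step) % n][:]) for i in range(n)]
        for i, c, payload in sent:
            dest = (i + 1) % n
            chunks[dest][c] = [a + b for a, b in zip(chunks[dest][c], payload)]

    # Phase 2, all-gather: the completed chunks travel around the ring and
    # overwrite the partial copies, so every worker ends with the full sum.
    for step in range(n - 1):
        sent = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n][:]) for i in range(n)]
        for i, c, payload in sent:
            chunks[(i + 1) % n][c] = payload

    return [[x for chunk in worker for x in chunk] for worker in chunks]

grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
summed = ring_allreduce(grads)
averaged = [[x / len(grads) for x in vec] for vec in summed]
print(summed[0])
```

Each worker sends 2 * (n - 1) chunks of size 1/n of the vector, so the amount of data a worker transmits stays just under twice the vector size no matter how many workers join the ring. Dividing the summed vector by the number of workers gives the averaged gradient that every worker then applies.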

To use Horovod to scale deep learning models, add the KerasEstimator from the Horovod package to the basic Spark pipeline; Horovod and Spark handle the rest from there.
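A minimal sketch of that step might look like the following. It assumes Horovod was installed with Spark and Keras support, that a Keras model and optimizer already exist, and that the Spark DataFrame has columns named "features" and "label". The helper name, the column names, the loss, the store path, and the hyperparameters are all hypothetical placeholders rather than recommended settings.

```python
def build_keras_estimator(model, optimizer, store_path, num_proc=2):
    """Wrap a Keras model in Horovod's Spark estimator (illustrative sketch)."""
    # Imported inside the function so this file loads without Spark or Horovod.
    import horovod.spark.keras as hvd_keras
    from horovod.spark.common.store import Store

    # The store is where Horovod keeps intermediate training data and checkpoints.
    store = Store.create(store_path)
    return hvd_keras.KerasEstimator(
        num_proc=num_proc,          # number of parallel training processes
        store=store,
        model=model,
        optimizer=optimizer,
        loss="mse",                 # hypothetical choice for a regression task
        feature_cols=["features"],  # hypothetical DataFrame column names
        label_cols=["label"],
        batch_size=32,
        epochs=5,
    )

# Hypothetical usage with Spark DataFrames train_df and test_df:
# estimator = build_keras_estimator(model, optimizer, "/tmp/horovod_store")
# trained = estimator.fit(train_df).setOutputCols(["prediction"])
# predictions = trained.transform(test_df)
```

The estimator follows the usual Spark ML pattern: fit returns a trained model object, and transform applies it to a DataFrame, so it can sit alongside the existing pre-processing stages.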

Why E2E Networks

E2E understands the cost involved in heavy computation. If your business is still small, you do not want to spend too much on cloud computing. That is why we bring the cloud to you at an affordable rate. Additionally, the platform is easy to use and can be quickly adopted by your team. E2E cloud service offers high availability, reliability, and advanced technical stacks. The infrastructure is built to be one of the most affordable cloud services, and it can save 40 to 60 per cent compared with major platforms.

Conclusion

Distributed deep learning can be tricky, but with the right tools and platform, it can be achieved quite comfortably. With Horovod's simplicity and E2E cloud service's power and affordability, you can scale your deep learning models quickly and efficiently.