Introduction
Neural networks are built from a handful of basic building blocks around which the whole of deep learning revolves. Here are the fundamentals we need to know before understanding why we initialize a neural network with random weights in the first place:
- Layers, which are combined into a network
- Input data and corresponding targets
- Loss function, which defines the feedback signal used for learning
- The optimizer, which determines how the weights of the network or model are adjusted
In short, a neural network model is a stack of layers containing weights that are tuned to minimize the loss function on the training and validation data. The loss function sends feedback signals back to the layers through an optimization algorithm, such as stochastic gradient descent.
This is where weight adjustment comes into the picture. At each iteration, the optimizer adjusts the weights of the previous iteration to reduce the loss (and thereby improve accuracy). A model trained with stochastic gradient descent uses randomness while adjusting the weights to find a good set of weights for the input-to-output mapping that needs to be learned.
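To make these building blocks concrete, here is a minimal Keras sketch (the layer sizes, input shape, learning rate, and loss are placeholder choices, not taken from any particular model):

from tensorflow import keras

# Layers combined into a network
model = keras.Sequential([
    keras.Input(shape=(10,)),                   # hypothetical input with 10 features
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),
])

# Loss function (the feedback signal) and optimizer (how the weights are adjusted)
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01), loss="mse")

# Input data and corresponding targets would then be passed to model.fit(X, y)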
In this blog, we will gain a deeper understanding of the optimization process of stochastic gradient descent and the need for random initialization in neural network models.
Table of contents:
- Stochastic Gradient Descent Algorithm
- Random Initialization in Neural Networks
- Keras Random Initialization Methods
- Benefit of Random Initialization of Weights in a Neural Network Model
Stochastic Gradient Descent Algorithm
Extremely difficult search problems often require nondeterministic algorithms that rely heavily on randomization. These algorithms use randomness carefully rather than being completely random: they are known as stochastic algorithms, and their randomness is kept within bounds.
Because of the incremental, or stepwise, structure of the search, the method and its algorithms are often described as an optimization from an initial state or position to a final state or position, as in a stochastic optimization algorithm or a stochastic optimization problem.
The genetic algorithm, simulated annealing, and stochastic gradient descent are a few examples.
In order to find a good enough answer, the search process moves incrementally from a beginning point in the space of potential solutions.
They share common features in their use of randomness, both of which appear in the sketch below:
- Use of randomness during initialization
- Use of randomness during the progression of the search
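As a minimal sketch of such a stochastic algorithm (the toy objective, step size, and iteration count are arbitrary illustrative choices), the search starts at a random point and proposes random moves, keeping only those that improve the objective:

import numpy as np

rng = np.random.default_rng(42)

def objective(x):
    return np.sum(x ** 2)  # toy objective to minimize

x = rng.uniform(-5.0, 5.0, size=2)  # randomness at initialization: a random starting point
for step in range(1000):
    candidate = x + rng.normal(scale=0.1, size=2)  # randomness during the progression of the search
    if objective(candidate) < objective(x):        # keep the move only if it improves the objective
        x = candidate

print(x, objective(x))  # ends close to the minimum at the origin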
Random Initialization in Neural Networks
Stochastic gradient descent is a stochastic optimization approach used to train artificial neural networks.
The technique leverages randomness in order to discover a good enough set of weights for the specific input-to-output mapping function being learned from your data. It means that each time the training algorithm is run, a different network with a different model skill will be fit to your specific training data.
In a neural network, weights sit between every pair of adjacent layers. The values of the next layer are produced by a linear transformation of the previous layer's values with these weights, followed by a non-linear activation function. During forward propagation the data passes through the layers one after another, and through backpropagation the weight values that best fit a given input-output mapping are determined.
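A minimal NumPy sketch of one such layer-to-layer step might look like this (the layer sizes and the tanh activation are arbitrary illustrative choices):

import numpy as np

prev_values = np.random.randn(4)             # values in the preceding layer
W = np.random.randn(3, 4) * 0.01             # weights sitting between the two layers
b = np.zeros(3)                              # per-neuron biases, commonly initialized to zero
next_values = np.tanh(W @ prev_values + b)   # linear transformation followed by a non-linear activation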
By far, this is the most effective approach for training deep neural networks and finding a model that is sufficiently skillful and generalizes well to unseen data.
In particular, stochastic gradient descent requires the weights of the network to be initialized to small random values (random, but close to zero, such as in the range [0.0, 0.1]). During the search process, the training dataset is also shuffled randomly before each epoch, which causes the gradient estimate to differ from batch to batch.
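The NumPy sketch below shows both uses of randomness on a toy linear-regression problem (the data, learning rate, and epoch count are made up for illustration): the weights start as small random values in [0.0, 0.1], and the examples are shuffled before every epoch.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # toy inputs
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)   # toy targets

w = rng.uniform(0.0, 0.1, size=3)                  # small random initial weights, close to zero
learning_rate = 0.01
for epoch in range(20):
    for i in rng.permutation(len(X)):              # shuffle the training data before each epoch
        error = X[i] @ w - y[i]
        grad = error * X[i]                        # gradient of the squared error for one example
        w -= learning_rate * grad                  # stochastic gradient descent update
print(w)                                           # ends up close to true_w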
What if we initialize the weights with zero?
Zero initialization does not work, because the neural network cannot break symmetry. With every weight set to zero, each neuron in a layer computes the same output and receives the same gradient, so the learning algorithm's update equations cannot drive the weights apart and the model gets stuck. Note that, by design, each neuron's bias is typically initialized to zero rather than to a random value; this causes no symmetry problem as long as the weights themselves are random.
import numpy as np
w = np.zeros((layer_size[l], layer_size[l-1]))  # every weight identical, so the neurons in layer l stay symmetric
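To see the problem concretely, here is a minimal NumPy sketch of one forward and backward pass through a tiny one-hidden-layer network initialized entirely with zeros (the layer sizes, the sigmoid activation, and the squared-error loss are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                       # one input example with 3 features
y = 1.0                                      # its target value

W1 = np.zeros((4, 3)); b1 = np.zeros(4)      # hidden layer with 4 units, all-zero weights
W2 = np.zeros((1, 4)); b2 = np.zeros(1)      # output layer, all-zero weights

# Forward pass
z1 = W1 @ x + b1                             # all zeros
a1 = 1.0 / (1.0 + np.exp(-z1))               # sigmoid: every hidden unit outputs exactly 0.5
y_hat = W2 @ a1 + b2

# Backward pass for a squared-error loss
delta2 = y_hat - y
grad_W2 = np.outer(delta2, a1)               # every entry identical, because a1 is constant
delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)   # exactly zero, because W2 is zero
grad_W1 = np.outer(delta1, x)                # exactly zero: the hidden weights never change

print(grad_W2)                               # identical values, so the output weights stay symmetric
print(grad_W1)                               # all zeros, so the hidden layer cannot learn at all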
Random Initialization
Random initialization breaks the symmetry and allows training to make progress, which improves accuracy. In this technique, the weights are initialized randomly but very close to zero.
w = np.random.randn(layer_size[l], layer_size[l-1]) * 0.01  # small values drawn from a standard normal, scaled close to zero
Python’s Keras library provides initializer classes that can be used while building deep learning models; they are covered in the next section.
Keras Random Initialization Methods
Below are the initializer classes available in Keras for setting the initial weights of a neural network, followed by a short usage example.
- RandomNormal class - Initializer that generates tensors with a normal distribution.
- RandomUniform class - Initializer that generates tensors with a uniform distribution.
- TruncatedNormal class - Initializer that generates a truncated normal distribution.
- Zeros class - Initializer that generates tensors initialized to 0.
- Ones class - Initializer that generates tensors initialized to 1.
- GlorotNormal class - The Glorot normal initializer, also called the Xavier normal initializer.
- GlorotUniform class - The Glorot uniform initializer, also called the Xavier uniform initializer.
- HeNormal class - He normal initializer.
- HeUniform class - He uniform variance scaling initializer.
- Identity class - Initializer that generates the identity matrix.
- Orthogonal class - Initializer that generates an orthogonal matrix.
- Constant class - Initializer that generates tensors with constant values.
- VarianceScaling class - Initializer capable of adapting its scale to the shape of the weights tensor.
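As a quick illustration of how these classes are used (the layer size and the distribution parameters are arbitrary), an initializer can be passed to a layer through its kernel_initializer and bias_initializer arguments:

from tensorflow import keras
from tensorflow.keras import initializers

# Dense layer whose weights start from a normal distribution and whose biases start at zero
layer = keras.layers.Dense(
    units=64,
    kernel_initializer=initializers.RandomNormal(mean=0.0, stddev=0.05),
    bias_initializer=initializers.Zeros(),
)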
Benefit of Random Initialization of Weights in a Neural Network Model
A good use case is word embeddings, which are used to vectorize tokens in NLP tasks. Pre-trained word embeddings such as Word2Vec or GloVe are trained on large text corpora, for example Wikipedia. They can be reused in NLP tasks, but they fail to capture the semantics of each individual task. An embedding layer initialized with random weights, by contrast, is gradually adjusted via backpropagation, structuring the embedding space into something the downstream model can exploit, and this structure specializes to the specific problem being solved.
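For instance, a randomly initialized, trainable embedding layer can be created in Keras as follows (the vocabulary size and embedding dimension are placeholder values):

from tensorflow import keras

# The embedding matrix starts as random values and is adjusted by backpropagation
# together with the rest of the model during training
embedding = keras.layers.Embedding(input_dim=10000, output_dim=128)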
Conclusion
In this blog, we learned how a neural network model optimizes its parameters using the stochastic gradient descent algorithm, why random initialization is needed in neural networks, and which random initialization methods Keras provides.