Hyperparameter optimization in Machine Learning Part-1: Algorithms
In this two-part article series, we are going to learn about hyperparameter optimization in machine learning. Part-1 covers the theoretical foundations and algorithms, while Part-2 will be devoted to practical aspects such as libraries, frameworks, and services for hyperparameter optimization.
Hyperparameter optimization for machine learning is a particular use case of black-box optimization. So what are hyperparameters, and what is black-box optimization? Let's start with black-box optimization. Here we have a function with some tunable parameters, but we cannot peer inside the function; we only want to find the best settings for those parameters. The only way to assess a setting is to apply it and run an experiment, which is usually expensive. Running the experiment returns an objective value (a performance measure) through which we can compare settings and find the best parameters. In short, the system is opaque and each evaluation can be costly.
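As a minimal sketch of this setting (not any particular library's API), the hypothetical run_experiment function below stands in for the expensive, opaque process; all the optimizer ever sees is the objective value it returns.

```python
import random

def run_experiment(params):
    # Hypothetical stand-in for the expensive, opaque process:
    # in practice this would train a model, run a simulation, etc.,
    # and return a single objective value (higher is better here).
    return -(params["x"] - 0.3) ** 2 + random.gauss(0, 0.01)

best_params, best_score = None, float("-inf")
for _ in range(20):                                 # each evaluation is assumed to be costly
    candidate = {"x": random.uniform(0.0, 1.0)}     # choose a parameter setting
    score = run_experiment(candidate)               # observe only the objective value
    if score > best_score:
        best_params, best_score = candidate, score

print(best_params, best_score)
```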
Let's take the example of an object detection model. In the image below, you can observe the various hyperparameters associated with the model, the training procedure, and suppression:
In machine learning, hyperparameter optimization is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a value used to control the learning process; by contrast, the values of other parameters (typically node weights) are learned from the data. Hyperparameters cannot be learned during training and must be set in advance, so their optimization can be formulated as black-box optimization. AutoML (the process of automating the time-consuming, iterative tasks of machine learning model development) also involves black-box optimization.
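To make the distinction concrete, here is a small sketch using scikit-learn (assuming it is installed): the regularization strength C is a hyperparameter fixed before training, the coefficients are learned parameters, and the hypothetical objective function wraps the whole training run as a black box.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hyperparameters are fixed before training (constructor arguments here) ...
model = LogisticRegression(C=0.5, max_iter=1000)

# ... while the model's parameters (the coefficients) are learned by fit().
model.fit(X, y)
print("learned coefficients shape:", model.coef_.shape)

# Hyperparameter optimization treats the whole pipeline as a black box:
# given a value of C, return a single score to be maximized.
def objective(C):
    clf = LogisticRegression(C=C, max_iter=1000)
    return cross_val_score(clf, X, y, cv=5).mean()

print("objective(0.5) =", objective(0.5))
```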
There are two main algorithmic approaches for black-box optimization:
- Bayesian Optimization
- Simple probabilistic algorithms
Bayesian Optimization
Bayesian Optimization is one of the most popular approaches, and it is among the best-performing methods for a broad range of problems.
As an example, take a single-dimensional parameter space and plot the objective function on the y-axis. In the Bayesian approach, we try to model the objective function. The green points show evaluated parameter values, and the black line is the objective function, which is hidden from us. We therefore try to come up with a region in which the objective function is believed to live (shaded in blue in the figure above). The problem then turns into an explore-exploit tradeoff of the kind studied in the multi-armed bandit problem.
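A rough sketch of this modeling step, assuming scikit-learn's GaussianProcessRegressor and a made-up one-dimensional objective: the predictive mean and standard deviation play the role of the blue shaded region.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hidden objective (in practice we can only evaluate it, not inspect it).
def objective(x):
    return np.sin(3 * x) + 0.5 * x

# A few observed (parameter, objective) pairs -- the "green points".
X_obs = np.array([[0.1], [0.4], [0.9], [1.5]])
y_obs = objective(X_obs).ravel()

# Fit a Gaussian-process surrogate to the observations.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-6)
gp.fit(X_obs, y_obs)

# The predictive mean and standard deviation give the shaded "region
# in which the objective function lives".
X_grid = np.linspace(0, 2, 200).reshape(-1, 1)
mean, std = gp.predict(X_grid, return_std=True)
print(mean[:3], std[:3])
```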
Now, what is the multi-armed bandit problem? In probability theory and machine learning, the multi-armed bandit problem is one in which a fixed set of resources must be allocated between competing choices in a way that maximizes the expected gain, when each choice's properties are only partially known at the time of allocation and may become better understood as time passes or as resources are allocated to that choice.
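As an illustration (not tied to any specific hyperparameter library), here is a toy epsilon-greedy strategy on a made-up three-armed bandit; the arm payout probabilities are assumptions purely for the example.

```python
import random

# Toy multi-armed bandit: each "arm" pays out with an unknown probability.
true_probs = [0.2, 0.5, 0.7]          # hidden from the agent
counts = [0] * len(true_probs)        # pulls per arm
values = [0.0] * len(true_probs)      # running mean reward per arm
epsilon = 0.1                         # exploration rate

for _ in range(1000):
    if random.random() < epsilon:
        arm = random.randrange(len(true_probs))                     # explore
    else:
        arm = max(range(len(true_probs)), key=lambda a: values[a])  # exploit
    reward = 1.0 if random.random() < true_probs[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean

print("estimated values:", values)
print("pulls per arm:", counts)
```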
One standard strategy is to find the maximum of the upper envelope of this region, shown as the star in the diagram, and evaluate the objective at that parameter value. As we can see, the objective function value at the star is not yet the best, so we add this observation, update the picture, and repeat.
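Continuing the surrogate sketch above, the loop below is one hedged way to implement this strategy with an upper-confidence-bound rule: pick the maximizer of the envelope, run the (here, synthetic) experiment, and refit the model with the new observation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def objective(x):                      # hidden objective; evaluation is "expensive"
    return np.sin(3 * x) + 0.5 * x

X_obs = np.array([[0.1], [0.4], [0.9], [1.5]])
y_obs = objective(X_obs).ravel()
X_grid = np.linspace(0, 2, 200).reshape(-1, 1)

# A few rounds of Bayesian optimization with an upper-confidence-bound rule.
for _ in range(5):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-6)
    gp.fit(X_obs, y_obs)
    mean, std = gp.predict(X_grid, return_std=True)
    ucb = mean + 2.0 * std                      # the "upper envelope"
    x_next = X_grid[np.argmax(ucb)]             # the "star": maximize the envelope
    y_next = objective(x_next).item()           # run the expensive experiment
    X_obs = np.vstack([X_obs, [x_next]])        # update the picture with the result
    y_obs = np.append(y_obs, y_next)

print("best observed:", X_obs[np.argmax(y_obs)], y_obs.max())
```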
Using a Gaussian Process to model the function
A Gaussian process is state-of-the-art for many small datasets, but there are other choices for modeling the objective function as well. In probability theory and statistics, a Gaussian process is a stochastic process (a collection of random variables indexed by time or space) such that every finite collection of those random variables has a multivariate normal distribution. The distribution of a Gaussian process is the joint distribution of all those (infinitely many) random variables, and as such it is a distribution over functions with a continuous domain, e.g. time or space.
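The sketch below illustrates that defining property with a hand-rolled squared-exponential kernel (an illustrative choice, not the only possible one): function values at a finite set of inputs are drawn from a multivariate normal whose covariance comes from the kernel.

```python
import numpy as np

# For any finite set of inputs, the function values under a Gaussian process
# follow a multivariate normal distribution whose covariance is given by a kernel.
def rbf_kernel(a, b, length_scale=0.5):
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * length_scale ** 2))

x = np.linspace(0, 2, 6)                         # a finite collection of input points
cov = rbf_kernel(x, x) + 1e-9 * np.eye(len(x))   # kernel matrix (+ jitter for stability)
samples = np.random.multivariate_normal(np.zeros(len(x)), cov, size=3)

print(samples.shape)   # 3 sampled "functions", each evaluated at the 6 points
```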
Simple Probabilistic Algorithms
One approach is a simple random search (random search draws random combinations of hyperparameter values from the specified ranges, then trains and scores the model with each), but there are better techniques, sketched in the code after this list:
- Uniform Random Sampling
- Ball Sampling (probabilistic gradient-free hill-climbing)
- Linear Combination Sampling: create a new point as a linear combination of existing points
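Here is a compact sketch of the three sampling schemes over a hypothetical two-dimensional search space; the bounds, perturbation radius, and blending weight are illustrative assumptions, not prescribed values.

```python
import random

# Hypothetical 2-D search space: each hyperparameter lies in [low, high].
BOUNDS = [(0.0, 1.0), (1e-4, 1e-1)]

def uniform_random_sample():
    # Uniform random sampling: draw each hyperparameter independently.
    return [random.uniform(lo, hi) for lo, hi in BOUNDS]

def ball_sample(center, radius=0.05):
    # Ball sampling: perturb a known good point to hill-climb without gradients.
    return [min(hi, max(lo, c + random.gauss(0, radius * (hi - lo))))
            for c, (lo, hi) in zip(center, BOUNDS)]

def linear_combination_sample(p, q):
    # Linear combination sampling: blend two existing points into a new one.
    w = random.random()
    return [w * a + (1 - w) * b for a, b in zip(p, q)]

best = uniform_random_sample()
print("random:", best)
print("ball around it:", ball_sample(best))
print("blend:", linear_combination_sample(best, uniform_random_sample()))
```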
Automated Early Stopping
For some problems, such as machine learning model training, the optimization is not a complete black box because we have access to additional information, namely the metrics observed during training.
In the accuracy vs. training time plot in the figure above, the solid lines depict runs whose training has completed, while the dotted line shows a run that is still in progress. By comparing the ongoing curve with the completed ones, we can decide to stop early if we are confident that this combination of hyperparameters is not going to give optimal results.
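One simple way to act on that information is a median-style stopping rule, sketched below with made-up accuracy curves: an ongoing trial is abandoned if it lags behind the median of the completed trials at the same epoch.

```python
# Accuracy per epoch for finished trials (the solid lines in the figure).
completed_curves = [
    [0.60, 0.70, 0.76, 0.80, 0.82],
    [0.55, 0.65, 0.72, 0.75, 0.78],
    [0.62, 0.72, 0.78, 0.81, 0.83],
]

def should_stop(partial_curve):
    # Compare the latest accuracy of an ongoing trial against the median
    # accuracy that completed trials had reached at the same epoch.
    epoch = len(partial_curve) - 1
    peers = sorted(curve[epoch] for curve in completed_curves)
    median = peers[len(peers) // 2]
    return partial_curve[-1] < median      # clearly lagging -> stop early

ongoing = [0.40, 0.48, 0.52]               # the dotted line: an unfinished trial
print("stop early:", should_stop(ongoing))
```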
Closing Thought: We encourage readers to create a neural network and perform hyperparameter optimization on a free GPU trial on E2E Cloud. To get your free credits, please contact: sales@e2enetworks.com