
Introduction to MLP and Optimization Algorithms in Deep Learning
Introduction
The supervised learning problem is to find a function $f^*$ such that for each input $\mathbf{x}$, we predict an output $\mathbf{y} = f^*(\mathbf{x})$. Problems whose output is a continuous value are regression problems, while problems whose output takes values in a finite set of given labels are classification problems.
This article focuses on the simplest type of deep learning model, the Multi-layer Perceptron (MLP). The goal of an MLP is to approximate the target function $f^*$ mentioned above. Specifically, a deep learning model defines a mapping $\mathbf{y} = f(\mathbf{x}; \theta)$, where $\theta$ represents the model parameters, and seeks the optimal $\theta$ such that $f$ best approximates $f^*$. This article also presents some optimization algorithms used to find the parameters of deep learning models in general.
MLP
Notation and concepts
Layers
An MLP consists of three kinds of layers: it starts with one input layer, may contain several hidden layers, and ends with one output layer. The figure below is an example of an MLP with 2 hidden layers:

The length of this “chain” of layers is called the depth of the model; the term deep learning originates from it. The depth is the number of hidden layers plus the output layer and is notated $L$. In the above example, we have $L = 3$. Each hidden layer has an activation function; the output layer may or may not have one. We present activation functions in detail in a later section.
Unit
A node of a layer is called a unit. We notate the input of the $i$-th unit at layer $l$ as $z_i^{(l)}$ and the corresponding unit's output as $a_i^{(l)} = f(z_i^{(l)})$, where $f$ is the activation function at that layer. Normally, we write the units of a layer as a vector: the input of layer $l$ is $\mathbf{z}^{(l)} \in \mathbb{R}^{d^{(l)}}$ and its output is $\mathbf{a}^{(l)} = f(\mathbf{z}^{(l)})$, where $f$ applies the activation function to each unit and $d^{(l)}$ is the number of units of layer $l$.
Weights and biases
In an MLP with $L$ layers, there are $L$ weight matrices and bias vectors, one pair per layer. The weight matrices are notated $\mathbf{W}^{(l)} \in \mathbb{R}^{d^{(l-1)} \times d^{(l)}}$, $l = 1, 2, \dots, L$, where $\mathbf{W}^{(l)}$ represents the connections from layer $l-1$ to layer $l$. Specifically, the element $w_{ij}^{(l)}$ of $\mathbf{W}^{(l)}$ represents the connection from the output of unit $i$ of layer $l-1$ to the input of unit $j$ of layer $l$. The bias of layer $l$ is notated $\mathbf{b}^{(l)} \in \mathbb{R}^{d^{(l)}}$. Training the model means finding these weights and biases.
The input of unit $j$ of layer $l$ is connected to the outputs of the units of the previous layer by the formula:

$$z_j^{(l)} = \sum_{i=1}^{d^{(l-1)}} w_{ij}^{(l)} a_i^{(l-1)} + b_j^{(l)}$$

We can rewrite this in matrix form as follows:

$$\mathbf{z}^{(l)} = \left(\mathbf{W}^{(l)}\right)^T \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$

where $\mathbf{a}^{(0)} = \mathbf{x}$ is the input vector of the model.
Activation function
We notate $f^{(l)}$ as the activation function of layer $l$. The output of each unit is calculated by:

$$a_i^{(l)} = f^{(l)}\left(z_i^{(l)}\right)$$

We can rewrite this in matrix form as follows:

$$\mathbf{a}^{(l)} = f^{(l)}\left(\mathbf{z}^{(l)}\right)$$

where $\mathbf{a}^{(l)}, \mathbf{z}^{(l)} \in \mathbb{R}^{d^{(l)}}$ and the activation function is applied to each element of the vector.
Here we list some common activation functions; a short code sketch follows the list:
- Sigmoid function
The sigmoid function: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$. Its output lies in the range $(0, 1)$: the larger the input, the closer the output is to 1, and the smaller the input, the closer the output is to 0.

- Tanh function
The tanh function: $\tanh(x) = 2\sigma(2x) - 1$, where $\sigma$ denotes the sigmoid function. Its output lies in the range $(-1, 1)$: the larger the input, the closer the output is to 1, and the smaller the input, the closer the output is to -1.

- ReLU function
The ReLU function: $\text{ReLU}(x) = \max(0, x)$. The output equals the input if the input is greater than 0; otherwise the output is 0.

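To make these definitions concrete, here is a minimal NumPy sketch of the three activation functions above (the function names are our own choice):

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^(-x)), output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # equivalent to 2 * sigmoid(2x) - 1, output in (-1, 1)
    return np.tanh(x)

def relu(x):
    # max(0, x), applied element-wise
    return np.maximum(0.0, x)
```

All three operate element-wise, so they can be applied directly to the layer vectors $\mathbf{z}^{(l)}$ defined earlier.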
Cost function
Choosing the cost function is essential because it defines the objective the model optimizes: it directly affects how the model is trained, how its performance is measured, and how it converges. Different problems call for different cost functions.
For regression problems, commonly used cost functions are MAE and MSE. The symbols $N, M$ denote the number of data samples and the dimension of the output vector, respectively.
- MAE (Mean Absolute Error): $\text{MAE} = \dfrac{1}{NM} \displaystyle\sum_{i=1}^{N} \sum_{j=1}^{M} \left| y_{ij} - \hat{y}_{ij} \right|$
- MSE (Mean Squared Error): $\text{MSE} = \dfrac{1}{NM} \displaystyle\sum_{i=1}^{N} \sum_{j=1}^{M} \left( y_{ij} - \hat{y}_{ij} \right)^2$
For classification problems, commonly used cost functions are Cross Entropy and Binary Cross Entropy, where Binary Cross Entropy is Cross Entropy specialized to 2-class problems. The symbols $N, M$ denote the number of data samples and the number of classes, respectively. A short code sketch of all four cost functions follows the list.
- Cross Entropy: $\text{CE} = -\dfrac{1}{N} \displaystyle\sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log \hat{y}_{ij}$
- Binary Cross Entropy: $\text{BCE} = -\dfrac{1}{N} \displaystyle\sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log \left(1 - \hat{y}_i\right) \right]$
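Below is a minimal NumPy sketch of these four cost functions, assuming `Y` holds the true outputs (one-hot labels in the cross entropy case) and `Y_hat` holds the model's predictions; the small `eps` added inside the logarithms is a numerical-safety detail of this sketch, not part of the formulas:

```python
import numpy as np

def mae(Y, Y_hat):
    # Mean Absolute Error, averaged over all N * M entries
    return np.mean(np.abs(Y - Y_hat))

def mse(Y, Y_hat):
    # Mean Squared Error, averaged over all N * M entries
    return np.mean((Y - Y_hat) ** 2)

def cross_entropy(Y, Y_hat, eps=1e-12):
    # Y: one-hot labels of shape (N, M); Y_hat: predicted class probabilities
    return -np.mean(np.sum(Y * np.log(Y_hat + eps), axis=1))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # y: labels in {0, 1}; y_hat: predicted probabilities of class 1
    return -np.mean(y * np.log(y_hat + eps) + (1.0 - y) * np.log(1.0 - y_hat + eps))
```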
Feed Forward
After determining the basic components of an MLP (the number of layers, the number of units in each layer, the activation function of each layer, and the cost function), the input is passed from the input layer to the output layer to obtain the model's predictions. Normally, a batch of $n$ data samples is passed into the model at the same time. The input of the model is notated $\mathbf{X} \in \mathbb{R}^{d^{(0)} \times n}$, with one column per sample.
For each layer $l = 1, 2, \dots, L$, compute:

$$\mathbf{Z}^{(l)} = \left(\mathbf{W}^{(l)}\right)^T \mathbf{A}^{(l-1)} + \mathbf{B}^{(l)}, \qquad \mathbf{A}^{(l)} = f^{(l)}\left(\mathbf{Z}^{(l)}\right)$$

where $\mathbf{A}^{(0)} = \mathbf{X}$. The matrix $\mathbf{B}^{(l)}$ has $n$ identical columns, each equal to the bias vector $\mathbf{b}^{(l)}$.
The model's output is passed into the cost function to determine the model's performance:

$$J = J\left(\mathbf{Y}, \hat{\mathbf{Y}}\right)$$

where $\hat{\mathbf{Y}} = \mathbf{A}^{(L)}$ is the model's prediction and $\mathbf{Y}$ contains the true outputs.
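Here is a minimal sketch of the feed forward pass under the conventions above; broadcasting the bias vector over the $n$ columns plays the role of the matrix $\mathbf{B}^{(l)}$:

```python
import numpy as np

def feed_forward(X, weights, biases, activations):
    # X: input of shape (d0, n), one column per sample.
    # weights[l]: W^{(l+1)} of shape (d_l, d_{l+1}).
    # biases[l]: b^{(l+1)} of shape (d_{l+1}, 1).
    # activations[l]: element-wise activation function f^{(l+1)}.
    A = X                        # A^{(0)} = X
    for W, b, f in zip(weights, biases, activations):
        Z = W.T @ A + b          # broadcasting b plays the role of B^{(l)}
        A = f(Z)
    return A                     # Y_hat = A^{(L)}
```

For example, `feed_forward(X, [W1, W2], [b1, b2], [relu, sigmoid])` would implement a network with one hidden layer.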
Back Propagation
A common method to optimize the model parameters is the gradient descent algorithm, whose main idea is to update the parameters based on the gradient of the cost value computed in the feed forward step with respect to each parameter. Back propagation is the algorithm used to carry out this calculation. Specifically, we need to compute:

$$\frac{\partial J}{\partial \mathbf{W}^{(l)}}, \qquad \frac{\partial J}{\partial \mathbf{b}^{(l)}}, \qquad l = 1, 2, \dots, L$$
First, we calculate the derivative of the cost value with respect to the input of the output layer. Notice that $\hat{\mathbf{Y}} = \mathbf{A}^{(L)} = f^{(L)}\left(\mathbf{Z}^{(L)}\right)$, so:

$$\mathbf{E}^{(L)} = \frac{\partial J}{\partial \mathbf{Z}^{(L)}} = \frac{\partial J}{\partial \hat{\mathbf{Y}}} \odot f^{(L)\prime}\left(\mathbf{Z}^{(L)}\right)$$
For each layer $l = L-1, \dots, 1$, the error term propagates backward:

$$\mathbf{E}^{(l)} = \frac{\partial J}{\partial \mathbf{Z}^{(l)}} = \left(\mathbf{W}^{(l+1)} \mathbf{E}^{(l+1)}\right) \odot f^{(l)\prime}\left(\mathbf{Z}^{(l)}\right)$$

and the gradient with respect to the weights of every layer $l = 1, \dots, L$ is:

$$\frac{\partial J}{\partial \mathbf{W}^{(l)}} = \mathbf{A}^{(l-1)} \left(\mathbf{E}^{(l)}\right)^T$$

Here $\odot$ denotes the Hadamard product (element-wise product): each element of one matrix is multiplied by the corresponding element of the other.
Note that the matrix $\mathbf{B}^{(l)}$ has identical columns equal to the bias vector $\mathbf{b}^{(l)}$; therefore the gradient with respect to $\mathbf{b}^{(l)}$ equals the sum of the columns of $\mathbf{E}^{(l)}$:

$$\frac{\partial J}{\partial \mathbf{b}^{(l)}} = \sum_{k=1}^{n} \mathbf{e}_k^{(l)}$$

where $\mathbf{e}_k^{(l)}$ is the $k$-th column of $\mathbf{E}^{(l)}$.
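Putting the feed forward and back propagation formulas together, here is a minimal sketch that computes all the gradients; for simplicity it assumes the same activation $f$ at every layer and an MSE cost averaged over the $n$ samples (both are assumptions of this sketch, not requirements of the algorithm):

```python
import numpy as np

def back_propagation(X, Y, weights, biases, f, f_prime):
    # Forward pass, caching the inputs Z^{(l)} and outputs A^{(l)} of each layer.
    A = [X]          # A[0] is A^{(0)} = X
    Z = []           # Z[l-1] is Z^{(l)}
    for W, b in zip(weights, biases):
        Z.append(W.T @ A[-1] + b)
        A.append(f(Z[-1]))

    n = X.shape[1]
    # dJ/dY_hat for an MSE cost averaged over n samples (assumption of this sketch).
    dY_hat = 2.0 * (A[-1] - Y) / n

    grads_W, grads_b = [], []
    E = dY_hat * f_prime(Z[-1])                          # E^{(L)}
    for l in reversed(range(len(weights))):
        grads_W.insert(0, A[l] @ E.T)                    # dJ/dW = A^{(l-1)} E^{(l)T}
        grads_b.insert(0, E.sum(axis=1, keepdims=True))  # sum of columns of E^{(l)}
        if l > 0:
            E = (weights[l] @ E) * f_prime(Z[l - 1])     # propagate to E^{(l-1)}
    return grads_W, grads_b
```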
Gradient Descent
In deep learning, the optimization problem to be solved is finding the model parameters that minimize the value of the cost function. The optimization algorithms used in deep learning belong to the family of gradient descent algorithms. In this section, the author introduces how the gradient descent algorithm works, common challenges of optimization problems in deep learning, and several gradient descent variants.
Gradient descent algorithm
Consider a function $f(x)$ and assume it has a first derivative at every $x$. We have the linear approximation formula:

$$f(x + \Delta x) \approx f(x) + f'(x)\,\Delta x$$

with $f'(x)$ being the first derivative of $f$ with respect to $x$. We choose $\Delta x = -\alpha f'(x)$ with $\alpha$ being a small enough positive number:

$$f\left(x - \alpha f'(x)\right) \approx f(x) - \alpha \left(f'(x)\right)^2 \le f(x)$$

It can be seen that choosing $\Delta x = -\alpha f'(x)$ reduces the value of $f$, thereby moving $x$ towards a local minimum. Simply put, the derivative of a function at a point indicates the direction in which the function increases, so we go in the opposite direction.
In deep learning, the term gradient is often used instead of derivative. Applied to the optimization problem in deep learning, the weights of the model are updated by subtracting the gradient of the cost function multiplied by the learning rate $\alpha$:

$$\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$$

(Illustration of gradient descent: from any initial weight, at each step the algorithm computes the gradient of the cost function at the current point and updates the weight accordingly. The process continues until the cost function can no longer be reduced.)
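As a concrete illustration, the sketch below minimizes a simple one-dimensional function with gradient descent; the target function, starting point, and learning rate are illustrative choices:

```python
def gradient_descent(grad, x0, alpha=0.1, n_steps=100):
    # Repeatedly step in the direction opposite the derivative.
    x = x0
    for _ in range(n_steps):
        x = x - alpha * grad(x)
    return x

# Example: minimize f(x) = x^2 + 2x, whose derivative is 2x + 2.
# The true minimum is at x = -1.
print(gradient_descent(lambda x: 2 * x + 2, x0=5.0))  # approximately -1.0
```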
Challenges in Neural Network Optimization
Local Minima
One of the most prominent features of a convex optimization problem is that it can be reduced to the problem of finding a local minimum: any local minimum is guaranteed to be a global minimum. Neural networks, however, may have an extremely large number of local minima. Local minima become a problem when they have high cost compared to the global minimum, since the algorithm can get stuck in them and never find the global minimum.

(The example shows that when the optimization algorithm reaches a local minimum, it gets stuck there and hence cannot find the global minimum of the cost function.)
However, experts now suspect that, for sufficiently large neural networks, most local minima have a low cost function value, and that it is not important to find a true global minimum; it suffices to find a point in parameter space with low, though not minimal, cost.
Saddle Points and Other Flat Regions
For many high-dimensional nonconvex functions, local minima (and maxima) are in fact rare compared to another kind of point with zero gradient: a saddle point. A saddle point is a critical point where the gradient is zero but which is neither a local minimum nor a local maximum. In high-dimensional spaces, local minima are rare and saddle points are more common. We can think of a saddle point as a local minimum along one cross-section of the cost function and a local maximum along another.

(The image shows three situations: a local minimum, a local maximum, and a saddle point. In the region near a saddle point the gradient is small, which makes the algorithm converge more slowly.)
There may also be wide, flat regions of constant value. In these locations, the gradient and the Hessian are all zero. Such degenerate locations pose major problems for all numerical optimization algorithms. In a convex problem, a wide, flat region must consist entirely of global minima, but in a general optimization problem, such a region could correspond to a high value of the objective function.
Long-term Dependencies
Another difficulty that neural network optimization algorithms must overcome arises when the computational graph becomes extremely deep. Feedforward networks with many layers have such deep computational graphs.
During backpropagation, gradients are repeatedly multiplied across many layers. If the weights are large, the gradients may grow exponentially. This leads to extremely large updates to the model's weights, making the learning process unstable and often resulting in numerical overflow.
On the opposite side, if the weights are very small, the gradients may approach zero as they move towards the input layers, so the weights in the early layers are barely updated. Note that the choice of activation function can also cause small gradients: the derivatives of the sigmoid and tanh functions are close to zero when the input is extremely large or small.
Some Gradient Descent Algorithms
Momentum
The method of momentum is designed to accelerate learning, especially in the face of high curvature, small but consistent gradients, or noisy gradients. The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction.
Formally, the momentum algorithm introduces a variable $\mathbf{v}$ that plays the role of velocity: the direction and speed at which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the negative gradient. The name momentum derives from a physical analogy, in which the negative gradient is a force moving a particle through parameter space according to Newton's laws of motion. Momentum in physics is mass times velocity; in the momentum learning algorithm we assume unit mass, so the velocity vector $\mathbf{v}$ may also be regarded as the momentum of the particle. A hyperparameter $\beta \in [0, 1)$ determines how quickly the contributions of previous gradients exponentially decay. Given a global learning rate $\alpha$, for each step:

$$\mathbf{v} \leftarrow \beta \mathbf{v} - \alpha \nabla_\theta J(\theta), \qquad \theta \leftarrow \theta + \mathbf{v}$$
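A minimal sketch of one momentum update step; `grad` is assumed to return $\nabla_\theta J(\theta)$ for a NumPy parameter array, and the default hyperparameter values are common choices rather than prescriptions:

```python
def momentum_step(theta, v, grad, alpha=0.01, beta=0.9):
    # v accumulates an exponentially decaying average of past negative gradients
    v = beta * v - alpha * grad(theta)
    theta = theta + v
    return theta, v
```

Note that `v` must persist across steps, since it carries the accumulated history.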
AdaGrad
The AdaGrad algorithm individually adapts the learning rates of all model parameters by scaling them inversely proportional to the square root of the sum of all historical squared values of the gradient. Parameters with the largest partial derivatives of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease. The net effect is greater progress in the more gently sloped directions of parameter space.
Given a small constant $\delta$ (for example $10^{-7}$) to avoid division by zero, we initialize the gradient accumulation variable $\mathbf{r} = \mathbf{0}$. For each step:

$$\mathbf{g} = \nabla_\theta J(\theta), \qquad \mathbf{r} \leftarrow \mathbf{r} + \mathbf{g} \odot \mathbf{g}, \qquad \theta \leftarrow \theta - \frac{\alpha}{\delta + \sqrt{\mathbf{r}}} \odot \mathbf{g}$$
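A minimal sketch of one AdaGrad step under the same assumptions as the momentum sketch; `r` must persist across steps:

```python
import numpy as np

def adagrad_step(theta, r, grad, alpha=0.01, delta=1e-7):
    g = grad(theta)
    r = r + g * g                                     # accumulate squared gradients
    theta = theta - alpha / (delta + np.sqrt(r)) * g  # per-parameter learning rates
    return theta, r
```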
RMSProp
The RMSProp algorithm modifies AdaGrad to perform better in the nonconvex setting by changing the gradient accumulation into an exponentially weighted moving average. AdaGrad shrinks the learning rate according to the entire history of the squared gradient and may have made the learning rate too small before arriving at a locally convex structure. RMSProp uses an exponentially decaying average to discard history from the extreme past, so that it can converge rapidly after finding a convex bowl, as if it were an instance of AdaGrad initialized within that bowl. Given a decay rate $\rho$, with the other parameters defined as in AdaGrad, for each step:

$$\mathbf{g} = \nabla_\theta J(\theta), \qquad \mathbf{r} \leftarrow \rho \mathbf{r} + (1 - \rho)\, \mathbf{g} \odot \mathbf{g}, \qquad \theta \leftarrow \theta - \frac{\alpha}{\sqrt{\delta + \mathbf{r}}} \odot \mathbf{g}$$
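The corresponding sketch for RMSProp differs from AdaGrad only in how `r` is accumulated; again, the default values are common choices, not prescriptions:

```python
import numpy as np

def rmsprop_step(theta, r, grad, alpha=0.001, rho=0.9, delta=1e-6):
    g = grad(theta)
    r = rho * r + (1.0 - rho) * g * g        # exponentially weighted moving average
    theta = theta - alpha / np.sqrt(delta + r) * g
    return theta, r
```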
Adam
In the context of the earlier algorithms, Adam is perhaps best seen as a variant of the combination of RMSProp and momentum, with a few important distinctions. First, in Adam, momentum is incorporated directly as an estimate of the first-order moment (with exponential weighting) of the gradient. The most straightforward way to add momentum to RMSProp would be to apply momentum to the rescaled gradients, but the use of momentum in combination with rescaling does not have a clear theoretical motivation. Second, Adam includes bias corrections to the estimates of both the first-order moments (the momentum term) and the (uncentered) second-order moments to account for their initialization at the origin.
Given exponential decay rates $\beta_1, \beta_2 \in [0, 1)$ for the moment estimates and a small constant $\delta$ for numerical stabilization, we initialize two moment accumulation variables $\mathbf{s} = \mathbf{0}$ and $\mathbf{r} = \mathbf{0}$. For each step $t = 1, 2, \dots$:

$$\mathbf{g} = \nabla_\theta J(\theta), \qquad \mathbf{s} \leftarrow \beta_1 \mathbf{s} + (1 - \beta_1)\, \mathbf{g}, \qquad \mathbf{r} \leftarrow \beta_2 \mathbf{r} + (1 - \beta_2)\, \mathbf{g} \odot \mathbf{g}$$

$$\hat{\mathbf{s}} = \frac{\mathbf{s}}{1 - \beta_1^t}, \qquad \hat{\mathbf{r}} = \frac{\mathbf{r}}{1 - \beta_2^t}, \qquad \theta \leftarrow \theta - \alpha \frac{\hat{\mathbf{s}}}{\sqrt{\hat{\mathbf{r}}} + \delta}$$
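A minimal sketch of one Adam step, with `s`, `r`, and the step counter `t` persisting across steps; the defaults are the values commonly suggested in the literature:

```python
import numpy as np

def adam_step(theta, s, r, t, grad, alpha=0.001,
              beta1=0.9, beta2=0.999, delta=1e-8):
    # t is the step counter, starting at 1, needed for the bias corrections
    g = grad(theta)
    s = beta1 * s + (1.0 - beta1) * g        # first-moment estimate
    r = beta2 * r + (1.0 - beta2) * g * g    # second-moment estimate
    s_hat = s / (1.0 - beta1 ** t)           # bias-corrected first moment
    r_hat = r / (1.0 - beta2 ** t)           # bias-corrected second moment
    theta = theta - alpha * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r
```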
References
- Ian Goodfellow, Yoshua Bengio, and Aaron Courville (2016). Deep Learning. MIT Press.
- Multi-layer Perceptron và Backpropagation (Feb 24, 2017). https://machinelearningcoban.com/2017/02/24/mlp