$ \newcommand{\CC}{\mathcal{C}} \newcommand{\CF}{\mathcal{F}} \newcommand{\CI}{\mathcal{I}} \newcommand{\CJ}{\mathcal{J}} \newcommand{\CS}{\mathcal{S}} \newcommand{\RR}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\P}{\mathbb{P}} \newcommand{\In}{\mathbf{1}} \newcommand{\df}{\nabla f} \newcommand{\ddf}{\nabla^2 f} \newcommand{\ra}{\rightarrow} \newcommand{\la}{\leftarrow} \newcommand{\inv}{^{-1}} \newcommand{\maximize}{\mbox{maximize }} \newcommand{\minimize}{\mbox{minimize }} \newcommand{\subto}{\mbox{subject to }} \newcommand{\tr}{^{\intercal}} $

Data Sampling and Optimization

In recent years, a vast amount of data has been collected. This affects the optimization problems solved in various fields, from finance to machine learning. A considerable fraction of data-driven large-scale problems arises from parameter estimation models. These estimations require minimizing the divergence of the predicted parameter values from their observed values. As the data is large, a typical objective function involves the summation of a huge number of functions.

Let us give an example problem from speech recognition. Consider a set of training data consisting of a few milliseconds of speech frames $(x^{(k)}; y^{(k)})$, $k=1, \dots, N$. Here, $x^{(k)} \in \RR^n$ is the feature vector and $y^{(k)} \in \CC$ is the label representing the phonetic state. The goal is to maximize the conditional probability of the correct phonetic state given the observed feature vector.

The mathematical model defines a weight vector $\theta_l \in \RR^n$ for each label $l \in \CC$, and assigns each frame $k$ a probability of having a certain label $j \in \CC$:

$$ \P\left(y^{(k)}=j~|~x^{(k)}; \theta\right) = \frac{\exp{(\theta_j\tr x^{(k)})}}{\sum_{l \in \CC}\exp{(\theta_l\tr x^{(k)})}}. $$

For notational convenience, we store $\theta$ as a $|\CC| \times n$ matrix. Then, the scaled log-likelihood function is given by

$$ \frac{1}{N} \sum_{k=1}^{N} \underset{l_k(\theta)}{\underbrace{\sum_{j \in \CC} \In\{y^{(k)} = j\} \log\frac{\exp{(\theta_j\tr x^{(k)})}}{\sum_{l \in \CC}\exp{(\theta_l\tr x^{(k)})}}}}, $$

where $\In$ is the indicator function that takes the value 1 if the statement passed to it is true, and 0 otherwise. Then, the function

$$ J(\theta) = -\frac{1}{N} \sum_{k=1}^{N} l_k(\theta) $$

is called the sample average approximation. Note the minus sign in front of the function, since we shall later minimize the negative log-likelihood instead of maximizing the log-likelihood. In fact, $J(\theta)$ is an approximation to the expected loss

$$ \E[l(\theta)] = -\int_z l(\theta | z)~p(z)~dz, $$

where $p$ is the unknown distribution of the data. Under reasonable assumptions, we have

$$ \P\left(\lim_{N \ra \infty} J(\theta) = \E[l(\theta)]\right) = 1. $$

When we minimize the sample average approximation, we obtain the optimal solution

$$ \theta^* = \arg\min_{\theta} J(\theta), $$

which is also known as the maximum likelihood estimator. Here is the sequence of approximations that we expect from this learning problem

$$ J(\theta^*) \approx \E[l(\theta^*)] \approx \min_{\theta}\E[l(\theta)]. $$

The leftmost value is obtained with the empirical objective function, and the middle value is obtained with the true objective function. The rightmost one is the actual optimization problem that we would like to solve. However, this problem is not available to us, since the data distribution is not known.

Differentiable Problem

A conventional gradient descent approach for minimizing $J$ sets an initial $\theta^{(1)}$ and then proceeds at iterations $i=1, 2, \dots$ with the following relation

$$ \theta^{(i+1)} = \theta^{(i)} - \alpha_i \nabla J(\theta^{(i)}) = \theta^{(i)} - \alpha_i \frac{1}{N} \sum_{k=1}^{N} \nabla l_k(\theta^{(i)}), $$

where the step length $\alpha_i$ is also known as the learning rate. As we need to go through the entire data set at every iteration, this method is called batch gradient descent. The learning rate can be computed by line search or set to an appropriate constant. In general, however, the sequence of $\alpha_i$ satisfies the following two conditions:

$$ \alpha_i \underset{i \uparrow \infty}{\ra} 0 ~\mbox{ and } \sum_{i=1}^{\infty} \alpha_i = \infty. $$

A simple but poor choice is $\alpha_i = i^{-1}$ as it converges to 0 very fast. Another choice is $\alpha_i = 10^{-4} i^{-0.5}$.

In order to solve this problem with gradient descent, we need to evaluate $\nabla J(\theta)$. It is not difficult to show that

$$ \nabla_{\theta_j} J(\theta) = -\frac{1}{N} \sum_{k=1}^{N} \left(x^{(k)}\left(\In\{y^{(k)}=j\} - \frac{\exp{(\theta_j\tr x^{(k)})}}{\sum_{l \in \CC}\exp{(\theta_l\tr x^{(k)})}}\right)\right). $$

Thus, we can apply the gradient descent algorithm for each $j \in \CC$ by

$$ \theta_{j}^{(i+1)} = \theta_j^{(i)} - \alpha_i \nabla_{\theta_j} J(\theta^{(i)}). $$

Homework: Derive the formula for $\nabla_{\theta_j} J(\theta)$.

Now, we generate a synthetic multiclass classification problem to test our batch gradient descent algorithm.

In [1]:
pkg load statistics;
xc = [1, 4; 4, 3; 3, 1]; % Centroids
N = 100; n=2; 
C = 3; % Three labels
x = zeros(N, n);
y = zeros(N, 1);
clf; hold on;
for k=1:N
    l = randperm(C, 1); % Select one class randomly
    x(k, :) = mvnrnd(xc(l, :), 0.2*eye(2));
    y(k) = l;
    if (l==1)
        scatter(x(k, 1), x(k, 2), 20, 'b');
    elseif (l==2)
        scatter(x(k, 1), x(k, 2), 20, 'r');
    else
        scatter(x(k, 1), x(k, 2), 20, 'g');
    end
end
(Figure: scatter plot of the 100 generated points, colored by class: blue, red, and green.)

There are three classes in this elementary example. Below, we first set up the sample average approximation function and then solve the resulting model with a simple batch gradient descent.

To run the following code, you need to unzip this file into the same directory.
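
The saa routine called in the next cell is provided in the zip file and is not listed here. As a reference, the following is a minimal sketch of what such a routine could look like for this model, based on the gradient formula above. The name saa_sketch is hypothetical and the actual implementation in the zip file may differ; in 'bgd' mode the sketch averages the gradient over all $N$ samples, while in 'sgd' mode it uses a single randomly selected component function.

function DJ = saa_sketch(x, y, theta, mode)
    % Hypothetical stand-in for the provided saa routine (the real one may differ).
    % x: N x n feature matrix, y: N x 1 labels in {1,...,C}, theta: C x n weights.
    [N, n] = size(x);
    C = size(theta, 1);
    if (strcmp(mode, 'sgd'))
        idx = randperm(N, 1);        % a single randomly selected component l_k
    else
        idx = 1:N;                   % the full batch
    end
    DJ = zeros(C, n);
    for k = idx
        s = exp(theta*x(k, :)');     % unnormalized class scores
        p = s/sum(s);                % softmax probabilities for frame k
        for j = 1:C
            DJ(j, :) = DJ(j, :) - ((y(k) == j) - p(j))*x(k, :);
        end
    end
    DJ = DJ/length(idx);             % average over the selected components
end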

In [4]:
addpath('./octave');                  % path to the helper functions from the zip file
theta0 = rand(C, n);                  % random initial weights
theta = theta0;
alpha = 1;                            % initial step length (learning rate)
maxiter = 100;
for i=1:maxiter
    DJ = saa(x, y, theta, 'bgd');     % full (batch) gradient of J at theta
    theta = theta - alpha*DJ;         % gradient descent step
    alpha = 1.0e-4*(1/sqrt(i));       % diminishing step length for the next iteration
    if (norm(DJ, 'inf')<1.0e-4)       % stop when the gradient is small
        break;
    end
end
theta
theta =

   0.69879   0.59049
   0.95486   0.66443
   1.35700   1.28932

We have obtained the parameter $\theta$. For a given point, we can now calculate the probability that it belongs to each of these classes.

In [5]:
xtest = [2; 1];                       % test point
fprintf('Point (%.2f, %.2f)\n', xtest(1), xtest(2));

denom = 0;
for l=1:C
    denom = denom + exp(theta(l, :)*xtest);   % softmax denominator
end

for j=1:C
    num = exp(theta(j, :)*xtest);             % class-j numerator
    fprintf('\tbelongs to class %d with probability %f\n', j, num/denom);
end
Point (2.00, 1.00)
	belongs to class 1 with probability 0.097093
	belongs to class 2 with probability 0.174467
	belongs to class 3 with probability 0.728440

One iteration of batch gradient descent evaluates the full gradient $\nabla J$, which requires $N \times |\CC|$ function calls. Stochastic gradient descent, on the other hand, simplifies this gradient calculation by sampling only one of the component functions. Suppose the component $\hat{k}$ is randomly selected; then the iterations proceed with the following relation

$$ \theta^{(i+1)} = \theta^{(i)} - \alpha_i \nabla l_{\hat{k}}(\theta^{(i)}). $$

Next, we solve the same problem with stochastic gradient descent. You should notice that the computation time with batch gradient descent is significantly longer than with stochastic gradient descent. This is expected behavior, as stochastic gradient descent requires only $|\CC|$ function calls per iteration.

In [7]:
theta = theta0;                       % start from the same initial point
alpha = 1;                            % initial step length (learning rate)
maxiter = 100;
for i=1:maxiter
    DJ = saa(x, y, theta, 'sgd');     % gradient of a single randomly selected l_k
    theta = theta - alpha*DJ;         % stochastic gradient step
    alpha = 1.0e-4*(1/sqrt(i));       % diminishing step length for the next iteration
    if (norm(DJ)<1.0e-4)              % stop when the sampled gradient is small
        break;
    end
end
theta
theta =

   0.079214  -0.029082
   0.131385  -0.159045
   2.647653   2.579971

This discussion of gradient-based methods is quite rough. For instance, instead of selecting just one $l_k$ randomly, it is possible to select a subset of $\{1, \dots, N\}$. This is known as the minibatch approach; see the sketch below. All these variants of gradient methods use first-order information. There are also matrix-free approaches that use second-order information. It is also easy to imagine parallel implementations of (mini)batch gradient methods.
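
As a rough illustration of the minibatch idea, the following sketch performs a single minibatch step for the multiclass model above. It reuses x, y, theta, N, C, and n from the earlier cells, computes the gradient directly rather than through the provided saa routine, and the minibatch size and step length below are arbitrary assumptions.

b = 10;                               % assumed minibatch size
alpha_b = 1.0e-2;                     % assumed step length for this single step
B = randperm(N, b);                   % random subset of {1,...,N}
DJ = zeros(C, n);
for k = B
    s = exp(theta*x(k, :)');          % unnormalized class scores
    p = s/sum(s);                     % softmax probabilities for frame k
    for j = 1:C
        DJ(j, :) = DJ(j, :) - ((y(k) == j) - p(j))*x(k, :);
    end
end
DJ = DJ/b;                            % average over the minibatch
theta = theta - alpha_b*DJ;           % one minibatch gradient step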

Nondifferentiable Problem

Consider now the binary classification problem, where $y^{(k)} \in \{-1, 1\}$. This problem can be modeled as a linear separation problem, where we try to split the points $x^{(k)}$ into two sets with a hyperplane. Formally,

$$ \left. \begin{array}{rcl} \theta\tr x^{(k)} - c \leq -1 & \implies & y^{(k)} = -1 \\ \theta\tr x^{(k)} - c \geq 1 & \implies & y^{(k)} = 1 \end{array} \right\} \quad (\theta\tr x^{(k)} - c)y^{(k)} \geq 1. $$

We next define

$$ l_k(\theta) = \max\{0, 1 - (\theta\tr x^{(k)} - c)y^{(k)}\}, $$

which is also known as the hinge loss. Then, the objective function becomes

$$ \min J_\lambda(\theta) = \frac{1}{N} \sum_{k=1}^N l_k(\theta) + \frac{\lambda}{2} \|\theta\|^2. $$

The last term above is used to avoid overfitting. Overall, this objective function is not differentiable due to the $\max$ function. However, we can use the subgradient method, since the subdifferential of each function $l_k$ is easy to evaluate:

$$ \partial l_k(\theta) = \left\{ \begin{array}{rl} -y^{(k)}x^{(k)}, & \mbox{if } (\theta\tr x^{(k)} - c)y^{(k)} < 1, \\ 0, & \mbox{if } (\theta\tr x^{(k)} - c)y^{(k)} > 1, \\ -\beta y^{(k)}x^{(k)}, ~\beta\in[0,1], & \mbox{if } (\theta\tr x^{(k)} - c)y^{(k)} = 1. \end{array} \right. $$

It is then relatively easy to write the iterations of the stochastic subgradient descent method. Suppose that at iteration $i$ the component $\hat{k}$ is randomly selected. Then, we have

$$ \theta^{(i+1)} = \theta^{(i)} - \alpha_i (g_{\hat{k}} + \lambda \theta^{(i)}), $$

where $g_{\hat{k}} \in \partial l_{\hat{k}}(\theta^{(i)})$.
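
As a concrete illustration of this update, here is a minimal sketch of a single stochastic subgradient step for the hinge loss. It assumes that the binary data x, y (with labels in $\{-1, 1\}$) and the offset c are available, that theta is an $n \times 1$ vector, and that the values of lambda and the step length below are arbitrary choices; at the kink, where the margin equals 1, it simply picks $\beta = 0$, which is one valid element of the subdifferential.

lambda = 0.1;                                 % assumed regularization weight
alpha_i = 1.0e-2;                             % assumed step length for this iteration
khat = randperm(N, 1);                        % randomly selected component
margin = (x(khat, :)*theta - c)*y(khat);      % (theta'x - c)y for sample khat
if (margin < 1)
    g = -y(khat)*x(khat, :)';                 % subgradient of the active hinge branch
else
    g = zeros(n, 1);                          % margin >= 1: 0 is a valid subgradient
end
theta = theta - alpha_i*(g + lambda*theta);   % stochastic subgradient step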

Here is a simple binary classification problem where the points $x^{(k)}$ are in $\RR^2$.

In [9]:
xc = rand(2)*4 + 1; % Centroids
N = 100; n=2; 
C = 2; % Two labels
x = zeros(N, n);
y = zeros(N, 1);
clf; hold on;
for k=1:N
    l = randperm(C, 1); % Select one class randomly
    x(k, :) = mvnrnd(xc(l, :), 0.2*eye(2));
    y(k) = l;
    if (l==1)
        scatter(x(k, 1), x(k, 2), 20, 'b');
    else
        scatter(x(k, 1), x(k, 2), 20, 'r');
    end
end
(Figure: scatter plot of the 100 generated points, colored by class: blue and red.)

Homework: Solve this problem with the stochastic subgradient descent method.


Papers for Next Week's Discussion

  • Bottou, L. (2012). Stochastic gradient descent tricks. In: Montavon, G., Orr, G. B. and Müller, K.-R. (eds.), Neural Networks: Tricks of the Trade, Reloaded, Lecture Notes in Computer Science (LNCS 7700), Springer, 430-445.
  • Le, Q. V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B. and Ng, A. Y. (2011). On optimization methods for deep learning, Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA.
  • Torralba, A., Fergus, R. and Weiss, Y. (2008). Small codes and large image databases for recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), Anchorage, AK, 1-8.
