$ \newcommand{\CC}{\mathcal{C}} \newcommand{\CF}{\mathcal{F}} \newcommand{\CI}{\mathcal{I}} \newcommand{\CL}{\mathcal{L}} \newcommand{\CJ}{\mathcal{J}} \newcommand{\CS}{\mathcal{S}} \newcommand{\RR}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\P}{\mathbb{P}} \newcommand{\In}{\mathbf{1}} \newcommand{\df}{\nabla f} \newcommand{\ddf}{\nabla^2 f} \newcommand{\ra}{\rightarrow} \newcommand{\la}{\leftarrow} \newcommand{\inv}{^{-1}} \newcommand{\maximize}{\mbox{maximize }} \newcommand{\minimize}{\mbox{minimize }} \newcommand{\subto}{\mbox{subject to }} \newcommand{\tr}{^{\intercal}} $

Function Decomposition

In this lecture, we shall cover methods that were originally proposed several decades ago. Recently, these methods have become popular again because they are apt for distributed implementation, and hence, can be used for solving large-scale problems. These notes borrow largely from Boyd et al. (2010).

Recall our discussion on Lagrangian relaxation. Suppose that we are given the following convex optimization problem:

$$ \begin{array}{ll} \minimize & f(x) \\ \subto & Ax = b, \end{array} $$

where $f:\RR^n \ra \RR$ is a convex function and $A$ is an $m \times n$ matrix. Then, using the Lagrange multiplier vector $\lambda \in \RR^m$, we can write the Lagrangian function

$$ \CL(x, \lambda) = f(x) + \lambda\tr (Ax - b) $$

and the Lagrangian dual objective function

$$ Z_L(\lambda) = \inf_{x \in \RR^n} ~ \CL(x, \lambda). $$

This leads to our Lagrangian dual problem

$$ \begin{array}{ll} \maximize & Z_L(\lambda) \\ \subto & \lambda \in \RR^m. \end{array} $$

When $f$ is strictly convex, the optimal objective function value of the dual problem is equal to the optimal objective function value of our original (primal) problem. In this case, given the dual optimal solution $\lambda^*$, we can obtain the primal optimal solution $x^*$ by solving

$$ x^* = \arg\underset{x}{\min} ~ \CL(x, \lambda^*). $$

The dual ascent method is used for solving the Lagrangian dual problem. At every iteration, we fix the dual variable and obtain the corresponding primal solution. Then, we update the dual variable using the obtained primal solution and proceed to the next iteration. If the Lagrangian dual objective function is differentiable, then at every iteration we do the following updates:

$$ \begin{array}{l} x^{(k+1)} = \arg\underset{x}{\min} ~ \CL(x, \lambda^{(k)}),\\ \lambda^{(k+1)} = \lambda^{(k)} + \alpha_k(Ax^{(k+1)} - b), \end{array} $$

where $\alpha_k > 0$ is the step-length. Note that the updates for the dual solution is nothing but the gradient ascent applied to the Lagrangian dual problem for fixed $x^{(k+1)}$. As we have covered in our past lectures, if the Lagrangian dual objective function is not differentiable, then we can still apply the dual ascent method but in that case we need to use the subgradients.

Here is a simple example. Consider rank-one factorization of a matrix. Suppose that one of the factors is given and we want to find the other factor that has to lie in the nullspace of another matrix. We can obtain a solution to this problem by solving the following problem:

$$ \begin{array}{ll} \minimize & \frac{1}{2}\|U - ux\tr\|^2 \\ \subto & Ax = 0, \end{array} $$

where $U$ is the $N \times n$ data matrix, $A$ is a $m \times n$ matrix and $u \in \RR^N$ is the fixed factor.

To run the following code, you need to unzip this file into the same directory.

In [4]:
addpath('./octave'); 
maxiter = 100; x0 = rand(3,1); lambda0 = rand(2,1);
A = [3, 2, 1; 0, -1, 1]; m=2; b = zeros(m, 1);
u = [4; -3; 2; 5]; N = 4;
xopt = [-1; 1; 1]; n = 3;
U = u*xopt';
fun = @(x, lambda)lagrangian(x, lambda, U, u, A, b);

x = x0; lambda = lambda0;
for k=1:maxiter
    alpha = 1.0e-2*(1/sqrt(k));
    x = fminunc(@(x)fun(x,lambda), x);
    lambda = lambda + alpha*(A*x - b);
end
[xopt'; x']
ans =

  -1.00000   1.00000   1.00000
  -1.04815   0.98491   0.96694

Now we are ready to introduce the first method of this lecture.

Dual Decomposition

Suppose that the objective function of the primal problem is separable. That is

$$ f(x) = \sum_{i=1}^N f_i(x_i). $$

Here, the variables $x_i$ are not necessarily scalars. They can as well be vectors with different dimensions and they partition the vector $x \in \RR^n$ into $N$ parts. We can apply the same partitioning to the matrix $A$ and write the constraints as

$$ Ax = \sum_{i=1}^N A_i x_i. $$

The Lagrangian function then becomes

$$ \CL(x, \lambda) = f(x) + \lambda\tr (Ax - b)= \sum_{i=1}^N \underset{\CL_i(x_i, \lambda)}{\underbrace{f_i(x_i) + \lambda\tr(A_ix_i - \frac{1}{N}b)}}. $$

Clearly, the Lagrangian function is also separable. Thus, if we apply the dual ascent method, then the updates for the primal solution can be done in parallel:

$$ \begin{array}{ll} \left. \begin{array}{l} x_1^{(k+1)} = \arg\underset{x_1}{\min} ~ \CL_1(x_1, \lambda^{(k)}),\\ \vdots \\ x_N^{(k+1)} = \arg\underset{x_N}{\min} ~ \CL_N(x_N, \lambda^{(k)}),\\ \end{array} \right\} & N \mbox{ independent problems} \\ \lambda^{(k+1)} = \lambda^{(k)} + \alpha_k(Ax^{(k+1)} - b). & \end{array} $$

Since we decompose the primal udpates in the dual ascent method, we call this approach dual decomposition. We can in fact try this method on our previous problem as its objective function can also be written as a separable function:

$$ f(x)=\frac{1}{2}\|U - ux\tr\|^2 = \sum_{i=1}^N\underset{f_i(x_i)}{\underbrace{\sum_{j=1}^n \frac{1}{2}(U_{ji} - u_jx_i)^2}}. $$

Homework: Solve our previous example with the dual decomposition method. You can use parfor instead of for to parallelize the $N$ independent optimization problems.

Method of Multipliers

The dual decomposition method requires several assumptions. Chief among these is the strict convexity of $f$. To overcome these restrictions, we work with an augmented problem:

$$ \begin{array}{ll} \minimize & f(x) + \frac{\rho}{2}\|Ax-b\|^2 \\ \subto & Ax = b, \end{array} $$

where $\rho > 0$ is called the penalty parameter. Then, the Lagrangian function for this penalized problem becomes

$$ \CL(x, \lambda; \rho) = f(x) + \lambda\tr (Ax - b) + \frac{\rho}{2}\|Ax-b\|^2, $$

which is called the augmented Lagrangian function. The Lagrangian dual objective function is then given by

$$ Z_L(\lambda; \rho) = \inf_{x \in \RR^n} ~ \CL(x, \lambda; \rho). $$

Here is the key role of the augmented Lagrangian function: The function $Z_L(\lambda; \rho)$ can be shown to be differentiable under weaker assumptions.

Next, we apply the dual ascent updates and obtain the relations

$$ \begin{array}{l} x^{(k+1)} = \arg\underset{x}{\min} ~ \CL(x, \lambda^{(k)}; \rho),\\ \lambda^{(k+1)} = \lambda^{(k)} + \rho(Ax^{(k+1)} - b). \end{array} $$

This approach is called the method of multipliers. Note that the penalty parameter $\rho$ is used as the fixed step length in the algorithm. This specific choice for the step length becomes clear when we observe the optimality conditions of the original problem. For simplicity, suppose that $f$ is differentiable. Then,

$$ \begin{array}{c} Ax^* - b = 0, \\ \nabla f(x^*) + A\tr\lambda^* = 0, \end{array} $$

where the first condition is the primal feasibility and the second condition is the dual feasibility. If we now check what happens at iteration $k+1$. Since $x^{(k+1)}$ is a stationary point of $\CL(x, \lambda^{(k)}; \rho)$, we have

$$ \nabla_x \CL(x^{(k+1)}, \lambda^{(k)}; \rho) = 0. $$

This implies

$$ \nabla_x~f(x^{(k+1)}) + A\tr(\lambda^{(k)} + \rho(Ax^{(k+1)}-b)) = \nabla_x~f(x^{(k+1)}) + A\tr\lambda^{(k+1)} = 0. $$

Thus, the dual feasibility is preserved throughout the iterations when we set the step length to $\rho$, and the iterations continue until we have primal feasibility. In other words, the iterations stop when $Ax^{(k)} - b$ converges to zero.

The method of multipliers enjoys much better convergence properties than the dual decomposition. However, due to the penalty term, the augmented Lagrangian function is not separable even if $f$ is separable. Therefore, it is not straightforward to come up with a distributed implementation. Recently, a new approach called alternating direction method of multipliers has been proposed to parallelize the iterations.

In [5]:
rho = 10;
fun = @(x,lambda)auglagrangian(x, lambda, U, u, A, b, rho);

x = x0; lambda = lambda0;
for k=1:maxiter
    x = fminunc(@(x)fun(x,lambda), x, options);
    lambda = lambda + rho*(A*x - b);
end
[xopt'; x']
ans =

  -1.00000   1.00000   1.00000
  -1.00000   1.00000   1.00000


Papers for Next Week's Discussion

  • Samar, S., Boyd, S. and D. Gorinevsky (2007). Distributed estimation via dual decomposition, Proceedings of the European Control Conference, Kos, Greece, 1511-1516.
  • Wei, E. and A. Ozdaglar (2012). Distributed alternating direction method of multipliers, 51st IEEE Conference on Decision and Control (CDC).
  • Sun, D. L. and C. Fèvotte (2014). Alternating direction method of multipliers for non-negative matrix factorization with the beta-divergence, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 6201 - 6205.

References

  1. Boyd, S., Parikh, N., Chu, E., Peleato, B. and J. Eckstein (2010), Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, Foundations and Trends in Machine Learning, 3(1), 1-122.