Nonlinear Programming with No Constraints - GOING DEEPER: INSIDE THE CODES AND THEORETICAL ASPECTS - Optimization in Practice with MATLAB for Engineering Students and Professionals (2015)

Optimization in Practice with MATLAB for Engineering Students and Professionals (2015)

PART 4


GOING DEEPER: INSIDE THE CODES AND THEORETICAL ASPECTS

12


Nonlinear Programming with No Constraints

12.1 Overview

Nonlinear programming is a technique used to solve optimization problems that involves nonlinear mathematical functions (e.g., objective functions or constraints, or both). The methods presented in this chapter are capable of solving unconstrained optimization problems with unimodal objectives. The procedures used for these methods are illustrated via minimization problems. Maximization problems can be converted to minimization problems by multiplying the objective functions by -1.

Unconstrained nonlinear programming has many practical applications. Many real life design problems have nonlinear objectives due to the nonlinearities in nature. Some design problems do not have constraints, or the constraints are negligible. Constrained optimization problems can be reformulated as unconstrained problems. The methods used to solve unconstrained nonlinear problems are powerful tools for solving optimization problems.

12.2 Necessary and Sufficient Conditions

Chapter 2 provides mathematical knowledge concerning the first and second derivatives of functions. Local optima can be determined by their derivatives. This chapter provides the conditions for local minima. These conditions involve the gradients (first derivative vectors) and Hessian matrices (second derivative matrices) for the objective functions. These conditions are the foundation for several of the algorithms described in this chapter (see Refs. [3, 1]).

Assume a function, f(x), is continuous and has continuous first and second derivatives. The function can be expressed as a Taylor expansion.

(12.1)

Neglecting the higher order terms, O3x), the objective function change can be expressed as

(12.2)

If x*is a local minimum, any other points in its neighborhood should produce a greater objective value, which can be expressed as

(12.3)

From Eq. 12.2, in order for the sign of Δf(x) to be known (e.g., positive for a minimum at x*) for arbitrary values of Δx, the first derivative of f(x) should be zero. Otherwise, Δf(x) can be forced to be positive or negative by changing the sign of Δx, and Eq. 12.3 cannot be satisfied. Therefore, a local minimum, x*, should satisfy the following first-order necessary conditions.

Theorem (First-Order Necessary Conditions)

If x*is a local minimum of f(x) and f(x) is continuously differentiable in an open neighborhood of x*, then the gradient ∇f(x*) = 0.

Since ∇f(x*) = 0 at the local minimum, Eq. 12.2 becomes

(12.4)

It should satisfy Δf(x) ≥ 0 since x*is a local minimum. Then, ∇2f(x*) should be positive semidefinite, which is stated as the second-order necessary conditions as follows.

Theorem (Second-Order Necessary Conditions)

If x*is a local minimum of f(x) and ∇2f(x) exists and is continuous in an open neighborhood of x*, then ∇f(x*) = 0 and ∇2f(x*) is positive semidefinite.

The second-order necessary conditions do not guarantee a local minimum. At x = 0, the function, f(x) = x3, satisfies the first-order and second-order necessary conditions. However, x = 0 is not a local minimum of f(x) = x3.

If the Hessian of f(x) is positive definite at the point where ∇f(x) = 0, then

(12.5)

Equation 12.5 guarantees the point, x*, is a strict local minimum. It is stated as the second-order sufficient condition. The second-order sufficient conditions are stronger than the second-order necessary conditions.

Theorem (Second-Order Sufficient Conditions)

Given f(x), if ∇2f(x) is continuous in an open neighborhood of x* and ∇f(x*) = 0; if ∇2f(x*) is positive definite, then x*is a strict local minimum of f(x).

12.3 Single Variable Optimization

Some unconstrained nonlinear optimization problems may only have a single variable. In this section, the methods for solving single variable nonlinear optimization problems are presented. We parenthetically note that the Bisection and the Golden search methods require bounds which, technically, are constraints. However, these bounds need not have physical meanings, and can be arbitrarily chosen for the sole purpose of implementing these algorithms.

12.3.1 Interval Reduction Methods

Bisection

The interval reduction methods are applicable to single variable nonlinear optimization problems whose variables have lower bounds and upper bounds.

The Bisection method (also known as the Interval Halving method) finds an extreme point in a bounded region for a unimodal single variable function. The function and its first derivative are assumed continuous. The Bisectionmethod successively halves the interval, and decides in which one of the two half intervals the extreme point exists. The Bisection method uses the first derivative of the function to determine in which half the extreme point lies.Figure 12.1 illustrates how to update the interval according to the sign of the first derivative. Suppose the variable x of a unimodal convex function f(x) is inside the interval [a,b]. The function f(x) and its derivative f′(x) are continuous over this interval. The first derivatives at the two ends satisfy the condition f′(a)f′(b) < 0. This condition implies that there is a minimum in the interval [a,b] (with 0 derivative). The Bisection method procedure forfinding the minimum is outlined as follows.

Figure 12.1. Interval Updates of the Bisection Method

1.Specify the convergence tolerance for an interval of length > 0. Specify the convergence tolerance of the gradient γ > 0. Set the iteration number, k, to 0. The bounds are l0 = a and r0 = b.

2.If rk - lk < , stop. The midpoint xk = is taken as the minimum, x*, and the corresponding function value f(x*) is the optimal solution.

3.Evaluate the gradient of the function at the midpoint f′(xk) = f′().

4.If f′(xk) < γ, stop. The corresponsing xk is considered the minimum, x*, and the corresponding function value, f(x*), is the optimal solution.

5.Evaluate the derivative f′(xk). If it is positive, let lk+1 = lk and rk+1 = xk. If it is negative, let lk+1 = xk and rk+1 = rk.

6.k = k + 1. Go to Step 2.

Example: The following optimization problem is used to illustrate the Bisection method.


(12.6)

Figure 12.2 presents the function curve in the defined region.

Figure 12.2. Plot of the Objective Function

The convergence tolerance of the interval length, , is set as 0.001. The convergence tolerance of the gradient at the midpoints, γ, is set as 0.01. Table 12.1 provides the end points, ak and bk, and the derivatives at each midpoint. The Bisection method stops after 10 iterations. At the 10th iteration, the distance between the two points, 0.0098, is larger than the convergence tolerance of the interval length, . The absolute value of the derivative at the midpoint is lessthan the convergence tolerance of the gradient at the midpoint, γ. It is determined that the optimal value is at the midpoint 3.9990. The corresponding function value is f(x*) = 15.0000.

Table 12.1. The End Points and the Derivatives at Each Midpoint



k

ak

bk

Derivative


1

0.0000

5.0000

-15.0000

2

2.5000

5.0000

-2.5000

3

3.7500

5.0000

3.7500

4

3.7500

4.3750

0.6250

5

3.7500

4.0625

-0.9375

6

3.9063

4.0625

-0.1563

7

3.9844

4.0625

0.2344

8

3.9844

4.0234

0.0391

9

3.9844

4.0039

-0.0586

10

3.9941

4.0039

-0.0098



Golden Section Search

The Golden Section Search is a method to minimize or maximize a unimodal function of one variable. The algorithm maintains the function values for triples of points whose distances form a golden ratio. This method does not require information about the derivatives. If the minimum is known to exist inside a region of the variable, this method successively narrows the range of function values inside which the minimum exists.

In mathematics, two quantities are in the golden ratio if the ratio of the sum of the quantities to the larger quantity is equal to the ratio of the larger quantity to the smaller quantity. Figure 12.3 illustrates the geometric relationshipthat defines the golden ratio. The total length is lac, the larger segment is lab, and the smaller segment is lbc. From Fig. 12.3, the golden ratio can be expressed as

Figure 12.3. Line Segments Divided According to the Golden Ratio

(12.7)

Solving Eq. 12.7, the value of φ is

(12.8)

The value of its conjugate is

(12.9)

Suppose the variable of a unimodal function f(x) is inside the interval [a,b], and f(x) is continuous over this interval. The minimum exists inside this interval. The steps of the Golden Section Search are illustrated as follows.

1.Select > 0 as the convergence tolerance of the interval. Set the iteration number, k, to 0. The bounds are a0 = a and b0 = b. Set l0 = b0 - τ(b0 - a0) and r0 = a0 + τ(b0 - a0).

2.If bk - ak < , stop. The midpoint, xk = , is considered the optimal value, x*; and the corresponding function value, f(x*), is the optimal solution.

3.If f(lk) > f(rk), the parameters are updated as shown in the f(lk) > f(rk) case in Fig. 12.4.

Figure 12.4. Interval Updates of the Golden Section Search

4.If f(lk) < f(rk), the parameters are updated as the f(lk) < f(rk) case in Fig. 12.4.

5.k = k +1, go to Step 2.

Inside each iteration, depending on f(lk) > f(rk) or f(lk) < f(rk), the objective function is evaluated at different new points. The two bounds of the interval are updated and the interval becomes shorter. At Step 2 of each iteration, the bounds are [ak,bk]. If f(lk) < f(rk), the right bound, bk, is updated in Step 4. The length of the new interval is

(12.10)

The length of the interval is reduced by a factor of τ at each iteration. The length reduction is similar to when f(lk) > f(rk).

The following observations are noteworthy. The Golden Section Search method requires one evaluation of the objective function at each iteration to determine the interval where the minimum lies; while the Bisection method requires one evaluation of the first derivative at each iteration.

Example: The same optimization example used for the Bisection method is again used to illustrate the Golden Section Search. The convergence tolerance rate, , is set as 0.1 for this example, which is different from the previousexample for the Bisection method.


(12.11)

Table 12.2 provides the four points, ak, bk, lk, and rk, and their function values. The Golden Section Search stops after 10 iterations. At the 10th iteration, the values of the objective function are compared at two points, l10 = 4.00and r10 = 4.03. It is determined that the optimal value is between the points, 3.95 and 4.03. The distance between the two points is less than the convergence tolerance, . Finally, the point x* = (3.95+4.03)2 = 3.99 is considered theoptimal value, and the corresponding function value is f(x*) = 15.00.

Table 12.2. Endpoints, Golden Section Points and Function Values at Each Iteration



k

ak

bk

f(ak )

f(bk )

lk

rk

f(lk )

f(rk )


1

0.00

10.00

95.00

195.00

3.82

6.18

15.16

38.77

2

0.00

6.18

95.00

38.77

2.36

3.82

28.44

15.16

3

2.36

6.18

28.44

38.77

3.82

4.72

15.16

17.60

4

2.36

4.72

28.44

17.60

3.26

3.82

17.72

15.16

5

3.26

4.72

17.72

17.60

3.82

4.16

15.16

15.13

6

3.82

4.72

15.16

17.60

4.16

4.38

15.13

15.71

7

3.82

4.38

15.16

15.71

4.03

4.16

15.01

15.13

8

3.82

4.16

15.16

15.13

3.95

4.03

15.01

15.01

9

3.95

4.16

15.01

15.13

4.03

4.08

15.01

15.03

10

3.95

4.08

15.01

15.03

4.00

4.03

15.00

15.01



12.3.2 Polynomial Approximations: Quadratic Approximation

If we assume an objective function is unimodal and continuous inside an interval, then it can be approximated by a polynomial. A is the simplest interpolation of the objective function (see Ref. [3]). The quadratic approximationmethod consists of a sequence of reducing intervals, and iterative approximations in the reduced intervals. As the approximation approaches the actual minimum, its accuracy increases. When the error between the approximation and the actual function is less than a predefined tolerance, terminate the iteration.

A quadratic approximating function can be constructed given three consecutive points, x1, x2, and x3, and their corresponding function values, f(x1), f(x2), and f(x3). The quadratic function is expressed as follows.

(12.12)

At the three points, x1, x2, and x3, (x) = f(x). The three constants, c0, c1, and c2, can be determined as follows.

At the point x1, since f(x1) = (x1) = c0, the first constant is c0 = f(x1). At the point x2, f(x2) = (x2) = c0+c1(x2-x1). The second constant, c1, can be evaluated as follows.

(12.13)

At the point x3, f(x3) = (x3) = c0+c1(x3-x1)+c2(x3-x1)(x3-x2). The third constant, c2, is obtained from the following equation.

(12.14)

Using the above quadratic approximation, the minimum point, x*, should satisfy the first-order necessary conditions.

(12.15)

Then, the minimum point, x*, can be expressed as

(12.16)

The quadratic approximation can be implemented in successively reduced intervals. A successive quadratic approximation method is outlined as follows.

1.Define the tolerance of the function successive-values difference, γ > 0, and the tolerance of the variable, > 0. Set Δx. Set the iteration number, k, to 1.

2.Set the initial point, x11. Compute x21 = x 11 + Δx. Evaluate f(x11) and f(x 21). If f(x11) > f(x21), then x 31 = x21 + Δx. If f(x11) < f(x 21), then x31 = x11x. Evaluate f(x31).

3.Compare the function values at the three points, x1k, x2k, and x 3k. Find the minimum, fmink = min{f(x 1k),f(x 2k),f(x 3k)}, and the corresponding point, xmink.

4.Using the three points, x1k, x 2k, and x3k, construct a quadratic approximation. Compute the minimum point, xk*, using Eq. 12.16. Evaluate f(xk*).

5.If f(xk*) - fmink < γ and xk*-x mink < , take xk* as the minimum. Terminate the iteration.

6.Take the current best point xk* and the two points bracketing it as the three points for the next quadratic approximation. k = k + 1. Go to Step 3.

Example: Use the quadratic approximation method to solve the following optimization problem. The variable, x, is inside (1.001, 10).

(12.17)

Define the tolerance of the function successive-values difference as 0.01f(xk*) and the tolerance of the variable as 0.05xk*. Set Δx = 1. Set the initial point, x11 = 2.

Define δfk and δxk as

(12.18)

(12.19)

The results for all the iterations are listed in Table 12.3.

Table 12.3. The Iterations for the Quadratic Approximation Method

At the end of the second iteration, the errors satisfy the tolerances. The iterations are terminated. The optimal point is 2.65 and the optimal function value is 12.57

12.4 Multivariable Optimization

Most practical unconstrained nonlinear optimization problems have multiple design variables. In this section, the methods for solving multivariable nonlinear optimization problems are illustrated.

12.4.1 Zeroth-Order Methods

Simplex Search Method

The Simplex Search method is a direct-search method that does not require function derivatives (see Ref. [4]). Although the name of this method includes the word simplex, it has no relationship to the Simplex method of linear programming.

In geometry, a Simplex is a triangle in two-dimensional space and a tetrahedron in three-dimensional space. In n-dimensional space, it is an n-dimensional polytope. The main idea of the Simplex method is that, at each iteration, based on the comparison of function values at all the vertices, one vertex is projected through the centroid of the other vertices at a suitable distance. At the beginning, the Simplex Search method sets up a regular Simplex in the space of the variables. The objective function is evaluated at each vertex. For minimization problems, the vertex with the highest function value is the one that is reflected through the centroid of the other vertices to generate a new point. The point and the remaining vertices construct the next Simplex.

At the beginning of the algorithm, it requires the first Simplex to be generated given a base point, x0, and an appropriate scale factor, α. Assuming the variable vector has N dimensions, two increments, δ1 and δ2, are defined by the following two expressions.

(12.20)

(12.21)

Using the two increments, the jth dimension for the ith vertex can be calculated using the following expression.

(12.22)

At each iteration, the vertex with the highest function value is selected as xhigh. It is reflected through the centroid of the remaining points. The centroid is evaluated by the following expression.

(12.23)

The line passing through xhigh and xcentroid is given by the following expression.

(12.24)

The selection of λ can yield the desired point on the line. If λ = 0, it yields the point, xhigh. If λ = 1, the result is the point, xcentroid. In order to generate a symmetric reflection, λ is set as 2. The reflected point, xreflected, is evaluated as

(12.25)

The reflection process is illustrated in Fig. 12.5. The point, x2, has the highest function value, and it is reflected through the centroid of x1 and x3.

Figure 12.5. Reflection of the Vertex with the Highest Function Value

During the optimization, the following two situations may occur.

1.At the current iteration, the vertex with the highest function value is the reflected point generated in the last iteration. In this situation, choose instead the vertex with the next highest function value and generate a reflected point.

2.The iterations can cycle between two or more Simplexes. If a vertex remains in the Simplex for more than M iterations, set up a new Simplex with the lowest point as the base point and reduce the size of the Simplex. Thenumber of the dimension for the variable is N. The value of M can be estimated by M = 1.65N + 0.05N2 and M is rounded to the nearest integer.

The Simplex Search method is terminated when the size of the Simplex is sufficiently small or other termination criteria are met. The vertex with the lowest function value is taken as the minimum.

The Simplex construction and point reflection are illustrated in the following example.

Example: Use the Simplex Search method to minimize the following function.

(12.26)

The number of the dimensions is 2. Each Simplex has 3 vertices. Take x0 = [1, 1]T as the starting point, and set α = 2. The increments, δ1 and δ2, can be evaluated as follows.

(12.27)

(12.28)

The coordinates of the other two vertices are calculated as follows.

(12.29)

(12.30)

The function values at the three points, x0, x1, and x2, are as follows.

(12.31)

(12.32)

(12.33)

Since the vertex, x0, has the highest function value, it is reflected to form a new Simplex. The centroid of the two remaining points, x1 and x2, is calculated as follows.

(12.34)

The reflected point, xreflected, is calculated as follows.

(12.35)

At the point, xreflected, f(xreflected) = 2.3027. The vertex with the highest function value in this new Simplex is x1. The algorithm will continue by reflecting the x1, and the iterations proceed following the reflection rule until the optimal point, [2, 3]T, is found.

Pattern Search Methods

The pattern search methods are a family of derivative-free numerical optimization methods. They can be applied to not only the optimization of continuous differentiable functions, but also to the optimization of discrete or nondifferentiable functions. The pattern search methods look along certain specified directions, and evaluate the objective function at a given step length along each of these directions. These points form a frame around the currentiteration. Depending on whether any point within the pattern has a lower objective function value than the current point, the frame shrinks or expands in the next iteration. The search stops after a minimum pattern size is reached.

An important decision in the pattern search methods is to choose the search direction set, Dk, at each iteration k. A key condition is that at least one direction in this set should give a descent direction for the objective function whenever the current point is not a stationary point. If a search direction, p, satisfies the following inequality, it can be a direction of descent for the objective function.

(12.36)

where δ is a positive constant.

If the search direction, p, satisfies Eq. 12.36 at each iteration, it is a descent direction. Choose the search direction set, Dk, so that at least one direction, pDk, will satisfy cos θ > δ, regardless of the value of ∇fk.

A second condition on Dk is that the lengths of the vectors in this set are all similar. The diameter of the frame formed by this set is captured adequately by the step length. This condition can be expressed as

(12.37)

where βmin and βmax are positive constants.

For the pattern search methods in n-dimensional space, examples of sets Dk that satisfy the above two conditions in Eqs. 12.36 and 12.37 include the coordinate direction set

(12.38)

and the set of n+1 vectors defined by

(12.39)

where e = (1, 1,..., 1)T.

Another important decision in the pattern search method is how to choose a sufficient decrease function ρ(l). The sufficient decrease function is used to determine the convergence of the results. It is a function of the step length and its domain is [0,∞). It is an increasing function of l and the function values are positive. As l approaches 0, the limit of ρ(t)∕t is 0. An appropriate choice of the sufficient decrease function is Mt32, where M is some positiveconstant. If the decrease of the objective function value is less than the value of the sufficient decrease function, the pattern search is converged to the minimum.

At each iteration, the search direction set is Dk and the step length is lk. The frame that consists of the points at the next iteration is xk + lkpk for all pkDk. When one of the points in the frame yields a significant decrease of the objective function, it becomes the next searching point, xk+1, and the next step length, lk+1, is increased. If none of the points in the frame has a significantly smaller function value than fk, the next step length, lk+1, will be reduced.The procedure of the pattern search algorithm is illustrated as follows.

1.Define the sufficient decrease function ρ(l). Choose convergence tolerance ltol, contraction parameter θmax, and search direction set Dk. Choose initial point x0, initial step length l0, and initial direction set D0. Set the iterationnumber, k, to 1.

2.Evaluate the objective function value f(xk).

3.If the step length is lkltol, take xk as the optimal point, x*. Stop.

4.If f(xk+lkpk) < f(xk) - ρ(lk) for some pkDk, set xk+1 as xk+lkpk for some pkDk, and increase the step length for the next iteration.

5.If f(xk+lkpk) ≥ f(xk) - ρ(lk) for all pkDk, use the same point xk as the point xk+1 for the next iteration. Reduce the step length for the next iteration, lk+1. Set lk+1 as θklk, where 0 < θkθmax < 1.

6.Set k = k + 1. Go to Step 2.

MATLAB has the function, patternsearch, that finds the minimum of a function using the pattern search method. It can handle optimization problems with nonlinear, linear, and bound constraints, and does not require functions to be differentiable or continuous. An example is provided below to illustrate how to solve a nonlinear optimization problem without constraints using patternsearch.

Example: Use the pattern search method to solve the following three-dimensional quadratic problem.

(12.40)

(12.41)

(12.42)

The sufficient decrease function is set as ρ(l) = 10l32. The contraction factor is 0.5. The convergence tolerance is set as 0.001. The contraction factor, θmax, is set as 0.5. The initial step length, l0, is set as 0.5. The starting point, x0, is [0, 0, 0]T. The search directions set, Dk, is the coordinate direction set, which is as follows.

(12.43)

At the initial point, x0, the function value, f(x0), is 0. At the first iteration, the function values at x0 +l0pk, (k = 1,..., 6) are 9, 10.5, 10.5, -7, -7.5, and -5.5. The value of sufficient decrease function is as follows.

(12.44)

For the six points x0 + l0pk, three of them (k = 4, 5, 6) satisfy the equation, f(xk+lkpk) < f(xk)-(lk). Set the point, x0 + l0p5, as x1, and increase the step length. The iterations continue. The minimum point is obtained as follows.

(12.45)

The optimal function value f(x*) = -35.9.

This problem can also be solved by MATLAB function patternsearch. The result is the same. The main file is as follows.

clc
clear
X0 = [0; 0; 0];
[x,fopt]=patternsearch(@obj_fun,X0)

The file for the objective function is as follows.

function f = obj_fun(x)
Q = [2 0 0; 0 3 0; 0 0 5];
c = [-8; -9; -8];
f = 0.5*x’*Q*x-c’*x;

12.4.2 First-Order Methods

Steepest Descent

The steepest descent method is a first-order optimization algorithm. In each iteration, it takes the negative direction of the gradient as the descent direction of the objective function. This method takes steps along the descentdirection to find a local minimum. The procedure is as follows.

1.Choose a starting point x0. Set k = 0.

2.Check the conditions of f(xk). If xk is the minimum, stop.

3.Calculate the descent direction, pk, as follows.

(12.46)

4.Evaluate the step length, αk. The step length, αk, should be an acceptable solution to the following minimization problem.

(12.47)

5.Update the value of the variable, xk+1, using the descent direction and the step length.

(12.48)

6.Set k = k+1 and go to Step 2.

The steepest descent method is the simplest Newton-type method for nonlinear optimization. The search direction is the opposite of the gradient. It does not require the computation of second derivatives, and it does not requirematrix storage. The computational cost per iteration is lower compared to other Newton-type methods. However, this method is very inefficient at solving most problems. It converges only at a linear rate. This means it requires more iterations. As a result, although the computational cost per iteration is low, the overall costs are high. However, the steepest descent method is theoretically useful in proving the convergence of other methods; and it provides a lower bound on the performance of other algorithms.

The procedure to solve an unconstrained nonlinear problem using the steepest descent method is illustrated with the following example. Only the first two iterations are illustrated in this example.

Example: Use the steepest descent method to solve the following three-dimensional quadratic problem.

(12.49)

(12.50)

(12.51)

In this problem, the steepest descent direction is

(12.52)

The step length used for exact line search xk+1 = xk + αkpk is

(12.53)

Choose an initial point x0 = (0, 0, 0)T. At x0,

(12.54)

(12.55)

and

(12.56)

Since ∇f(x0) > 0, go to the next iteration. The step length is α0 = 0.1875. The next point is

(12.57)

At point x1, the function value, the gradient, and the norm of the gradient are as follows.

(12.58)

(12.59)

and

(12.60)

The next step length is α1 = 0.1715 and the new point is

(12.61)

At point x2, the function value, the gradient, and the norm of the gradient are as follows.

(12.62)

(12.63)

and

(12.64)

The above are the first two iterations to solve this problem. It takes 36 iterations before the norm of the gradient is less than 0.001. The optimal solution is

(12.65)

The corresponding optimal function value is f(x*) = -0.65.

Conjugate Gradient Methods

In addition to the Gaussian elimination method, iterative methods are also a valuable tool for solving linear equations in the form of Eq. 12.66. One of the many iterative methods is the conjugate gradient method, which is designed to solve Eq. 12.66 when the matrix A is symmetric and positive definite.

(12.66)

Equation 12.67 is an optimization problem that has a quadratic objective function.


(12.67)

The matrix A is symmetric and positive definite. The first-order necessary conditions require the gradient of f(x) to be equal to zero. The gradient can be expressed as Eq. 12.68.

(12.68)

Since the Hessian matrix, ∇2f(x) = A, is positive definite, the sufficient conditions for a local minimum are satisfied. This optimization problem is equivalent to the problem of linear equations as shown by Eq. 12.66.

If a set of vectors, pi, satisfies Eq. 12.69, the vectors are said to be conjugate with respect to the matrix A. Any set of conjugate vectors is also linearly independent.

(12.69)

The residual of Eq. 12.68 is

(12.70)

In Eq. 12.67, if the matrix A is diagonal, the contours of the function f(x) are ellipses. The axes of the ellipses are aligned with coordinate directions. The conjugate directions are along the coordinate directions. Figure 12.6 illustratesthe conjugate directions in a two dimensional space. The minimum of the function can be found by performing successive one-dimensional minimizations along the conjugate directions.

Figure 12.6. Conjugate Directions for a Diagonal Matrix A

In Eq. 12.67, if Matrix A is non-diagonal, the contours of the function f(x) are also ellipses. However, the axes of the ellipses are not aligned with the coordinate directions. The conjugate directions are not along the coordinate directions. Figure 12.7 shows the conjugate directions for a non-diagonal matrix in a two dimensional space.

Figure 12.7. Conjugate Directions for a Non-diagonal Matrix A

The conjugate gradient methods used to solve the quadratic optimization problem (Eq. 12.67) exist in different versions of the algorithm. The steps for one of these versions are illustrated as follows.

1.Set an initial guess x0. The residual is r0 = Ax0 - b. Set a vector with the same dimension of the conjugate vector, p0 = -r0. Specify the convergence tolerance, > 0. Initialize the iteration counter, i = 0.

2.Check the value of the residual ri. If ri < , stop.

3.Set αi = riTr i∕piTAp i.

4.Evaluate the next point, xi+1 = xi + αipi.

5.Evaluate the residual, ri+1 = ri + αiApi.

6.Set βi+1 = ri+1Tr i+1∕riTr i.

7.Set the conjugate vector, pi+1 = -ri+1+βi+1pi.

8.Set i = i + 1. Go to Step 2.

Example: Use the above algorithm for conjugate gradient methods to solve the following three-dimensional quadratic problem.

(12.71)

(12.72)

(12.73)

The convergence tolerance, , is set as 0.01. Choose an initial point x0 = (0, 0, 0)T. At this point, the residual is r0 = (8, 9, 8)T. This point is not optimal. The algorithm performs several iterations in search of the optimal solution.Table 12.4 provides the results for all the iterations.

Table 12.4. The Iterations for the Conjugate Gradient Method

At the initial point x0, r0 = 14.46. In the first iteration, r1 = 5.24. In the second iteration, r2 = 1.22. In the third iteration, r3 = 0. Since r3 ≤ , the point x3 = (-4.00,-3.00,-1.60)T is the optimal solution to this example, and the optimal function value is f(x3) = -35.9.

Different versions of the conjugate gradient method can be applied to quadratic optimization problems and general unconstrained nonlinear optimization problems. The Fletcher-Reeves method extends the conjugate gradient method to nonlinear optimization problems by making two changes. First, the one-dimensional minimum along each conjugate vector is identified by a line search. Second, the residual is replaced by the gradient of the nonlinear function. The steps of the Fletcher-Reeves method are as follows.

1.Set an initial guess x0. At x0, evaluate the function, f(x0), and its gradient, ∇f(x0). Set p0 = -∇f(x0). Specify the convergence tolerance of the residual, > 0. Set the iteration number, i, to 0.

2.Check the norm of the gradient ∇f(xi). If ∇f(xi) < , stop.

3.Use a line search along the direction of the conjugate vector to determine the minimum xi+1 = xi + αipi.

4.Evaluate the gradient ∇f(xi+1).

5.Set βi+1 = ∇f(xi+1)Tf(x i+1)f(xi)Tf(x i).

6.Set the conjugate vector pi+1 = -∇f(xi+1) + βi+1pi.

7.Set i = i + 1. Go to Step 2.

12.4.3 Second-Order Methods

Newton Method

Newton’s method approximates the function f(x) with a quadratic function at each iteration. The approximated quadratic function is minimized exactly, and a descent direction is evaluated.

In each iteration, the function is approximated as follows.

According to the first-order necessary condition, the first derivative of (x) should satisfy ∇ (x) = 0. This condition can be expressed as

(12.75)

Solving the above equation,

(12.76)

Assuming the inverse of the Hessian exists, then

(12.77)

The second term in Eq. 12.77 is used as the reasonable direction to estimate xk+1 in the next iteration.

(12.78)

The descent direction, pk, can be expressed as

(12.79)

If [∇2f(x k)]-1 exists, pk can be evaluated using Eq. 12.79. The other way to evaluate the descent direction, pk, is by solving the following equation.

(12.80)

Equation 12.80 can be solved by Gaussian Elimination, Cholesky Factorization, or other suitable methods.

The expression of the descent direction is simple in Newton’s method. However, Newton’s method has certain shortcomings. (1) Newton’s method does not always find a minimum, and might only find a stationary (not minimum) point. (2) It converges only if one starts sufficiently near an optimal solution. (3) The computational cost to evaluate the inverse of the Hessian is high, especially for high-dimensional optimization problems.

The steps to solve an unconstrained nonlinear problem are illustrated with the following example.

Example: Use Newton’s method to minimize the following function.

(12.81)

The gradient of f(x) is as follows.

(12.82)

The Hessian of f(x) is as follows.

(12.83)

The inverse of ∇2f(x) is as follows

(12.84)

Choose an initial point, x0 = (10, 10)T. The next point, x1, is evaluated as follows.

(12.85)

The point x1 = (0, 0)T is the optimal solution for this example, and the optimal function value is f(x1) = 0.

Quasi-Newton Methods

Quasi-Newton methods are among the most widely used methods for nonlinear optimization. When the Hessian is difficult to compute, quasi-Newton methods are effective for solving these kinds of problems. There are differentquasi-Newton methods, but they are all based on approximations of the Hessian by another matrix to reduce the computational cost. In Sec. 12.4.3, Newton’s method estimates the search direction, pk, that satisfies the following equation.

(12.86)

Quasi-Newton methods use an approximation of the Hessian, Bk, to obtain the search direction, pk. The search direction is obtained by solving

(12.87)

For one-dimensional nonlinear optimization problems, quasi-Newton methods are generalizations of the secant method. The secant method approximates the Hessian using

(12.88)

where f′(xk) can be approximated as

(12.89)

For multidimensional nonlinear optimization problems, the above condition is rewritten as

(12.90)

From the above Eq. 12.90, the condition for the approximation matrix Bk is derived, which can be expressed as

(12.91)

The matrix Bk has n2 entries and the secant condition only has a set of n equations. This condition is insufficient to define Bk uniquely. The matrix Bk does not always have a unique solution for the secant condition.

There are several ways to approximate and update the matrix Bk. Define two vectors, sk and yk, as follows.

sk

= xk+1 - xk

(12.92)

yk

= ∇f(xk+1) -∇f(xk)

(12.93)

As one of the approximation formulae, the symmetric rank-one update formula can be expressed as

(12.94)

The Bk’s updated by Eq. 12.94 are symmetric. However, they are not necessarily positive definite.

As one of the most widely used formulae, the BFGS update formula can be expressed as

(12.95)

Upon selecting the appropriate update formula of Bk, the quasi-Newton algorithm is illustrated as follows.

1.Choose a starting point x0. Choose an initial Hessian approximation B0, which can be the diagonal matrix, I. Set k = 0.

2.If xk is optimal, stop.

3.Determine the search direction, pk, by solving Bkpk = -∇f(xk).

4.Find the step length, αk, by a line search to minimize f(xk+ αkpk).

5.Set xk+1 = xk + αkpk.

6.Update Bk+1 using the selected update formula.

7.Set k = k + 1 and go to Step 2.

The steps to solve an unconstrained nonlinear problem are illustrated with the following example.

Example: Use a quasi-Newton method with the symmetric rank-one Bk update formula to solve the following three-dimensional quadratic problem.

(12.96)

(12.97)

(12.98)

Choose an initial point x0 = (0, 0, 0)T. The initial guess is B0 = I. At this point, x0, ∇f(x0) = -c = 14.4568, so this point is not optimal.

From B0p = -∇f(x0), the first search direction is derived as

(12.99)

The line search formula gives α0 = 0.3025. The new estimated point is

(12.100)

The gradient at the new point x1 is

(12.101)

To update B1, first evaluate s0 and y0.

(12.102)

(12.103)

(12.104)

At this new point x1, ∇f(x1) = 5.2423. It is not optimal. The search direction is

(12.105)

The line search step length is α1 = 0.3471. At the new point, x2, the new estimates of the solution, the gradient, and the Hessian are

(12.106)

(12.107)

(12.108)

At the point x2, ∇f(x2) = 0.8397. It is not optimal. The new search direction is

(12.109)

and the step length α2 = 0.4145. The line search yields the point

(12.110)

At point x3,

(12.111)

The point x3 is the optimal solution to this example, and the optimal function value is f(x3) = -35.9.

12.5 Comparison of Computational Issues in the Algorithms

12.5.1 Rate of Convergence

The optimization algorithms discussed in this chapter are iterative methods, and are used to evaluate a sequence of approximate solutions. The rate of convergence is a measure of efficiency, which describes how quickly the estimates of the solution approach the exact solution. Efficient algorithms reduce the computational cost. Assume a sequence of points xk converges to a solution x*. The error at the kth iteration can be defined as

(12.112)

As the sequence approaches the solution x*, the limit of e k is 0.

If Eq. 12.113 holds, the sequence xk is said to converge to x* with rate r and rate constant C. The constant C < ∞.

(12.113)

For a sequence of errors, when r = 1, the convergence is said to be linear, as Eq. 12.114 indicates. Note that if 0 < C < 1, then the norm of the error is reduced at every iteration. If C > 1, then the sequence diverges.

(12.114)

If r = 1 and C = 0, the rate of the convergence is said to be superlinear. For any r > 1, if

(12.115)

then Eq. 12.116 holds. Any convergence with r > 1 has a superlinear rate of convergence.

(12.116)

When r > 2, the convergence is called quadratic. The rates of convergence for some optimization algorithms in this chapter are listed in Table 12.5.

Table 12.5. Comparison of Optimization Methods for Unconstrained Nonlinear Problems



Method

Converge

Convergence

Computation

Problem

Proof

Rate

Cost

Scale


Bisection

yes

linear

very high

1-dimensional

Golden Section

yes

linear

high

1-dimensional

Quadratic Appro

yes

linear

very high

1-dimensional

Pattern Search

yes

high

small

Steepest Descent

yes

linear

high

medium

Conjugate Grad

yes

linear-quadratic

medium high

large

Newton

no

quadratic

medium - high

small-medium

Quasi-Newton

yes

superlinear

low - medium

small - mid



12.5.2 Line Search Methods

In the above sections, at each iteration, the steepest descent method, the conjugate gradient method, the Newton method and the quasi-Newton methods first evaluate a search direction pk, and then do a line search along the direction to determine an appropriate step length, αk. The iteration is given by

(12.117)

Most line search algorithms require the search direction, pk, to be a descent direction, and it should satisfy

(12.118)

The step length, αk, should generate a substantive reduction of f(x) along the descent direction. The minimum of Eq. 12.119 is chosen as a function of αk.

(12.119)

In general, it is not computationally practical to determine that exact minimum. Some practical strategies perform an inexact line search to identify a step length with adequate reduction in f(x). These strategies reduce computational cost.

A popular inexact line search condition requires that αk give a sufficient decrease in the objective function, f(x), as measured by the following inequality:

(12.120)

where the constant c1 ∈ (0, 1). This inequality is called the Armijo condition.

The sufficient-decrease condition is not, by itself, adequate to ensure that the algorithm makes reasonable progress. To rule out unacceptable short steps, the curvature condition is introduced as the following inequality.

(12.121)

where the constant c2 ∈ (c1, 1). The sufficient-decrease condition and the curvature condition are collectively known as the Wolfe conditions.

12.5.3 Comparison of Different Methods

The optimization methods presented in this chapter include the Bisection method, the Golden Section Search method, the quadratic approximation method, the pattern search methods, the steepest descent method, the conjugategradient methods, the Newton method, and the quasi-Newton methods.

All of these methods have applications in different fields. Unconstrained nonlinear problems can be classified as small, mid, or large scale problems. Computers used to run the codes of the algorithms can have different computational capacity. The appropriate method can be selected, depending on the application, the computation capability, and the storage requirements. Note that computation capacity, in this regard, and storage requirements have become less critical issues in recent years.

Table 12.5 provides a comparison of the different methods. The comparison includes the following properties: convergence guarantee, convergence rate, computation cost, and problem scale. It is important to keep in mind that the overall computational cost is a combination of (i) the number of iterations, and (ii) the cost of each iteration. While a method may converge in fewer iterations, it might require more costly computation at each iteration. For example,the Newton method may often converge with fewer iterations than the Conjugate Gradient method. However, at each iteration, Conjugate Gradient only requires gradient computation, while Newton requires a very costly Hessiancomputation.

The Conjugate Gradient method may exhibit linear to quadratic convergence, depending on the particular implementation. As standalone algorithms, Bisection, Golden Section, and Quadratic Approximation methods are capableof performing only univariate optimization or line search. However, they can be applied as techniques to perform line search along a direction of improvement (direction vector) within the context of multivariate optimization. We note that some of the above comments are somewhat subjective, and that much depends on the particular problem and the computing resources at hand.

This chapter provides the reader with sufficient information to understand the basics of unconstrained nonlinear programming. This background makes it possible to explore more theoretical aspects of the topic [5].

12.6 Summary

Since most engineering design problems tend to be nonlinear in nature, the topic of nonlinear programming (NLP) is of paramount importance in learning optimization. This chapter introduced the basics of unconstrained NLP. The chapter began with a description of the necessary and sufficient conditions for optimality in unconstrained NLP problems. This description was followed by the introduction of the major techniques/algorithms used for solving uni-vartiate and multivariate NLP problems. Specifically, two major classes of uni-variate methods were presented: (i) interval reduction methods and (ii) polynomial approximation methods. This was followed by a description of thethree major classes of multivariate unconstrained NLP methods. These are: the Zeroth order methods (e.g., line search and pattern search), the first order methods (e.g., steepest descent and conjugate gradient), and the second ordermethods (e.g., Newton’s and Quasi Newton methods). The chapter concluded with a comparative discussion of the rate of convergence and the computational cost of these different classes of methods for solving unconstrained NLP problems.

12.7 Problems

Warm-up Problems

12.1Consider the function:

1.Calculate the Hessian matrix of the function.

2.Calculate the Hessian at (i)[1, 1], (ii)[2, 2], and (iii)[0, 0].

3.Use optimality conditions to classify the points in Question (2) as minimum, maximum, or saddle points.

Intermediate Problems

12.2Consider the following two dimensional function:

1.Calculate the gradient vector of the above function by hand.

2.Calculate the gradient at the following points by hand: (i)[1, 1], (ii)[2, 2], and (iii)[0, 0].

3.Write a small program in MATLAB that evaluates the gradient at each point in a two-dimensional grid in the space -5 ≤ x1 ≤ 5. Choose an appropriate grid spacing (at least 100 points).

4.Find a way in MATLAB to plot each gradient vector of Question (3) as a small arrow (pointing in the correct direction) at its x1, x2 location; that is, if you evaluated 1, 000 gradient vectors in Question 3, then your plot should contain 1, 000 arrows. In addition, plot the contours of f on top of the gradient plot.

5.Interpret your plot and comment on the likely location of the minimum point by observing the gradient vectors and the contours of f.

12.3Consider the function:

1.Write a program to estimate the minimum of the function using the steepest descent method. Use the starting point [10, 1].

2.Compare your optimum results with the results of fmincon. How many function evaluations does your code need? How many function evaluations does fmincon need?

3.On the contour plot of the function, plot the sequence of points obtained during the optimization iterations using your steepest descent code.

Advanced Problems

12.4Consider the following function:

We wish to find the minimum of the above function in the given range of x.

1.Using the following listed methods, estimate the minimum of the above function using up to three iterations. Solve Parts (a) through (c) by hand, and please show your complete work.

(a)Interval Halving Method (i.e., Bisection method)

(b)Newton-Raphson Method using x0 = 1 as the starting point

(c)Golden Section Method

(d)Plot the function f(x) using MATLAB, and identify the optimum solution in the range of x. For Parts (a) through (c), calculate the percentage error between the optimum x value and the value of x after three iterations.Present your results in a tabular form, as shown below, for methods in Parts (a), (b), and (c).

2.Write a MATLAB code for the Newton-Raphson method that evaluates the minimum of the function in Problem 1. Use an appropriate stopping criterion.

3.Verify your results using fmincon. Make sure your code gives you the same solution as that given by fmincon. You might have to change your stopping criterion.

4.How many function evaluations does your code need? How many function evaluations does fmincon need? Use the same starting point for your code and fmincon.

Table 12.6. Results Summary for Given Search Methods



Iteration

(a)

(b)

(c)







x

f(x)

x

f(x)

x

f(x)


1

2

3



12.5Consider the following function:

We want to find the minimum of the above function in the given range of x.

1.Using the following methods, estimate the minimum of the above function by implementing up to three iterations. Solve Parts (a) through (c) by hand, and please show your complete work.

(a)Bisection Method

(b)Secant Method

(c)Successive quadratic estimation Present your results in a tabular form, as shown below, for methods in Parts (a), (b), and (c).

2.Write a MATLAB code for the above three methods for the given function. Use an appropriate stopping criterion.

Table 12.7. Results Summary for Given Search Methods



Iteration

(a)

(b)

(c)







x

f(x)

x

f(x)

x

f(x)


1

2

3



BIBLIOGRAPHY OF CHAPTER 12

[1]M. S. Bazaraa, H. D. Sherali, and C. M. Shetty. Nonlinear Programming: Theory and Algorithms. John Wiley and Sons, 3rd edition, 2013.

[2]A. Ruszczynski. Nonlinear Optimization. Princeton University Press, 2011.

[3]B. Zwicknagl and R. Schaback. Interpolation and approximation in taylor spaces. Journal of Approximation Theory, 171:65-83, 2013.

[4]A. R. Conn, K. Scheinberg, and L. N. Vicente. Introduction to Derivative-free Optimization. SIAM, 2009.

[5]R. J. Vanderbei. Linear Programing: Foundations and Extensions. Springer, 2014.