8  Some Theory

In this module we have concentrated on experimentation and discovery rather than on theoretical development. However, there are some theoretical concepts that are so central to Numerical Analysis that they deserve to be presented in a more structured way. In this section we introduce some of these concepts.

8.1 Order of Convergence of Iterative Methods

We have met several methods that used fixed-point iteration, in particular Newton’s method for finding roots and the gradient descent method for finding minima. In each case we made log-log plots of the absolute error at the (n+1)st iteration against the absolute error at the n-th iteration. From the slopes of the resulting graphs we were able to determine the order of convergence of the method.

The fixed point iteration takes the form \[ x_{n+1}=g(x_n). \tag{8.1}\] When \(g\) is suitably chosen, the method will converge to a fixed point \(x^*\) of \(g\) (such that \(g(x^*)=x^*\)) and this fixed point is the desired solution to the problem.

The absolute error at the n-th iteration is \[ E_n=|x_n-x^*|. \tag{8.2}\] We are interested in how \(E_{n+1}\) is related to \(E_n\). We use the following Taylor expansion of \(g\) around \(x^*\): \[\begin{split} x_{n+1}=g(x_n)=g(x^*)&+g'(x^*)(x_n-x^*)\\ &+\frac{g''(x^*)}{2}(x_n-x^*)^2+\cdots\\ &+\frac{g^{(p)}(\xi)}{p!}(x_n-x^*)^p \end{split} \tag{8.3}\] for some \(\xi\) between \(x^*\) and \(x_n\).

8.1.1 Linear convergence

In the case where \(g'(x^*)\neq 0\), this shows that \[ \begin{split} E_{n+1}&=|x_{n+1}-x^*|=|g'(x^*)||x_n-x^*|+\mathcal{O}((x_n-x^*)^2)\\ &=|g'(x^*)|E_n+\mathcal{O}((x_n-x^*)^2). \end{split} \tag{8.4}\] So as \(x_n\to x^*\), \[ \frac{E_{n+1}}{E_n}\to |g'(x^*)|. \tag{8.5}\] The error decreases by a factor of \(|g'(x^*)|\) at each iteration in the long run. If \(|g'(x^*)|<1\) then the method is said to have linear convergence or first-order convergence.

Example 8.1 (Order of convergence of gradient descent) For the gradient descent method we have \[ g(x)=x-\alpha f'(x), \tag{8.6}\] where \(\alpha\) is the learning rate. The fixed point of this method is a minimum of \(f\). The derivative of \(g\) is \[ g'(x)=1-\alpha f''(x). \tag{8.7}\] Because this is generically non-zero at the minimum, the method is only first-order convergent. The learning rate \(\alpha\) must be chosen so that \(|g'(x^*)|=|1-\alpha f''(x^*)|<1\) to ensure convergence.
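As a quick numerical check, here is a minimal Python sketch; the objective \(f(x)=(x-2)^2\) and the learning rate \(\alpha=0.1\) are our own illustrative choices, giving \(|g'(x^*)|=|1-\alpha f''(x^*)|=0.8\).

```python
# Gradient descent on f(x) = (x - 2)**2, so x* = 2 and f''(x*) = 2.
fprime = lambda x: 2 * (x - 2)
alpha = 0.1                                # learning rate
x, x_star = 0.0, 2.0

for n in range(10):
    E_old = abs(x - x_star)
    x = x - alpha * fprime(x)              # fixed-point step g(x) = x - alpha f'(x)
    E_new = abs(x - x_star)
    print(f"n={n}  E_(n+1)/E_n = {E_new / E_old:.6f}")
# Every ratio equals |g'(x*)| = |1 - alpha f''(x*)| = 0.8: linear convergence.
```

Because this particular \(g\) is linear, the ratio is exactly \(0.8\) at every step; for a general \(f\) it would only approach this value in the limit.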

8.1.2 Quadratic convergence

In the case where \(g'(x^*)=0\) but \(g''(x^*)\neq 0\), we get from the Taylor expansion in Eq. 8.3 that \[ \begin{split} E_{n+1}&=\frac{|g''(x^*)|}{2}|x_n-x^*|^2+\mathcal{O}((x_n-x^*)^3)\\ &=\frac{|g''(x^*)|}{2}E_n^2+\mathcal{O}((x_n-x^*)^3). \end{split} \] So in the limit as \(x_n\to x^*\), \[ \frac{E_{n+1}}{E_n^2}\to \frac{|g''(x^*)|}{2}. \]

Example 8.2 (Order of convergence of Newton’s method) For Newton’s method we have \[ g(x)=x-\frac{f(x)}{f'(x)}. \tag{8.8}\] The fixed point of this method is a root of \(f\). The derivative of \(g\) is \[ g'(x)=1-\frac{f'(x)f'(x)-f(x)f''(x)}{f'(x)^2}=\frac{f(x)f''(x)}{f'(x)^2}. \tag{8.9}\] At the fixed point \(x^*\) we have \(f(x^*)=0\), so \(g'(x^*)=0\). The second derivative of \(g\) is \[ g''(x)=\frac{f'(x)^2f''(x)+f(x)f'(x)f'''(x)-2f(x)f''(x)^2}{f'(x)^3}. \tag{8.10}\] At the fixed point this simplifies to \(g''(x^*)=f''(x^*)/f'(x^*)\), which generally does not vanish. Therefore Newton’s method is second-order convergent, provided \(f'(x^*)\neq 0\).
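To see the quadratic rate in action, here is a small sketch; the function \(f(x)=x^2-2\) with root \(x^*=\sqrt{2}\) is our own illustrative choice. The ratio \(E_{n+1}/E_n^2\) should approach \(|g''(x^*)|/2=|f''(x^*)/(2f'(x^*))|=1/(2\sqrt{2})\approx 0.354\).

```python
import numpy as np

# Newton's method for the root x* = sqrt(2) of f(x) = x**2 - 2.
f  = lambda x: x**2 - 2
fp = lambda x: 2 * x
x, x_star = 2.0, np.sqrt(2)

for n in range(4):                         # more steps would hit machine precision
    E_old = abs(x - x_star)
    x = x - f(x) / fp(x)                   # Newton step
    E_new = abs(x - x_star)
    print(f"n={n}  E_(n+1)/E_n^2 = {E_new / E_old**2:.4f}")
# The ratios approach 1/(2 sqrt(2)) ≈ 0.3536.
```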

8.1.3 p-th order convergence

In general, if \(g'(x^*)=\cdots=g^{(p-1)}(x^*)=0\) but \(g^{(p)}(x^*)\neq 0\), then the method is said to have p-th order convergence. The error at the (n+1)st iteration is then \[ E_{n+1}=\frac{|g^{(p)}(x^*)|}{p!}E_n^p+\mathcal{O}((x_n-x^*)^{p+1}). \] So in the limit as \(x_n\to x^*\), \[ \frac{E_{n+1}}{E_n^p}\to \frac{|g^{(p)}(x^*)|}{p!}. \] More generally, if \[ \lim_{n\to\infty}\frac{E_{n+1}}{E_n^p}=c \] for some constant \(c>0\) then the method is said to have p-th order convergence.

To make the connection with our log-log error plots, we observe that \[ \log E_{n+1}=\log\left(\frac{E_{n+1}}{E_n^p}\right)+p\log E_n. \] If the method has p-th order convergence then \(E_{n+1}/E_n^p\to c\), so the slope of the log-log plot of the error at the (n+1)st iteration against the error at the n-th iteration approaches \(p\) and the intercept approaches \(\log c\).
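We can check this numerically; the sketch below regenerates the errors of the Newton iteration from Example 8.2 and fits a straight line to the log-log data with `np.polyfit` (our choice of fitting routine).

```python
import numpy as np

# Collect the Newton errors for f(x) = x**2 - 2, then fit log E_(n+1) vs log E_n.
x, x_star, errors = 2.0, np.sqrt(2), []
for n in range(5):
    errors.append(abs(x - x_star))
    x = x - (x**2 - 2) / (2 * x)           # Newton step

logE = np.log(np.array(errors))
slope, intercept = np.polyfit(logE[:-1], logE[1:], 1)
print(f"slope p = {slope:.2f}, intercept log c = {intercept:.2f}")
# The slope comes out close to 2, confirming second-order convergence.
```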

There is a recording of my discussion of this material in a class session.

8.2 Truncation Errors

Many of the approximation methods in Numerical Analysis are based on Taylor series expansions and then dropping higher-order terms in the series. The error that is introduced by truncating a Taylor series in this way is called the truncation error.

Let us recall Taylor’s formula for a function \(f\) that is \(p\) times differentiable, expanded around a point \(x_0\): \[\begin{multline} f(x)=f(x_0)+f'(x_0)(x-x_0)+\frac{f''(x_0)}{2!}(x-x_0)^2+\cdots\\+\frac{f^{(p-1)}(x_0)}{(p-1)!}(x-x_0)^{p-1}+\frac{f^{(p)}(\xi)}{p!}(x-x_0)^{p} \end{multline}\] for some \(\xi\) between \(x\) and \(x_0\). The error introduced by truncating the series after the \(p\)-th term is \[ |R_p(x)|=\left|\frac{f^{(p)}(\xi)}{p!}(x-x_0)^{p}\right|. \] We first discussed this in Section 2.4.
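As a quick sanity check, the sketch below truncates the Taylor series of \(e^x\) about \(x_0=0\) after the \(p\)-th term; the function and the values \(x=0.5\), \(p=4\) are our own illustrative choices. Since \(f^{(p)}=\exp\) is increasing, the remainder formula brackets the actual error between its values at \(\xi=0\) and \(\xi=x\).

```python
import numpy as np
from math import factorial

x, p = 0.5, 4
taylor = sum(x**k / factorial(k) for k in range(p))    # terms k = 0, ..., p-1
actual = abs(np.exp(x) - taylor)                       # actual truncation error
lower = np.exp(0.0) / factorial(p) * x**p              # remainder with xi = 0
upper = np.exp(x)   / factorial(p) * x**p              # remainder with xi = x
print(f"{lower:.6f} <= {actual:.6f} <= {upper:.6f}")   # error lies in between
```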

8.2.1 Truncation error in finite-difference formulae

In the finite-difference formulae for derivatives that we discussed in Section 4.2, we approximated the derivative of a function \(f\) at a point \(x\) by a linear combination of function values at points close to \(x\). For example, the forward difference formula for the first derivative is \[\begin{split} f'(x)&\approx \frac{f(x+h)-f(x)}{h}\\ &=\frac{f(x)+hf'(x)+\frac{h^2}{2}f''(\xi)-f(x)}{h}\\ &=f'(x)+\frac{h}{2}f''(\xi) \end{split}\] for some \(\xi\) between \(x\) and \(x+h\), where we used Taylor’s formula with \(p=2\). We see that the truncation error is proportional to the stepsize \(h\). We say that the truncation error is of order \(h\) and that this forward-difference approximation is a first-order approximation. This means that in order to halve the size of the error we need to halve the stepsize.
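You can confirm this behaviour numerically; in the sketch below (with the illustrative choice \(f=\sin\) at \(x=1\)) halving \(h\) halves the error, and \(\text{error}/h\) stays close to \(|f''(\xi)|/2\approx\sin(1)/2\approx 0.42\).

```python
import numpy as np

# Forward-difference approximation of f'(1) = cos(1) for f = sin.
f, x = np.sin, 1.0
exact = np.cos(x)
for h in [0.1 / 2**k for k in range(5)]:
    err = abs((f(x + h) - f(x)) / h - exact)
    print(f"h={h:.5f}  error={err:.2e}  error/h={err/h:.4f}")
# error/h is roughly constant: the truncation error is of order h.
```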

You can now do a similar analysis for the other finite-difference formulae that we discussed in Section 4.2.

8.2.2 Truncation error in numerical integration

In numerical integration we approximate the integral of a function \(f\) over an interval \([a,b]\) by splitting the interval into many subintervals. So we introduce a grid of points \(x_j=x_0+jh\) for \(j=0,\ldots,N\) with \(x_0=a\) and \(x_N=b\) and \(h=(b-a)/N\) and write the integral as a sum \[ \int_a^b f(x)dx=\sum_{j=1}^{N}\int_{x_{j-1}}^{x_{j}}f(x)dx. \]

The simplest numerical integration method is the Riemann sum, where on each subinterval we approximate the function by its value at one endpoint of the subinterval. So for the right Riemann sum we have \[ I_j=\int_{x_{j-1}}^{x_{j}}f(x)dx\approx \int_{x_{j-1}}^{x_{j}}f(x_j)dx=f(x_j)h=A_j. \] The local truncation error is defined as the absolute error made in each subinterval, divided by the width of the subinterval: \[ \tau_j=\frac{1}{h}|I_j-A_j|. \] For the right Riemann sum we have \[\begin{split} I_j-A_j&=\int_{x_{j-1}}^{x_{j}}\left(f(x)-f(x_j)\right)dx\\ &=\int_{x_{j-1}}^{x_{j}}\left(f(x_j)+f'(\xi_j)(x-x_j)-f(x_j)\right)dx\\ &=f'(\xi_j)\int_{x_{j-1}}^{x_{j}}(x-x_j)dx\\ &=-f'(\xi_j)\frac{h^2}{2} \end{split}\] for some \(\xi_j\) between \(x_{j-1}\) and \(x_j\). So the local truncation error is \[ \tau_j=\frac{1}{h}\left|f'(\xi_j)\frac{h^2}{2}\right|=\frac{h}{2}|f'(\xi_j)|. \] We see that the local truncation error is of order \(h\) and that the right Riemann sum is a first-order approximation. The reason for dividing by \(h\) in the definition of the local truncation error is that the power of \(h\) in the local truncation error is then the same as in the global truncation error \(\tau\), which is obtained by summing the errors from all subintervals. For the right Riemann sum we have \[\begin{split} \tau&=\left|\sum_{j=1}^{N}(I_j-A_j)\right| =\frac{h^2}{2}\left|\sum_{j=1}^{N}f'(\xi_j)\right|\\ &=\frac{h^2}{2}N|f'(\xi)| =\frac{(b-a)h}{2}|f'(\xi)| \end{split}\] for some \(\xi\) between \(a\) and \(b\), where we used the intermediate value theorem to replace the average of the \(f'(\xi_j)\) by \(f'(\xi)\), and \(Nh=b-a\). So the global truncation error is also of order \(h\).
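The first-order behaviour is easy to confirm numerically; the sketch below uses the illustrative choice \(f(x)=e^x\) on \([0,1]\), for which \(\text{error}/h\) stays close to \((b-a)|f'(\xi)|/2\).

```python
import numpy as np

# Right Riemann sum for the integral of exp over [0, 1]; exact value is e - 1.
f, a, b = np.exp, 0.0, 1.0
exact = np.exp(1) - 1
for N in [10, 20, 40, 80]:
    h = (b - a) / N
    x = a + h * np.arange(1, N + 1)        # right endpoints x_1, ..., x_N
    err = abs(h * np.sum(f(x)) - exact)
    print(f"N={N:3d}  h={h:.4f}  error={err:.2e}  error/h={err/h:.4f}")
# Halving h halves the error: the right Riemann sum is first-order accurate.
```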

You can now do a similar analysis for the trapezoidal rule and Simpson’s rule.

There is a recording of my discussion of this material in a class session.

8.3 Stiff Equations and Stability

8.3.1 Stiff equations

A differential equation is called “stiff” if its exact solution contains a term that decays very rapidly. This is a problem for numerical methods because the step size must be very small to resolve the rapid decay, and it must stay that small over the whole integration interval, not just while the rapidly decaying part is noticeable. This can make the solution very slow to compute.

An example of such a rapidly decaying term would be a term of the form \(e^{-\gamma t}\) with \(\gamma\) large. As an example consider the ODE \[ x'=-100x+\sin t. \] This is similar to equations we solved in Chapter 6, just with a larger negative coefficient in front of \(x\). The solution to this equation is \[ x(t)=Ce^{-100t}+\frac{100}{10001}\sin t-\frac{1}{10001}\cos t. \] The term \(e^{-100t}\) decays very rapidly. We refer to such a term in the solution as a “transient” because it becomes irrelevant after a short time.

8.3.2 Stability of solution methods

To test the stability of a method we apply it to a simple ODE with a known solution and compare the numerical solution to the exact solution. If the method is stable for the chosen step size, the numerical solution reproduces the qualitative behaviour of the exact solution. If it is unstable, the numerical solution grows without bound even though the exact solution decays.

The standard test equation used for studying the stability of a numerical method is the linear ODE \[ x'=\lambda x. \] The exact solution with initial value \(x(0)=x_0\) is \(x(t)=x_0e^{\lambda t}\). So for negative \(\lambda\) the solution decays, and for positive \(\lambda\) the solution grows.

Example 8.3 (Stability of Euler’s method) Euler’s method uses the formula \(x_{n+1}=x_n+h f(t_n,x_n)\). Applying this to the test equation \(x'=\lambda x\) gives \[ x_{n+1}=x_n+h\lambda x_n=(1+h\lambda)x_n. \] So \[ x_{n+1}=(1+h\lambda)x_n=(1+h\lambda)^2x_{n-1}=\cdots=(1+h\lambda)^{n+1}x_0. \] Because we know the exact solution, we can calculate the absolute error that Euler’s method produces: \[ E_n:=|x_n-x(t_n)|=|x_0(1+h\lambda)^n-x_0e^{\lambda t_n}|. \] In the case where \(\lambda<0\) the exact solution decays to zero as \(n\to\infty\), so the error is determined by the behaviour of the first term: \[ \begin{split} \lim_{n\to\infty}E_n&=\lim_{n\to\infty}|x_0(1+h\lambda)^n|\\ &=\begin{cases} 0&\text{if }|1+h\lambda|<1\\ \infty&\text{if }|1+h\lambda|>1. \end{cases} \end{split} \] The error grows exponentially unless \(|1+h\lambda|<1\). This is the stability condition for Euler’s method. It requires us to choose a step size \[ h<\frac{2}{|\lambda|}. \]
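A small sketch makes the threshold visible; \(\lambda=-10\) is our illustrative choice, so the stability bound is \(h<2/|\lambda|=0.2\).

```python
# Euler's method on the test equation x' = lambda x with lambda = -10.
lam, x0 = -10.0, 1.0
for h in [0.05, 0.19, 0.21, 0.3]:
    x = x0
    for n in range(50):
        x = (1 + h * lam) * x              # one Euler step
    print(f"h={h:.2f}  |1+h*lam|={abs(1 + h * lam):.2f}  x_50={x:.3e}")
# For h < 0.2 the iterates decay like the exact solution; for h > 0.2 they
# blow up exponentially even though the exact solution decays to zero.
```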

In general, for a method that, when applied to the test equation \(x'=\lambda x\), gives \[ x_{n+1}=Q(h\lambda)x_n, \] the stability condition is that \[|Q(h\lambda)|<1.\] Values of \(z:=h\lambda\) for which \(|Q(z)|<1\) form the **region of absolute stability** of the method. This is a region in the complex plane because, while the stepsize \(h\) is of course always real, in applications the transient term may be oscillatory, corresponding to a complex \(\lambda\).

Example 8.4 (Stability of the midpoint method) The midpoint method uses the formula \[ x_{n+1}=x_n+hf\left(t_n+\frac{h}{2},x_n+\frac{h}{2}f(t_n,x_n)\right). \] Applying this to the test equation \(x'=\lambda x\) gives \[ x_{n+1}=x_n+h\lambda \left(x_n+\frac{h}{2}\lambda x_n\right) =x_n\left(1+h\lambda+\frac{(h\lambda)^2}{2}\right). \]

The region of absolute stability is the set of \(z\in\mathbb{C}\) for which \[ \left|1+z+\frac{z^2}{2}\right|<1. \]
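One way to visualise this region is to sample \(|Q(z)|\) on a grid in the complex plane, as in the following sketch; for instance, the intersection of the region with the real axis is the interval \((-2,0)\).

```python
import numpy as np

# Sample |Q(z)| = |1 + z + z**2/2| on a grid covering part of the complex plane.
re = np.linspace(-3, 1, 81)
im = np.linspace(-2, 2, 81)
Z = re[None, :] + 1j * im[:, None]
stable = np.abs(1 + Z + Z**2 / 2) < 1      # True inside the stability region
row = stable[im.size // 2]                 # the grid row with Im z = 0
print(re[row].min(), re[row].max())        # ≈ -2 and ≈ 0: the interval (-2, 0)
```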

8.3.3 Implicit methods

Implicit methods tend to have a larger region of absolute stability, often including the entire left half plane, in which case they are called “unconditionally stable”. Such methods are particularly useful for stiff equations because they remain stable without any restriction on the step size.

We demonstrate this in the case of the backward Euler method, which uses the formula \[ x_{n+1}=x_n+hf(t_{n+1},x_{n+1}). \] Applying this to the test equation \(x'=\lambda x\) gives \[ x_{n+1}=x_n+h\lambda x_{n+1}. \] So \[ x_{n+1}=\frac{x_n}{1-h\lambda}. \] The region of absolute stability is the set of \(z\in\mathbb{C}\) for which \[ \left|\frac{1}{1-z}\right|<1, \] or, equivalently, \[ |z-1|>1. \] This is almost the entire complex plane, except for the disc of radius 1 centred at the point \(z=1\). In particular it contains the entire left half plane, so stiff equations do not require any particular choice of step size for the backward Euler method to be stable.
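The contrast with the explicit Euler method is easy to demonstrate; in this sketch we take \(\lambda=-100\) and \(h=0.05\), far above the explicit stability bound \(2/|\lambda|=0.02\).

```python
# Forward vs. backward Euler on the stiff test equation x' = -100 x, x(0) = 1.
lam, h, steps = -100.0, 0.05, 20
xf = xb = 1.0
for n in range(steps):
    xf = (1 + h * lam) * xf                # forward Euler: factor -4, blows up
    xb = xb / (1 - h * lam)                # backward Euler: factor 1/6, decays
print(f"forward Euler : {xf:.3e}")         # about 1e+12, unstable
print(f"backward Euler: {xb:.3e}")         # about 3e-16, stable
```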

8.4 Stability of finite-difference methods for the heat equation

While exploring the explicit finite-difference method for solving the 1d heat equation, we encountered the stability condition \[ a=\frac{D\Delta t}{(\Delta x)^2}\leq \frac{1}{2}. \] We now want to understand where this condition comes from.

We start by setting up the notation for solving the heat equation \(u_t=Du_{xx}\) using the explicit finite-difference method. We discretize the spatial variable \(x\) into \(N+1\) points \(x_0,\ldots,x_N\) with \(\Delta x=(x_N-x_0)/N\). We discretize the time variable \(t\) into \(M+1\) points \(t_0,\ldots,t_M\) with \(\Delta t=(t_M-t_0)/M\). We denote the approximation of \(u(t_n,x_i)\) by \(U_i^n\). The initial condition sets \(U_i^0=u(t_0,x_i)\). We work with homogeneous Dirichlet boundary conditions, so \(U_0^n=U_N^n=0\) for all \(n\).

The approximations at the remaining points are then calculated by the formula \[ U_i^{n+1}=U_i^n+a(U_{i+1}^n-2U_i^n+U_{i-1}^n) \] for \(i=1,\ldots,N-1\). You derived this in Section 7.2.1 using the finite-difference formulae for the derivatives from Section 4.2. We rewrite this in matrix notation: \[ \begin{pmatrix}U_1^{n+1}\\\vdots\\U_{N-1}^{n+1}\end{pmatrix}=\begin{pmatrix}1-2a&a&&\\ a&1-2a&a&\\ &\ddots&\ddots&\ddots\\ &&a&1-2a\end{pmatrix}\begin{pmatrix}U_1^{n}\\\vdots\\U_{N-1}^{n}\end{pmatrix}. \] The matrix is tridiagonal, with \(1-2a\) on the diagonal and \(a\) on the two off-diagonals. We denote the matrix by \(A\) and the two vectors by \(\mathbf{U}_{n+1}\) and \(\mathbf{U}_{n}\) respectively. So we have the formula \[ \mathbf{U}_{n+1}=A\mathbf{U}_n. \] The solution at time \(t_n\) is then given by \[ \mathbf{U}_n=A^n\mathbf{U}_0. \]
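A minimal implementation of this scheme might look as follows; the grid sizes, the initial condition \(\sin(\pi x)\) and the corresponding exact solution \(e^{-\pi^2 Dt}\sin(\pi x)\) are our own illustrative choices.

```python
import numpy as np

# Explicit scheme U^(n+1) = A U^n for u_t = D u_xx with Dirichlet conditions.
D, N, M = 1.0, 20, 400
dx, dt = 1.0 / N, 0.001
a = D * dt / dx**2                         # a = 0.4 <= 1/2, so the scheme is stable

x = np.linspace(0, 1, N + 1)
U = np.sin(np.pi * x)                      # initial condition, zero at both ends
for n in range(M):
    U[1:-1] += a * (U[2:] - 2 * U[1:-1] + U[:-2])   # one time step

# Compare with the exact solution exp(-pi^2 D t) sin(pi x) at t = M * dt:
exact = np.exp(-np.pi**2 * D * M * dt) * np.sin(np.pi * x)
print(np.max(np.abs(U - exact)))           # small: the scheme tracks the solution
```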

We want to understand how errors evolve over time. If errors grow exponentially over time then we call the method unstable and the method is not useful.

Let us assume that at some step, for convenience let us choose step \(0\), an error \(\mathbf{\epsilon}\) is introduced: \[ \tilde{\mathbf{U}}_0=\mathbf{U}_0+\mathbf{\epsilon}. \] Then after \(n\) steps we have \[ \tilde{\mathbf{U}}_n=A^n\tilde{\mathbf{U}}_0=A^n(\mathbf{U}_0+\mathbf{\epsilon})=A^n\mathbf{U}_0+A^n\mathbf{\epsilon}=\mathbf{U}_n+A^n\mathbf{\epsilon}. \] So the error at time \(t_n\) is \[ \mathbf{\epsilon}^n=A^n\mathbf{\epsilon}. \] To see if the error grows exponentially over time, we expand the initial error in terms of the eigenvectors of the matrix \(A\): \[ \mathbf{\epsilon}=\sum_i\epsilon_i\mathbf{v}_i, \] where \[ A\mathbf{v}_i=\lambda_i\mathbf{v}_i \] and the sum is over all eigenvectors \(\mathbf{v}_i\) of \(A\). (Because \(A\) is symmetric, its eigenvectors form a basis, so this expansion is always possible.) Then \[ \mathbf{\epsilon}^n=\sum_i\epsilon_i\lambda_i^n\mathbf{v}_i. \] This shows that if all eigenvalues have absolute value less than 1, then the method is stable. If at least one eigenvalue has absolute value greater than 1, then the corresponding component of the error will grow exponentially with time and the method is unstable.

The eigenvalues of the matrix \(A\) are the roots of the characteristic polynomial \[ \det(A-\lambda I)=0. \]

There is a nice method to determine the eigenvalues of the matrix \(A\), using Fourier analysis. We will not discuss this in this module and instead just look at a simple example.

Example 8.5 Consider the heat equation on the spatial domain \(x\in[0,1]\) and divide this into three subintervals, so that our spatial grid consists of \(x_0=0,x_1=1/3,x_2=2/3\) and \(x_3=1\). The matrix \(A\) is then \[ A=\begin{pmatrix}1-2a&a\\ a&1-2a \end{pmatrix}. \] The characteristic polynomial is \[ \begin{split} \det(A-\lambda I)&=\begin{vmatrix}1-2a-\lambda&a\\ a&1-2a-\lambda \end{vmatrix}\\ &=(1-2a-\lambda)^2-a^2\\ &=\lambda^2-2(1-2a)\lambda+(1-2a)^2-a^2. \end{split} \] The roots of this polynomial are \[ \lambda_\pm=1-2a\pm a. \] We need both of these to have magnitude at most 1 for the method to be stable. This gives us an upper bound on the allowed \(a\). The eigenvalue whose magnitude first exceeds \(1\) as \(a\) increases is \(\lambda_-=1-3a\), which has magnitude \(1\) when \(a=2/3\). So the stability condition is \(a\leq 2/3\).

In case you are wondering why this does not agree with the stability condition \(a\leq 1/2\) that we found experimentally in Section 7.2.1: the Fourier method mentioned above shows that the eigenvalues of the \((N-1)\times(N-1)\) matrix \(A\) are \(\lambda_k=1-4a\sin^2\left(\frac{k\pi}{2N}\right)\) for \(k=1,\ldots,N-1\). As the dimension of \(A\) increases, the smallest eigenvalue approaches \(1-4a\), and requiring \(|1-4a|\leq 1\) gives \(a\leq 1/2\). So the stability condition becomes more restrictive for larger matrices, approaching \(a\leq 1/2\) as the grid is refined.
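You can check this numerically; the sketch below builds \(A\) for increasing \(N\), computes the largest eigenvalue magnitude with `np.linalg.eigvalsh`, and compares with the critical value of \(a\) obtained from the closed-form eigenvalues quoted above.

```python
import numpy as np

def max_abs_eig(a, N):
    """Largest |eigenvalue| of the (N-1)x(N-1) iteration matrix A."""
    A = (np.diag((1 - 2 * a) * np.ones(N - 1))
         + np.diag(a * np.ones(N - 2), 1) + np.diag(a * np.ones(N - 2), -1))
    return np.max(np.abs(np.linalg.eigvalsh(A)))

for N in [3, 5, 10, 50]:
    # Largest stable a, from the eigenvalue formula 1 - 4a sin^2(k pi / (2N)):
    crit = 0.5 / np.sin((N - 1) * np.pi / (2 * N))**2
    print(f"N={N:3d}  critical a={crit:.4f}  max|eig| at a=0.55: "
          f"{max_abs_eig(0.55, N):.4f}")
# For N = 3 the critical a is 2/3; it decreases towards 1/2 as N grows, so
# a = 0.55 is stable for small N but unstable (max|eig| > 1) for larger N.
```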