Add an article on interfacing with automatic differentiation. Doing this also necessitated some re-organization of the derivatives article into chapters and some minor edits. Change-Id: Ic08e83af138817173caa80a52a9e72707cd57512
diff --git a/docs/source/analytical_derivatives.rst b/docs/source/analytical_derivatives.rst new file mode 100644 index 0000000..cfd5028 --- /dev/null +++ b/docs/source/analytical_derivatives.rst
@@ -0,0 +1,192 @@ +.. default-domain:: cpp + +.. cpp:namespace:: ceres + +.. _chapter-analytical_derivatives: + +==================== +Analytic Derivatives +==================== + +Consider the problem of fitting the following curve (`Rat43 +<http://www.itl.nist.gov/div898/strd/nls/data/ratkowsky3.shtml>`_) to +data: + +.. math:: + y = \frac{b_1}{(1+e^{b_2-b_3x})^{1/b_4}} + +That is, given some data :math:`\{x_i, y_i\},\ \forall i=1,... ,n`, +determine parameters :math:`b_1, b_2, b_3` and :math:`b_4` that best +fit this data. + +Which can be stated as the problem of finding the +values of :math:`b_1, b_2, b_3` and :math:`b_4` are the ones that +minimize the following objective function [#f1]_: + +.. math:: + \begin{align} + E(b_1, b_2, b_3, b_4) + &= \sum_i f^2(b_1, b_2, b_3, b_4 ; x_i, y_i)\\ + &= \sum_i \left(\frac{b_1}{(1+e^{b_2-b_3x_i})^{1/b_4}} - y_i\right)^2\\ + \end{align} + +To solve this problem using Ceres Solver, we need to define a +:class:`CostFunction` that computes the residual :math:`f` for a given +:math:`x` and :math:`y` and its derivatives with respect to +:math:`b_1, b_2, b_3` and :math:`b_4`. + +Using elementary differential calculus, we can see that: + +.. math:: + \begin{align} + D_1 f(b_1, b_2, b_3, b_4; x,y) &= \frac{1}{(1+e^{b_2-b_3x})^{1/b_4}}\\ + D_2 f(b_1, b_2, b_3, b_4; x,y) &= + \frac{-b_1e^{b_2-b_3x}}{b_4(1+e^{b_2-b_3x})^{1/b_4 + 1}} \\ + D_3 f(b_1, b_2, b_3, b_4; x,y) &= + \frac{b_1xe^{b_2-b_3x}}{b_4(1+e^{b_2-b_3x})^{1/b_4 + 1}} \\ + D_4 f(b_1, b_2, b_3, b_4; x,y) & = \frac{b_1 \log\left(1+e^{b_2-b_3x}\right) }{b_4^2(1+e^{b_2-b_3x})^{1/b_4}} + \end{align} + +With these derivatives in hand, we can now implement the +:class:`CostFunction` as: + +.. code-block:: c++ + + class Rat43Analytic : public SizedCostFunction<1,4> { + public: + Rat43Analytic(const double x, const double y) : x_(x), y_(y) {} + virtual ~Rat43Analytic() {} + virtual bool Evaluate(double const* const* parameters, + double* residuals, + double** jacobians) const { + const double b1 = parameters[0][0]; + const double b2 = parameters[0][1]; + const double b3 = parameters[0][2]; + const double b4 = parameters[0][3]; + + residuals[0] = b1 * pow(1 + exp(b2 - b3 * x_), -1.0 / b4) - y_; + + if (!jacobians) return true; + double* jacobian = jacobians[0]; + if (!jacobian) return true; + + jacobian[0] = pow(1 + exp(b2 - b3 * x_), -1.0 / b4); + jacobian[1] = -b1 * exp(b2 - b3 * x_) * + pow(1 + exp(b2 - b3 * x_), -1.0 / b4 - 1) / b4; + jacobian[2] = x_ * b1 * exp(b2 - b3 * x_) * + pow(1 + exp(b2 - b3 * x_), -1.0 / b4 - 1) / b4; + jacobian[3] = b1 * log(1 + exp(b2 - b3 * x_)) * + pow(1 + exp(b2 - b3 * x_), -1.0 / b4) / (b4 * b4); + return true; + } + + private: + const double x_; + const double y_; + }; + +This is tedious code, hard to read and with a lot of +redundancy. So in practice we will cache some sub-expressions to +improve its efficiency, which would give us something like: + +.. code-block:: c++ + + class Rat43AnalyticOptimized : public SizedCostFunction<1,4> { + public: + Rat43AnalyticOptimized(const double x, const double y) : x_(x), y_(y) {} + virtual ~Rat43AnalyticOptimized() {} + virtual bool Evaluate(double const* const* parameters, + double* residuals, + double** jacobians) const { + const double b1 = parameters[0][0]; + const double b2 = parameters[0][1]; + const double b3 = parameters[0][2]; + const double b4 = parameters[0][3]; + + const double t1 = exp(b2 - b3 * x_); + const double t2 = 1 + t1; + const double t3 = pow(t2, -1.0 / b4); + residuals[0] = b1 * t3 - y_; + + if (!jacobians) return true; + double* jacobian = jacobians[0]; + if (!jacobian) return true; + + const double t4 = pow(t2, -1.0 / b4 - 1); + jacobian[0] = t3; + jacobian[1] = -b1 * t1 * t4 / b4; + jacobian[2] = -x_ * jacobian[1]; + jacobian[3] = b1 * log(t2) * t3 / (b4 * b4); + return true; + } + + private: + const double x_; + const double y_; + }; + +What is the difference in performance of these two implementations? + +========================== ========= +CostFunction Time (ns) +========================== ========= +Rat43Analytic 255 +Rat43AnalyticOptimized 92 +========================== ========= + +``Rat43AnalyticOptimized`` is :math:`2.8` times faster than +``Rat43Analytic``. This difference in run-time is not uncommon. To +get the best performance out of analytically computed derivatives, one +usually needs to optimize the code to account for common +sub-expressions. + + +When should you use analytical derivatives? +=========================================== + +#. The expressions are simple, e.g. mostly linear. + +#. A computer algebra system like `Maple + <https://www.maplesoft.com/products/maple/>`_ , `Mathematica + <https://www.wolfram.com/mathematica/>`_, or `SymPy + <http://www.sympy.org/en/index.html>`_ can be used to symbolically + differentiate the objective function and generate the C++ to + evaluate them. + +#. Performance is of utmost concern and there is algebraic structure + in the terms that you can exploit to get better performance than + automatic differentiation. + + That said, getting the best performance out of analytical + derivatives requires a non-trivial amount of work. Before going + down this path, it is useful to measure the amount of time being + spent evaluating the Jacobian as a fraction of the total solve time + and remember `Amdahl's Law + <https://en.wikipedia.org/wiki/Amdahl's_law>`_ is your friend. + +#. There is no other way to compute the derivatives, e.g. you + wish to compute the derivative of the root of a polynomial: + + .. math:: + a_3(x,y)z^3 + a_2(x,y)z^2 + a_1(x,y)z + a_0(x,y) = 0 + + + with respect to :math:`x` and :math:`y`. This requires the use of + the `Inverse Function Theorem + <https://en.wikipedia.org/wiki/Inverse_function_theorem>`_ + +#. You love the chain rule and actually enjoy doing all the algebra by + hand. + + +.. rubric:: Footnotes + +.. [#f1] The notion of best fit depends on the choice of the objective + function used to measure the quality of fit, which in turn + depends on the underlying noise process which generated the + observations. Minimizing the sum of squared differences is + the right thing to do when the noise is `Gaussian + <https://en.wikipedia.org/wiki/Normal_distribution>`_. In + that case the optimal value of the parameters is the `Maximum + Likelihood Estimate + <https://en.wikipedia.org/wiki/Maximum_likelihood_estimation>`_.
diff --git a/docs/source/automatic_derivatives.rst b/docs/source/automatic_derivatives.rst new file mode 100644 index 0000000..47a10af --- /dev/null +++ b/docs/source/automatic_derivatives.rst
@@ -0,0 +1,307 @@ +.. default-domain:: cpp + +.. cpp:namespace:: ceres + +.. _chapter-automatic_derivatives: + +===================== +Automatic Derivatives +===================== + +We will now consider automatic differentiation. It is a technique that +can compute exact derivatives, fast, while requiring about the same +effort from the user as is needed to use numerical differentiation. + +Don't believe me? Well here goes. The following code fragment +implements an automatically differentiated ``CostFunction`` for `Rat43 +<http://www.itl.nist.gov/div898/strd/nls/data/ratkowsky3.shtml>`_. + +.. code-block:: c++ + + struct Rat43CostFunctor { + Rat43CostFunctor(const double x, const double y) : x_(x), y_(y) {} + + template <typename T> + bool operator()(const T* parameters, T* residuals) const { + const T b1 = parameters[0]; + const T b2 = parameters[1]; + const T b3 = parameters[2]; + const T b4 = parameters[3]; + residuals[0] = b1 * pow(1.0 + exp(b2 - b3 * x_), -1.0 / b4) - y_; + return true; + } + + private: + const double x_; + const double y_; + }; + + + CostFunction* cost_function = + new AutoDiffCostFunction<Rat43CostFunctor, 1, 4>( + new Rat43CostFunctor(x, y)); + +Notice that compared to numeric differentiation, the only difference +when defining the functor for use with automatic differentiation is +the signature of the ``operator()``. + +In the case of numeric differentition it was + +.. code-block:: c++ + + bool operator()(const double* parameters, double* residuals) const; + +and for automatic differentiation it is a templated function of the +form + +.. code-block:: c++ + + template <typename T> bool operator()(const T* parameters, T* residuals) const; + + +So what does this small change buy us? The following table compares +the time it takes to evaluate the residual and the Jacobian for +`Rat43` using various methods. + +========================== ========= +CostFunction Time (ns) +========================== ========= +Rat43Analytic 255 +Rat43AnalyticOptimized 92 +Rat43NumericDiffForward 262 +Rat43NumericDiffCentral 517 +Rat43NumericDiffRidders 3760 +Rat43AutomaticDiff 129 +========================== ========= + +We can get exact derivatives using automatic differentiation +(``Rat43AutomaticDiff``) with about the same effort that is required +to write the code for numeric differentiation but only :math:`40\%` +slower than hand optimized analytical derivatives. + +So how does it work? For this we will have to learn about **Dual +Numbers** and **Jets** . + + +Dual Numbers & Jets +=================== + +.. NOTE:: + + Reading this and the next section on implementing Jets is not + necessary to use automatic differentiation in Ceres Solver. But + knowing the basics of how Jets work is useful when debugging and + reasoning about the performance of automatic differentiation. + +Dual numbers are an extension of the real numbers analogous to complex +numbers: whereas complex numbers augment the reals by introducing an +imaginary unit :math:`\iota` such that :math:`\iota^2 = -1`, dual +numbers introduce an *infinitesimal* unit :math:`\epsilon` such that +:math:`\epsilon^2 = 0` . A dual number :math:`a + v\epsilon` has two +components, the *real* component :math:`a` and the *infinitesimal* +component :math:`v`. + +Surprisingly, this simple change leads to a convenient method for +computing exact derivatives without needing to manipulate complicated +symbolic expressions. + +For example, consider the function + +.. math:: + + f(x) = x^2 , + +Then, + +.. math:: + + \begin{align} + f(10 + \epsilon) &= (10 + \epsilon)^2\\ + &= 100 + 20 \epsilon + \epsilon^2\\ + &= 100 + 20 \epsilon + \end{align} + +Observe that the coefficient of :math:`\epsilon` is :math:`Df(10) = +20`. Indeed this generalizes to functions which are not +polynomial. Consider an arbitrary differentiable function +:math:`f(x)`. Then we can evaluate :math:`f(x + \epsilon)` by +considering the Taylor expansion of :math:`f` near :math:`x`, which +gives us the infinite series + +.. math:: + \begin{align} + f(x + \epsilon) &= f(x) + Df(x) \epsilon + D^2f(x) + \frac{\epsilon^2}{2} + D^3f(x) \frac{\epsilon^3}{6} + \cdots\\ + f(x + \epsilon) &= f(x) + Df(x) \epsilon + \end{align} + +Here we are using the fact that :math:`\epsilon^2 = 0`. + +A `Jet <https://en.wikipedia.org/wiki/Jet_(mathematics)>`_ is a +:math:`n`-dimensional dual number, where we augment the real numbers +with :math:`n` infinitesimal units :math:`\epsilon_i,\ i=1,...,n` with +the property that :math:`\forall i, j\ :\epsilon_i\epsilon_j = 0`. Then +a Jet consists of a *real* part :math:`a` and a :math:`n`-dimensional +*infinitesimal* part :math:`\mathbf{v}`, i.e., + +.. math:: + x = a + \sum_j v_{j} \epsilon_j + +The summation notation gets tedious, so we will also just write + +.. math:: + x = a + \mathbf{v}. + +where the :math:`\epsilon_i`'s are implict. Then, using the same +Taylor series expansion used above, we can see that: + +.. math:: + + f(a + \mathbf{v}) = f(a) + Df(a) \mathbf{v}. + +Similarly for a multivariate function +:math:`f:\mathbb{R}^{n}\rightarrow \mathbb{R}^m`, evaluated on +:math:`x_i = a_i + \mathbf{v}_i,\ \forall i = 1,...,n`: + +.. math:: + f(x_1,..., x_n) = f(a_1, ..., a_n) + \sum_i D_i f(a_1, ..., a_n) \mathbf{v}_i + +So if each :math:`\mathbf{v}_i = e_i` were the :math:`i^{\text{th}}` +standard basis vector, then, the above expression would simplify to + +.. math:: + f(x_1,..., x_n) = f(a_1, ..., a_n) + \sum_i D_i f(a_1, ..., a_n) \epsilon_i + +and we can extract the coordinates of the Jacobian by inspecting the +coefficients of :math:`\epsilon_i`. + +Implementing Jets +----------------- + +In order for the above to work in practice, we will need the ability +to evaluate an arbitrary function :math:`f` not just on real numbers +but also on dual numbers, but one does not usually evaluate functions +by evaluating their Taylor expansions, + +This is where C++ templates and operator overloading comes into +play. The following code fragment has a simple implementation of a +``Jet`` and some operators/functions that operate on them. + +.. code-block:: c++ + + template<int N> struct Jet { + double a; + Eigen::Matrix<double, 1, N> v; + }; + + template<int N> Jet<N> operator+(const Jet<N>& f, const Jet<N>& g) { + return Jet<N>(f.a + g.a, f.v + g.v); + } + + template<int N> Jet<N> operator-(const Jet<N>& f, const Jet<N>& g) { + return Jet<N>(f.a - g.a, f.v - g.v); + } + + template<int N> Jet<N> operator*(const Jet<N>& f, const Jet<N>& g) { + return Jet<N>(f.a * g.a, f.a * g.v + f.v * g.a); + } + + template<int N> Jet<N> operator/(const Jet<N>& f, const Jet<N>& g) { + return Jet<N>(f.a / g.a, f.v / g.a - f.a * g.v / (g.a * g.a)); + } + + template <int N> Jet<N> exp(const Jet<N>& f) { + return Jet<T, N>(exp(f.a), exp(f.a) * f.v); + } + + // This is a simple implementation for illustration purposes, the + // actual implementation of pow requires careful handling of a number + // of corner cases. + template <int N> Jet<N> pow(const Jet<N>& f, const Jet<N>& g) { + return Jet<N>(pow(f.a, g.a), + g.a * pow(f.a, g.a - 1.0) * f.v + + pow(f.a, g.a) * log(f.a); * g.v); + } + + +With these overloaded functions in hand, we can now call +``Rat43CostFunctor`` with an array of Jets instead of doubles. Putting +that together with appropriately initialized Jets allows us to compute +the Jacobian as follows: + +.. code-block:: c++ + + class Rat43Automatic : public ceres::SizedCostFunction<1,4> { + public: + Rat43Automatic(const Rat43CostFunctor* functor) : functor_(functor) {} + virtual ~Rat43Automatic() {} + virtual bool Evaluate(double const* const* parameters, + double* residuals, + double** jacobians) const { + // Just evaluate the residuals if Jacobians are not required. + if (!jacobians) return (*functor_)(parameters[0], residuals); + + // Initialize the Jets + ceres::Jet<4> jets[4]; + for (int i = 0; i < 4; ++i) { + jets[i].a = parameters[0][i]; + jets[i].v.setZero(); + jets[i].v[i] = 1.0; + } + + ceres::Jet<4> result; + (*functor_)(jets, &result); + + // Copy the values out of the Jet. + residuals[0] = result.a; + for (int i = 0; i < 4; ++i) { + jacobians[0][i] = result.v[i]; + } + return true; + } + + private: + std::unique_ptr<const Rat43CostFunctor> functor_; + }; + +Indeed, this is essentially how :class:`AutoDiffCostFunction` works. + + +Pitfalls +======== + +Automatic differentiation frees the user from the burden of computing +and reasoning about the symbolic expressions for the Jacobians, but +this freedom comes at a cost. For example consider the following +simple functor: + +.. code-block:: c++ + + struct Functor { + template <typename T> bool operator()(const T* x, T* residual) const { + residual[0] = 1.0 - sqrt(x[0] * x[0] + x[1] * x[1]); + return true; + } + }; + +Looking at the code for the residual computation, one does not foresee +any problems. However, if we look at the analytical expressions for +the Jacobian: + +.. math:: + + y &= 1 - \sqrt{x_0^2 + x_1^2}\\ + D_1y &= -\frac{x_0}{\sqrt{x_0^2 + x_1^2}},\ + D_2y = -\frac{x_1}{\sqrt{x_0^2 + x_1^2}} + +we find that it is an indeterminate form at :math:`x_0 = 0, x_1 = +0`. + +There is no single solution to this problem. In some cases one needs +to reason explicitly about the points where indeterminacy may occur +and use alternate expressions using `L'Hopital's rule +<https://en.wikipedia.org/wiki/L'H%C3%B4pital's_rule>`_ (see for +example some of the conversion routines in `rotation.h +<https://github.com/ceres-solver/ceres-solver/blob/master/include/ceres/rotation.h>`_. In +other cases, one may need to regularize the expressions to eliminate +these points.
diff --git a/docs/source/contributing.rst b/docs/source/contributing.rst index fa36d2c..01217a2 100644 --- a/docs/source/contributing.rst +++ b/docs/source/contributing.rst
@@ -51,7 +51,7 @@ 4. Build Ceres, following the instructions in - :ref:`chapter-building`. + :ref:`chapter-installation`. On Mac and Linux, the ``CMake`` build will download and enable the Gerrit pre-commit hook automatically. This pre-submit hook
diff --git a/docs/source/derivatives.rst b/docs/source/derivatives.rst index e0b9916..0483876 100644 --- a/docs/source/derivatives.rst +++ b/docs/source/derivatives.rst
@@ -8,11 +8,6 @@ On Derivatives ============== -.. _section-introduction: - -Introduction -============ - Ceres Solver, like all gradient based optimization algorithms, depends on being able to evaluate the objective function and its derivatives at arbitrary points in its domain. Indeed, defining the objective @@ -25,15 +20,16 @@ Ceres Solver offers considerable flexibility in how the user can provide derivatives to the solver. She can use: - 1. :ref:`section-analytic_derivatives`: The user figures out the - derivatives herself, by hand or using a tool like - `Maple <https://www.maplesoft.com/products/maple/>`_ or - `Mathematica <https://www.wolfram.com/mathematica/>`_, and - implements them in a :class:`CostFunction`. - 2. :ref:`section-numerical_derivatives`: Ceres numerically computes - the derivative using finite differences. - 3. :ref:`section-automatic_derivatives`: Ceres automatically computes - the analytic derivative. +#. :ref:`chapter-analytical_derivatives`: The user figures out the + derivatives herself, by hand or using a tool like `Maple + <https://www.maplesoft.com/products/maple/>`_ or `Mathematica + <https://www.wolfram.com/mathematica/>`_, and implements them in a + :class:`CostFunction`. +#. :ref:`chapter-numerical_derivatives`: Ceres numerically computes + the derivative using finite differences. +#. :ref:`chapter-automatic_derivatives`: Ceres automatically computes + the analytic derivative using C++ templates and operator + overloading. Which of these three approaches (alone or in combination) should be used depends on the situation and the tradeoffs the user is willing to @@ -44,959 +40,21 @@ three approaches in the context of Ceres Solver with sufficient detail that the user can make an informed choice. -High Level Advice ------------------ - For the impatient amongst you, here is some high level advice: - 1. Use :ref:`section-automatic_derivatives`. - 2. In some cases it maybe worth using - :ref:`section-analytic_derivatives`. - 3. Avoid :ref:`section-numerical_derivatives`. Use it as a measure of - last resort, mostly to interface with external libraries. +#. Use :ref:`chapter-automatic_derivatives`. +#. In some cases it maybe worth using + :ref:`chapter-analytical_derivatives`. +#. Avoid :ref:`chapter-numerical_derivatives`. Use it as a measure of + last resort, mostly to interface with external libraries. -.. _section-spivak_notation: +for the rest, read on. -Spivak Notation -=============== +.. toctree:: + :maxdepth: 1 -To preserve our collective sanities, we will use Spivak's notation for -derivatives. It is a functional notation that makes reading and -reasoning about expressions involving derivatives simple. - -For a univariate function :math:`f`, :math:`f(a)` denotes its value at -:math:`a`. :math:`Df` denotes its first derivative, and -:math:`Df(a)` is the derivative evaluated at :math:`a`, i.e - -.. math:: - Df(a) = \left . \frac{d}{dx} f(x) \right |_{x = a} - -:math:`D^nf` denotes the :math:`n^{\text{th}}` derivative of :math:`f`. - -For a bi-variate function :math:`g(x,y)`. :math:`D_1g` and -:math:`D_2g` denote the partial derivatives of :math:`g` w.r.t the -first and second variable respectively. In the classical notation this -is equivalent to saying: - -.. math:: - - D_1 g = \frac{\partial}{\partial x}g(x,y) \text{ and } D_2 g = \frac{\partial}{\partial y}g(x,y). - - -:math:`Dg` denotes the Jacobian of `g`, i.e., - -.. math:: - - Dg = \begin{bmatrix} D_1g & D_2g \end{bmatrix} - -More generally for a multivariate function :math:`g:\mathbb{R}^m -\rightarrow \mathbb{R}^n`, :math:`Dg` denotes the :math:`n\times m` -Jacobian matrix. :math:`D_i g` is the partial derivative of :math:`g` -w.r.t the :math:`i^{\text{th}}` coordinate and the -:math:`i^{\text{th}}` column of :math:`Dg`. - -Finally, :math:`D^2_1g, D_1D_2g` have the obvious meaning as higher -order partial derivatives derivatives. - -For more see Michael Spivak's book `Calculus on Manifolds -<https://www.amazon.com/Calculus-Manifolds-Approach-Classical-Theorems/dp/0805390219>`_ -or a brief discussion of the `merits of this notation -<http://www.vendian.org/mncharity/dir3/dxdoc/>`_ by -Mitchell N. Charity. - -.. _section-analytic_derivatives: - -Analytic Derivatives -==================== - -Consider the problem of fitting the following curve (`Rat43 -<http://www.itl.nist.gov/div898/strd/nls/data/ratkowsky3.shtml>`_) to -data: - -.. math:: - y = \frac{b_1}{(1+e^{b_2-b_3x})^{1/b_4}} - -That is, given some data :math:`\{x_i, y_i\},\ \forall i=1,... ,n`, -determine parameters :math:`b_1, b_2, b_3` and :math:`b_4` that best -fit this data. - -Which can be stated as the problem of finding the -values of :math:`b_1, b_2, b_3` and :math:`b_4` are the ones that -minimize the following objective function [#f1]_: - -.. math:: - \begin{align} - E(b_1, b_2, b_3, b_4) - &= \sum_i f^2(b_1, b_2, b_3, b_4 ; x_i, y_i)\\ - &= \sum_i \left(\frac{b_1}{(1+e^{b_2-b_3x_i})^{1/b_4}} - y_i\right)^2\\ - \end{align} - -To solve this problem using Ceres Solver, we need to define a -:class:`CostFunction` that computes the residual :math:`f` for a given -:math:`x` and :math:`y` and its derivatives with respect to -:math:`b_1, b_2, b_3` and :math:`b_4`. - -Using elementary differential calculus, we can see that: - -.. math:: - \begin{align} - D_1 f(b_1, b_2, b_3, b_4; x,y) &= \frac{1}{(1+e^{b_2-b_3x})^{1/b_4}}\\ - D_2 f(b_1, b_2, b_3, b_4; x,y) &= - \frac{-b_1e^{b_2-b_3x}}{b_4(1+e^{b_2-b_3x})^{1/b_4 + 1}} \\ - D_3 f(b_1, b_2, b_3, b_4; x,y) &= - \frac{b_1xe^{b_2-b_3x}}{b_4(1+e^{b_2-b_3x})^{1/b_4 + 1}} \\ - D_4 f(b_1, b_2, b_3, b_4; x,y) & = \frac{b_1 \log\left(1+e^{b_2-b_3x}\right) }{b_4^2(1+e^{b_2-b_3x})^{1/b_4}} - \end{align} - -With these derivatives in hand, we can now implement the -:class:`CostFunction` as: - -.. code-block:: c++ - - class Rat43Analytic : public SizedCostFunction<1,4> { - public: - Rat43Analytic(const double x, const double y) : x_(x), y_(y) {} - virtual ~Rat43Analytic() {} - virtual bool Evaluate(double const* const* parameters, - double* residuals, - double** jacobians) const { - const double b1 = parameters[0][0]; - const double b2 = parameters[0][1]; - const double b3 = parameters[0][2]; - const double b4 = parameters[0][3]; - - residuals[0] = b1 * pow(1 + exp(b2 - b3 * x_), -1.0 / b4) - y_; - - if (!jacobians) return true; - double* jacobian = jacobians[0]; - if (!jacobian) return true; - - jacobian[0] = pow(1 + exp(b2 - b3 * x_), -1.0 / b4); - jacobian[1] = -b1 * exp(b2 - b3 * x_) * - pow(1 + exp(b2 - b3 * x_), -1.0 / b4 - 1) / b4; - jacobian[2] = x_ * b1 * exp(b2 - b3 * x_) * - pow(1 + exp(b2 - b3 * x_), -1.0 / b4 - 1) / b4; - jacobian[3] = b1 * log(1 + exp(b2 - b3 * x_)) * - pow(1 + exp(b2 - b3 * x_), -1.0 / b4) / (b4 * b4); - return true; - } - - private: - const double x_; - const double y_; - }; - -This is tedious code, hard to read and with a lot of -redundancy. So in practice we will cache some sub-expressions to -improve its efficiency, which would give us something like: - -.. code-block:: c++ - - class Rat43AnalyticOptimized : public SizedCostFunction<1,4> { - public: - Rat43AnalyticOptimized(const double x, const double y) : x_(x), y_(y) {} - virtual ~Rat43AnalyticOptimized() {} - virtual bool Evaluate(double const* const* parameters, - double* residuals, - double** jacobians) const { - const double b1 = parameters[0][0]; - const double b2 = parameters[0][1]; - const double b3 = parameters[0][2]; - const double b4 = parameters[0][3]; - - const double t1 = exp(b2 - b3 * x_); - const double t2 = 1 + t1; - const double t3 = pow(t2, -1.0 / b4); - residuals[0] = b1 * t3 - y_; - - if (!jacobians) return true; - double* jacobian = jacobians[0]; - if (!jacobian) return true; - - const double t4 = pow(t2, -1.0 / b4 - 1); - jacobian[0] = t3; - jacobian[1] = -b1 * t1 * t4 / b4; - jacobian[2] = -x_ * jacobian[1]; - jacobian[3] = b1 * log(t2) * t3 / (b4 * b4); - return true; - } - - private: - const double x_; - const double y_; - }; - -What is the difference in performance of these two implementations? - -========================== ========= -CostFunction Time (ns) -========================== ========= -Rat43Analytic 255 -Rat43AnalyticOptimized 92 -========================== ========= - -``Rat43AnalyticOptimized`` is :math:`2.8` times faster than -``Rat43Analytic``. This difference in run-time is not uncommon. To -get the best performance out of analytically computed derivatives, one -usually needs to optimize the code to account for common -sub-expressions. - - -When should you use analytical derivatives? -------------------------------------------- - -#. The expressions are simple, e.g. mostly linear. - -#. A computer algebra system like `Maple - <https://www.maplesoft.com/products/maple/>`_ , `Mathematica - <https://www.wolfram.com/mathematica/>`_, or `SymPy - <http://www.sympy.org/en/index.html>`_ can be used to symbolically - differentiate the objective function and generate the C++ to - evaluate them. - -#. Performance is of utmost concern and there is algebraic structure - in the terms that you can exploit to get better performance than - automatic differentiation. - - That said, getting the best performance out of analytical - derivatives requires a non-trivial amount of work. Before going - down this path, it is useful to measure the amount of time being - spent evaluating the Jacobian as a fraction of the total solve time - and remember `Amdahl's Law - <https://en.wikipedia.org/wiki/Amdahl's_law>`_ is your friend. - -#. There is no other way to compute the derivatives, e.g. you - wish to compute the derivative of the root of a polynomial: - - .. math:: - a_3(x,y)z^3 + a_2(x,y)z^2 + a_1(x,y)z + a_0(x,y) = 0 - - - with respect to :math:`x` and :math:`y`. This requires the use of - the `Inverse Function Theorem - <https://en.wikipedia.org/wiki/Inverse_function_theorem>`_ - -#. You love the chain rule and actually enjoy doing all the algebra by - hand. - - -.. _section-numerical_derivatives: - -Numeric derivatives -=================== - -The other extreme from using analytic derivatives is to use numeric -derivatives. The key observation here is that the process of -differentiating a function :math:`f(x)` w.r.t :math:`x` can be written -as the limiting process: - -.. math:: - Df(x) = \lim_{h \rightarrow 0} \frac{f(x + h) - f(x)}{h} - - -Forward Differences -------------------- - -Now of course one cannot perform the limiting operation numerically on -a computer so we do the next best thing, which is to choose a small -value of :math:`h` and approximate the derivative as - -.. math:: - Df(x) \approx \frac{f(x + h) - f(x)}{h} - - -The above formula is the simplest most basic form of numeric -differentiation. It is known as the *Forward Difference* formula. - -So how would one go about constructing a numerically differentiated -version of ``Rat43Analytic`` in Ceres Solver. This is done in two -steps: - - 1. Define *Functor* that given the parameter values will evaluate the - residual for a given :math:`(x,y)`. - 2. Construct a :class:`CostFunction` by using - :class:`NumericDiffCostFunction` to wrap an instance of - ``Rat43CostFunctor``. - -.. code-block:: c++ - - struct Rat43CostFunctor { - Rat43CostFunctor(const double x, const double y) : x_(x), y_(y) {} - - bool operator()(const double* parameters, double* residuals) const { - const double b1 = parameters[0]; - const double b2 = parameters[1]; - const double b3 = parameters[2]; - const double b4 = parameters[3]; - residuals[0] = b1 * pow(1.0 + exp(b2 - b3 * x_), -1.0 / b4) - y_; - return true; - } - - const double x_; - const double y_; - } - - CostFunction* cost_function = - new NumericDiffCostFunction<Rat43CostFunctor, FORWARD, 1, 4>( - new Rat43CostFunctor(x, y)); - -This is about the minimum amount of work one can expect to do to -define the cost function. The only thing that the user needs to do is -to make sure that the evaluation of the residual is implemented -correctly and efficiently. - -Before going further, it is instructive to get an estimate of the -error in the forward difference formula. We do this by considering the -`Taylor expansion <https://en.wikipedia.org/wiki/Taylor_series>`_ of -:math:`f` near :math:`x`. - -.. math:: - \begin{align} - f(x+h) &= f(x) + h Df(x) + \frac{h^2}{2!} D^2f(x) + - \frac{h^3}{3!}D^3f(x) + \cdots \\ - Df(x) &= \frac{f(x + h) - f(x)}{h} - \left [\frac{h}{2!}D^2f(x) + - \frac{h^2}{3!}D^3f(x) + \cdots \right]\\ - Df(x) &= \frac{f(x + h) - f(x)}{h} + O(h) - \end{align} - -i.e., the error in the forward difference formula is -:math:`O(h)` [#f4]_. - - -Implementation Details -^^^^^^^^^^^^^^^^^^^^^^ - -:class:`NumericDiffCostFunction` implements a generic algorithm to -numerically differentiate a given functor. While the actual -implementation of :class:`NumericDiffCostFunction` is complicated, the -net result is a :class:`CostFunction` that roughly looks something -like the following: - -.. code-block:: c++ - - class Rat43NumericDiffForward : public SizedCostFunction<1,4> { - public: - Rat43NumericDiffForward(const Rat43Functor* functor) : functor_(functor) {} - virtual ~Rat43NumericDiffForward() {} - virtual bool Evaluate(double const* const* parameters, - double* residuals, - double** jacobians) const { - functor_(parameters[0], residuals); - if (!jacobians) return true; - double* jacobian = jacobians[0]; - if (!jacobian) return true; - - const double f = residuals[0]; - double parameters_plus_h[4]; - for (int i = 0; i < 4; ++i) { - std::copy(parameters, parameters + 4, parameters_plus_h); - const double kRelativeStepSize = 1e-6; - const double h = std::abs(parameters[i]) * kRelativeStepSize; - parameters_plus_h[i] += h; - double f_plus; - functor_(parameters_plus_h, &f_plus); - jacobian[i] = (f_plus - f) / h; - } - return true; - } - - private: - scoped_ptr<Rat43Functor> functor_; - }; - - -Note the choice of step size :math:`h` in the above code, instead of -an absolute step size which is the same for all parameters, we use a -relative step size of :math:`\text{kRelativeStepSize} = 10^{-6}`. This -gives better derivative estimates than an absolute step size [#f2]_ -[#f3]_. This choice of step size only works for parameter values that -are not close to zero. So the actual implementation of -:class:`NumericDiffCostFunction`, uses a more complex step size -selection logic, where close to zero, it switches to a fixed step -size. - - -Central Differences -------------------- - -:math:`O(h)` error in the Forward Difference formula is okay but not -great. A better method is to use the *Central Difference* formula: - -.. math:: - Df(x) \approx \frac{f(x + h) - f(x - h)}{2h} - -Notice that if the value of :math:`f(x)` is known, the Forward -Difference formula only requires one extra evaluation, but the Central -Difference formula requires two evaluations, making it twice as -expensive. So is the extra evaluation worth it? - -To answer this question, we again compute the error of approximation -in the central difference formula: - -.. math:: - \begin{align} - f(x + h) &= f(x) + h Df(x) + \frac{h^2}{2!} - D^2f(x) + \frac{h^3}{3!} D^3f(x) + \frac{h^4}{4!} D^4f(x) + \cdots\\ - f(x - h) &= f(x) - h Df(x) + \frac{h^2}{2!} - D^2f(x) - \frac{h^3}{3!} D^3f(c_2) + \frac{h^4}{4!} D^4f(x) + - \cdots\\ - Df(x) & = \frac{f(x + h) - f(x - h)}{2h} + \frac{h^2}{3!} - D^3f(x) + \frac{h^4}{5!} - D^5f(x) + \cdots \\ - Df(x) & = \frac{f(x + h) - f(x - h)}{2h} + O(h^2) - \end{align} - -The error of the Central Difference formula is :math:`O(h^2)`, i.e., -the error goes down quadratically whereas the error in the Forward -Difference formula only goes down linearly. - -Using central differences instead of forward differences in Ceres -Solver is a simple matter of changing a template argument to -:class:`NumericDiffCostFunction` as follows: - -.. code-block:: c++ - - CostFunction* cost_function = - new NumericDiffCostFunction<Rat43CostFunctor, CENTRAL, 1, 4>( - new Rat43CostFunctor(x, y)); - -But what do these differences in the error mean in practice? To see -this, consider the problem of evaluating the derivative of the -univariate function - -.. math:: - f(x) = \frac{e^x}{\sin x - x^2}, - -at :math:`x = 1.0`. - -It is straightforward to see that :math:`Df(1.0) = -140.73773557129658`. Using this value as reference, we can now compute -the relative error in the forward and central difference formulae as a -function of the absolute step size and plot them. - -.. figure:: forward_central_error.png - :figwidth: 100% - :align: center - -Reading the graph from right to left, a number of things stand out in -the above graph: - - 1. The graph for both formulae have two distinct regions. At first, - starting from a large value of :math:`h` the error goes down as - the effect of truncating the Taylor series dominates, but as the - value of :math:`h` continues to decrease, the error starts - increasing again as roundoff error starts to dominate the - computation. So we cannot just keep on reducing the value of - :math:`h` to get better estimates of :math:`Df`. The fact that we - are using finite precision arithmetic becomes a limiting factor. - 2. Forward Difference formula is not a great method for evaluating - derivatives. Central Difference formula converges much more - quickly to a more accurate estimate of the derivative with - decreasing step size. So unless the evaluation of :math:`f(x)` is - so expensive that you absolutely cannot afford the extra - evaluation required by central differences, **do not use the - Forward Difference formula**. - 3. Neither formula works well for a poorly chosen value of :math:`h`. - - -Ridders' Method ---------------- -So, can we get better estimates of :math:`Df` without requiring such -small values of :math:`h` that we start hitting floating point -roundoff errors? - -One possible approach is to find a method whose error goes down faster -than :math:`O(h^2)`. This can be done by applying `Richardson -Extrapolation -<https://en.wikipedia.org/wiki/Richardson_extrapolation>`_ to the -problem of differentiation. This is also known as *Ridders' Method* -[Ridders]_. - -Let us recall, the error in the central differences formula. - -.. math:: - \begin{align} - Df(x) & = \frac{f(x + h) - f(x - h)}{2h} + \frac{h^2}{3!} - D^3f(x) + \frac{h^4}{5!} - D^5f(x) + \cdots\\ - & = \frac{f(x + h) - f(x - h)}{2h} + K_2 h^2 + K_4 h^4 + \cdots - \end{align} - -The key thing to note here is that the terms :math:`K_2, K_4, ...` -are indepdendent of :math:`h` and only depend on :math:`x`. - -Let us now define: - -.. math:: - - A(1, m) = \frac{f(x + h/2^{m-1}) - f(x - h/2^{m-1})}{2h/2^{m-1}}. - -Then observe that - -.. math:: - - Df(x) = A(1,1) + K_2 h^2 + K_4 h^4 + \cdots - -and - -.. math:: - - Df(x) = A(1, 2) + K_2 (h/2)^2 + K_4 (h/2)^4 + \cdots - -Here we have halved the step size to obtain a second central -differences estimate of :math:`Df(x)`. Combining these two estimates, -we get: - -.. math:: - - Df(x) = \frac{4 A(1, 2) - A(1,1)}{4 - 1} + O(h^4) - -which is an approximation of :math:`Df(x)` with truncation error that -goes down as :math:`O(h^4)`. But we do not have to stop here. We can -iterate this process to obtain even more accurate estimates as -follows: - -.. math:: - - A(n, m) = \begin{cases} - \frac{\displaystyle f(x + h/2^{m-1}) - f(x - - h/2^{m-1})}{\displaystyle 2h/2^{m-1}} & n = 1 \\ - \frac{\displaystyle 4^{n-1} A(n - 1, m + 1) - A(n - 1, m)}{\displaystyle 4^{n-1} - 1} & n > 1 - \end{cases} - -It is straightforward to show that the approximation error in -:math:`A(n, 1)` is :math:`O(h^{2n})`. To see how the above formula can -be implemented in practice to compute :math:`A(n,1)` it is helpful to -structure the computation as the following tableau: - -.. math:: - \begin{array}{ccccc} - A(1,1) & A(1, 2) & A(1, 3) & A(1, 4) & \cdots\\ - & A(2, 1) & A(2, 2) & A(2, 3) & \cdots\\ - & & A(3, 1) & A(3, 2) & \cdots\\ - & & & A(4, 1) & \cdots \\ - & & & & \ddots - \end{array} - -So, to compute :math:`A(n, 1)` for increasing values of :math:`n` we -move from the left to the right, computing one column at a -time. Assuming that the primary cost here is the evaluation of the -function :math:`f(x)`, the cost of computing a new column of the above -tableau is two function evaluations. Since the cost of evaluating -:math:`A(1, n)`, requires evaluating the central difference formula -for step size of :math:`2^{1-n}h` - -Applying this method to :math:`f(x) = \frac{e^x}{\sin x - x^2}` -starting with a fairly large step size :math:`h = 0.01`, we get: - -.. math:: - \begin{array}{rrrrr} - 141.678097131 &140.971663667 &140.796145400 &140.752333523 &140.741384778\\ - &140.736185846 &140.737639311 &140.737729564 &140.737735196\\ - & &140.737736209 &140.737735581 &140.737735571\\ - & & &140.737735571 &140.737735571\\ - & & & &140.737735571\\ - \end{array} - -Compared to the *correct* value :math:`Df(1.0) = 140.73773557129658`, -:math:`A(5, 1)` has a relative error of :math:`10^{-13}`. For -comparison, the relative error for the central difference formula with -the same stepsize (:math:`0.01/2^4 = 0.000625`) is :math:`10^{-5}`. - -The above tableau is the basis of Ridders' method for numeric -differentiation. The full implementation is an adaptive scheme that -tracks its own estimation error and stops automatically when the -desired precision is reached. Of course it is more expensive than the -forward and central difference formulae, but is also significantly -more robust and accurate. - -Using Ridder's method instead of forward or central differences in -Ceres is again a simple matter of changing a template argument to -:class:`NumericDiffCostFunction` as follows: - -.. code-block:: c++ - - CostFunction* cost_function = - new NumericDiffCostFunction<Rat43CostFunctor, RIDDERS, 1, 4>( - new Rat43CostFunctor(x, y)); - -The following graph shows the relative error of the three methods as a -function of the absolute step size. For Ridders's method we assume -that the step size for evaluating :math:`A(n,1)` is :math:`2^{1-n}h`. - -.. figure:: forward_central_ridders_error.png - :figwidth: 100% - :align: center - -Using the 10 function evaluations that are needed to compute -:math:`A(5,1)` we are able to approximate :math:`Df(1.0)` about a 1000 -times better than the best central differences estimate. To put these -numbers in perspective, machine epsilon for double precision -arithmetic is :math:`\approx 2.22 \times 10^{-16}`. - -Going back to ``Rat43``, let us also look at the runtime cost of the -various methods for computing numeric derivatives. - -========================== ========= -CostFunction Time (ns) -========================== ========= -Rat43Analytic 255 -Rat43AnalyticOptimized 92 -Rat43NumericDiffForward 262 -Rat43NumericDiffCentral 517 -Rat43NumericDiffRidders 3760 -========================== ========= - -As expected, Central Differences is about twice as expensive as -Forward Differences and the remarkable accuracy improvements of -Ridders' method cost an order of magnitude more runtime. - -Recommendations ---------------- - -Numeric differentiation should be used when you cannot compute the -derivatives either analytically or using automatic differention. This -is usually the case when you are calling an external library or -function whose analytic form you do not know or even if you do, you -are not in a position to re-write it in a manner required to use -automatic differentiation (discussed below). - -When using numeric differentiation, use at least Central Differences, -and if execution time is not a concern or the objective function is -such that determining a good static relative step size is hard, -Ridders' method is recommended. - -.. _section-automatic_derivatives: - -Automatic Derivatives -===================== - -We will now consider automatic differentiation. It is a technique that -can compute exact derivatives, fast, while requiring about the same -effort from the user as is needed to use numerical differentiation. - -Don't believe me? Well here goes. The following code fragment -implements an automatically differentiated ``CostFunction`` for -``Rat43``. - -.. code-block:: c++ - - struct Rat43CostFunctor { - Rat43CostFunctor(const double x, const double y) : x_(x), y_(y) {} - - template <typename T> - bool operator()(const T* parameters, T* residuals) const { - const T b1 = parameters[0]; - const T b2 = parameters[1]; - const T b3 = parameters[2]; - const T b4 = parameters[3]; - residuals[0] = b1 * pow(1.0 + exp(b2 - b3 * x_), -1.0 / b4) - y_; - return true; - } - - private: - const double x_; - const double y_; - }; - - - CostFunction* cost_function = - new AutoDiffCostFunction<Rat43CostFunctor, 1, 4>( - new Rat43CostFunctor(x, y)); - -Notice that compared to numeric differentiation, the only difference -when defining the functor for use with automatic differentiation is -the signature of the ``operator()``. - -In the case of numeric differentition it was - -.. code-block:: c++ - - bool operator()(const double* parameters, double* residuals) const; - -and for automatic differentiation it is a templated function of the -form - -.. code-block:: c++ - - template <typename T> bool operator()(const T* parameters, T* residuals) const; - - -So what does this small change buy us? The following table compares -the time it takes to evaluate the residual and the Jacobian for -`Rat43` using various methods. - -========================== ========= -CostFunction Time (ns) -========================== ========= -Rat43Analytic 255 -Rat43AnalyticOptimized 92 -Rat43NumericDiffForward 262 -Rat43NumericDiffCentral 517 -Rat43NumericDiffRidders 3760 -Rat43AutomaticDiff 129 -========================== ========= - -We can get exact derivatives using automatic differentiation -(``Rat43AutomaticDiff``) with about the same effort that is required -to write the code for numeric differentiation but only :math:`40\%` -slower than hand optimized analytical derivatives. - -So how does it work? For this we will have to learn about **Dual -Numbers** and **Jets** . - - -Dual Numbers & Jets -------------------- - -.. NOTE:: - - Reading this and the next section on implementing Jets is not - necessary to use automatic differentiation in Ceres Solver. But - knowing the basics of how Jets work is useful when debugging and - reasoning about the performance of automatic differentiation. - -Dual numbers are an extension of the real numbers analogous to complex -numbers: whereas complex numbers augment the reals by introducing an -imaginary unit :math:`\iota` such that :math:`\iota^2 = -1`, dual -numbers introduce an *infinitesimal* unit :math:`\epsilon` such that -:math:`\epsilon^2 = 0` . A dual number :math:`a + v\epsilon` has two -components, the *real* component :math:`a` and the *infinitesimal* -component :math:`v`. - -Surprisingly, this simple change leads to a convenient method for -computing exact derivatives without needing to manipulate complicated -symbolic expressions. - -For example, consider the function - -.. math:: - - f(x) = x^2 , - -Then, - -.. math:: - - \begin{align} - f(10 + \epsilon) &= (10 + \epsilon)^2\\ - &= 100 + 20 \epsilon + \epsilon^2\\ - &= 100 + 20 \epsilon - \end{align} - -Observe that the coefficient of :math:`\epsilon` is :math:`Df(10) = -20`. Indeed this generalizes to functions which are not -polynomial. Consider an arbitrary differentiable function -:math:`f(x)`. Then we can evaluate :math:`f(x + \epsilon)` by -considering the Taylor expansion of :math:`f` near :math:`x`, which -gives us the infinite series - -.. math:: - \begin{align} - f(x + \epsilon) &= f(x) + Df(x) \epsilon + D^2f(x) - \frac{\epsilon^2}{2} + D^3f(x) \frac{\epsilon^3}{6} + \cdots\\ - f(x + \epsilon) &= f(x) + Df(x) \epsilon - \end{align} - -Here we are using the fact that :math:`\epsilon^2 = 0`. - -A `Jet <https://en.wikipedia.org/wiki/Jet_(mathematics)>`_ is a -:math:`n`-dimensional dual number, where we augment the real numbers -with :math:`n` infinitesimal units :math:`\epsilon_i,\ i=1,...,n` with -the property that :math:`\forall i, j\ \epsilon_i\epsilon_j = 0`. Then -a Jet consists of a *real* part :math:`a` and a :math:`n`-dimensional -*infinitesimal* part :math:`\mathbf{v}`, i.e., - -.. math:: - x = a + \sum_j v_{j} \epsilon_j - -The summation notation gets tedius, so we will also just write - -.. math:: - x = a + \mathbf{v}. - -where the :math:`\epsilon_i`'s are implict. Then, using the same -Taylor series expansion used above, we can see that: - -.. math:: - - f(a + \mathbf{v}) = f(a) + Df(a) \mathbf{v}. - -Similarly for a multivariate function -:math:`f:\mathbb{R}^{n}\rightarrow \mathbb{R}^m`, evaluated on -:math:`x_i = a_i + \mathbf{v}_i,\ \forall i = 1,...,n`: - -.. math:: - f(x_1,..., x_n) = f(a_1, ..., a_n) + \sum_i D_i f(a_1, ..., a_n) \mathbf{v}_i - -So if each :math:`\mathbf{v}_i = e_i` were the :math:`i^{\text{th}}` -standard basis vector, then, the above expression would simplify to - -.. math:: - f(x_1,..., x_n) = f(a_1, ..., a_n) + \sum_i D_i f(a_1, ..., a_n) \epsilon_i - -and we can extract the coordinates of the Jacobian by inspecting the -coefficients of :math:`\epsilon_i`. - -Implementing Jets -^^^^^^^^^^^^^^^^^ - -In order for the above to work in practice, we will need the ability -to evaluate arbitrary function :math:`f` not just on real numbers but -also on dual numbers, but one does not usually evaluate functions by -evaluating their Taylor expansions, - -This is where C++ templates and operator overloading comes into -play. The following code fragment has a simple implementation of a -``Jet`` and some operators/functions that operate on them. - -.. code-block:: c++ - - template<int N> struct Jet { - double a; - Eigen::Matrix<double, 1, N> v; - }; - - template<int N> Jet<N> operator+(const Jet<N>& f, const Jet<N>& g) { - return Jet<N>(f.a + g.a, f.v + g.v); - } - - template<int N> Jet<N> operator-(const Jet<N>& f, const Jet<N>& g) { - return Jet<N>(f.a - g.a, f.v - g.v); - } - - template<int N> Jet<N> operator*(const Jet<N>& f, const Jet<N>& g) { - return Jet<N>(f.a * g.a, f.a * g.v + f.v * g.a); - } - - template<int N> Jet<N> operator/(const Jet<N>& f, const Jet<N>& g) { - return Jet<N>(f.a / g.a, f.v / g.a - f.a * g.v / (g.a * g.a)); - } - - template <int N> Jet<N> exp(const Jet<N>& f) { - return Jet<T, N>(exp(f.a), exp(f.a) * f.v); - } - - // This is a simple implementation for illustration purposes, the - // actual implementation of pow requires careful handling of a number - // of corner cases. - template <int N> Jet<N> pow(const Jet<N>& f, const Jet<N>& g) { - return Jet<N>(pow(f.a, g.a), - g.a * pow(f.a, g.a - 1.0) * f.v + - pow(f.a, g.a) * log(f.a); * g.v); - } - - -With these overloaded functions in hand, we can now call -``Rat43CostFunctor`` with an array of Jets instead of doubles. Putting -that together with appropriately initialized Jets allows us to compute -the Jacobian as follows: - -.. code-block:: c++ - - class Rat43Automatic : public ceres::SizedCostFunction<1,4> { - public: - Rat43Automatic(const Rat43CostFunctor* functor) : functor_(functor) {} - virtual ~Rat43Automatic() {} - virtual bool Evaluate(double const* const* parameters, - double* residuals, - double** jacobians) const { - // Just evaluate the residuals if Jacobians are not required. - if (!jacobians) return (*functor_)(parameters[0], residuals); - - // Initialize the Jets - ceres::Jet<4> jets[4]; - for (int i = 0; i < 4; ++i) { - jets[i].a = parameters[0][i]; - jets[i].v.setZero(); - jets[i].v[i] = 1.0; - } - - ceres::Jet<4> result; - (*functor_)(jets, &result); - - // Copy the values out of the Jet. - residuals[0] = result.a; - for (int i = 0; i < 4; ++i) { - jacobians[0][i] = result.v[i]; - } - return true; - } - - private: - std::unique_ptr<const Rat43CostFunctor> functor_; - }; - -Indeed, this is essentially how :class:`AutoDiffCostFunction` works. - - -Pitfalls --------- - -Automatic differentiation frees the user from the burden of computing -and reasoning about the symbolic expressions for the Jacobians, but -this freedom comes at a cost. For example consider the following -simple functor: - -.. code-block:: c++ - - struct Functor { - template <typename T> bool operator()(const T* x, T* residual) const { - residual[0] = 1.0 - sqrt(x[0] * x[0] + x[1] * x[1]); - return true; - } - }; - -Looking at the code for the residual computation, one does not foresee -any problems. However, if we look at the analytical expressions for -the Jacobian: - -.. math:: - - y &= 1 - \sqrt{x_0^2 + x_1^2}\\ - D_1y &= -\frac{x_0}{\sqrt{x_0^2 + x_1^2}},\ - D_2y = -\frac{x_1}{\sqrt{x_0^2 + x_1^2}} - -we find that it is an indeterminate form at :math:`x_0 = 0, x_1 = -0`. - -There is no single solution to this problem. In some cases one needs -to reason explicitly about the points where indeterminacy may occur -and use alternate expressions using `L'Hopital's rule -<https://en.wikipedia.org/wiki/L'H%C3%B4pital's_rule>`_ (see for -example some of the conversion routines in `rotation.h -<https://github.com/ceres-solver/ceres-solver/blob/master/include/ceres/rotation.h>`_. In -other cases, one may need to regularize the expressions to eliminate -these points. - -.. rubric:: Footnotes - -.. [#f1] The notion of best fit depends on the choice of the objective - function used to measure the quality of fit, which in turn - depends on the underlying noise process which generated the - observations. Minimizing the sum of squared differences is - the right thing to do when the noise is `Gaussian - <https://en.wikipedia.org/wiki/Normal_distribution>`_. In - that case the optimal value of the parameters is the `Maximum - Likelihood Estimate - <https://en.wikipedia.org/wiki/Maximum_likelihood_estimation>`_. -.. [#f2] `Numerical Differentiation - <https://en.wikipedia.org/wiki/Numerical_differentiation#Practical_considerations_using_floating_point_arithmetic>`_ -.. [#f3] [Press]_ Numerical Recipes, Section 5.7 -.. [#f4] In asymptotic error analysis, an error of :math:`O(h^k)` - means that the absolute-value of the error is at most some - constant times :math:`h^k` when :math:`h` is close enough to - :math:`0`. - -TODO -==== - -#. Why does the quality of derivatives matter? -#. Discuss, forward v/s backward automatic differentiation and - relation to backprop, impact of large parameter block sizes on - differentiation performance. -#. Inverse function theorem -#. Calling iterative routines. -#. Reference to how numeric derivatives lead to slower convergence. -#. Pitfalls of Numeric differentiation. -#. Ill conditioning of numeric differentiation/dependence on curvature. + spivak_notation + analytical_derivatives + numerical_derivatives + automatic_derivatives + interfacing_with_autodiff
diff --git a/docs/source/interfacing_with_autodiff.rst b/docs/source/interfacing_with_autodiff.rst new file mode 100644 index 0000000..3a661ce --- /dev/null +++ b/docs/source/interfacing_with_autodiff.rst
@@ -0,0 +1,293 @@ +.. default-domain:: cpp + +.. cpp:namespace:: ceres + +.. _chapter-interfacing_with_automatic_differentiation: + +Interfacing with Automatic Differentiation +========================================== + +Automatic differentiation is straightforward to use in cases where an +explicit expression for the cost function is available. But this is +not always possible. Often one has to interface with external routines +or data. In this chapter we will consider a number of different ways +of doing so. + +To do this, we will consider the problem of finding parameters +:math:`\theta` and :math:`t` that solve an optimization problem of the +form: + +.. math:: + \min & \quad \sum_i \left \|y_i - f\left (\|q_{i}\|^2\right) q + \right \|^2\\ + \text{such that} & \quad q_i = R(\theta) x_i + t + +Here, :math:`R` is a two dimensional rotation matrix parameterized +using the angle :math:`\theta` and :math:`t` is a two dimensional +vector. :math:`f` is an external distortion function. + +We begin by considering the case, where we have a templated function +:code:`TemplatedComputeDistortion` that can compute the function +:math:`f`. Then the implementation of the corresponding residual +functor is straightforward and will look as follows: + +.. code-block:: c++ + :emphasize-lines: 21 + + template <typename T> T TemplatedComputeDistortion(const T r2) { + const double k1 = 0.0082; + const double k2 = 0.000023; + return 1.0 + k1 * y2 + k2 * r2 * r2; + } + + struct Affine2DWithDistortion { + Affine2DWithDistortion(const double x_in[2], const double y_in[2]) { + x[0] = x_in[0]; + x[1] = x_in[1]; + y[0] = y_in[0]; + y[1] = y_in[1]; + } + + template <typename T> + bool operator()(const T* theta, + const T* t, + T* residuals) const { + const T q_0 = cos(theta[0]) * x[0] - sin(theta[0]) * x[1] + t[0]; + const T q_1 = -sin(theta[0]) * x[0] + cos(theta[0]) * x[1] + t[1]; + const T f = TemplatedComputeDistortion(q_0 * q_0 + q_1 * q_1); + residuals[0] = y[0] - f * q_0; + residuals[1] = y[1] - f * q_1; + return true; + } + + double x[2]; + double y[2]; + }; + +So far so good, but let us now consider three ways of defining +:math:`f` which are not directly amenable to being used with automatic +differentiation: + +#. A non-templated function that evaluates its value. +#. A function that evaluates its value and derivative. +#. A function that is defined as a table of values to be interpolated. + +We will consider them in turn below. + +A function that returns its value +---------------------------------- + +Suppose we were given a function :code:`ComputeDistortionValue` with +the following signature + +.. code-block:: c++ + + double ComputeDistortionValue(double r2); + +that computes the value of :math:`f`. The actual implementation of the +function does not matter. Interfacing this function with +:code:`Affine2DWithDistortion` is a three step process: + +1. Wrap :code:`ComputeDistortionValue` into a functor + :code:`ComputeDistortionValueFunctor`. +2. Numerically differentiate :code:`ComputeDistortionValueFunctor` + using :class:`NumericDiffCostFunction` to create a + :class:`CostFunction`. +3. Wrap the resulting :class:`CostFunction` object using + :class:`CostFunctionToFunctor`. The resulting object is a functor + with a templated :code:`operator()` method, which pipes the + Jacobian computed by :class:`NumericDiffCostFunction` into the + approproate :code:`Jet` objects. + +An implementation of the above three steps looks as follows: + +.. code-block:: c++ + :emphasize-lines: 15,16,17,18,19,20, 29 + + struct ComputeDistortionValueFunctor { + bool operator()(const double* r2, double* value) const { + *value = ComputeDistortionValue(r2[0]); + return true; + } + }; + + struct Affine2DWithDistortion { + Affine2DWithDistortion(const double x_in[2], const double y_in[2]) { + x[0] = x_in[0]; + x[1] = x_in[1]; + y[0] = y_in[0]; + y[1] = y_in[1]; + + compute_distortion.reset(new ceres::CostFunctionToFunctor<1, 1>( + new ceres::NumericDiffCostFunction<ComputeDistortionValueFunctor, + ceres::CENTRAL, + 1, + 1>( + new ComputeDistortionValueFunctor))); + } + + template <typename T> + bool operator()(const T* theta, const T* t, T* residuals) const { + const T q_0 = cos(theta[0]) * x[0] - sin(theta[0]) * x[1] + t[0]; + const T q_1 = -sin(theta[0]) * x[0] + cos(theta[0]) * x[1] + t[1]; + const T r2 = q_0 * q_0 + q_1 * q_1; + T f; + (*compute_distortion)(&r2, &f); + residuals[0] = y[0] - f * q_0; + residuals[1] = y[1] - f * q_1; + return true; + } + + double x[2]; + double y[2]; + std::unique_ptr<ceres::CostFunctionToFunctor<1, 1> > compute_distortion; + }; + + +A function that returns its value and derivative +------------------------------------------------ + +Now suppose we are given a function :code:`ComputeDistortionValue` +thatis able to compute its value and optionally its Jacobian on demand +and has the following signature: + +.. code-block:: c++ + + void ComputeDistortionValueAndJacobian(double r2, + double* value, + double* jacobian); + +Again, the actual implementation of the function does not +matter. Interfacing this function with :code:`Affine2DWithDistortion` +is a two step process: + +1. Wrap :code:`ComputeDistortionValueAndJacobian` into a + :class:`CostFunction` object which we call + :code:`ComputeDistortionFunction`. +2. Wrap the resulting :class:`ComputeDistortionFunction` object using + :class:`CostFunctionToFunctor`. The resulting object is a functor + with a templated :code:`operator()` method, which pipes the + Jacobian computed by :class:`NumericDiffCostFunction` into the + approproate :code:`Jet` objects. + +The resulting code will look as follows: + +.. code-block:: c++ + :emphasize-lines: 21,22, 33 + + class ComputeDistortionFunction : public ceres::SizedCostFunction<1, 1> { + public: + virtual bool Evaluate(double const* const* parameters, + double* residuals, + double** jacobians) const { + if (!jacobians) { + ComputeDistortionValueAndJacobian(parameters[0][0], residuals, NULL); + } else { + ComputeDistortionValueAndJacobian(parameters[0][0], residuals, jacobians[0]); + } + return true; + } + }; + + struct Affine2DWithDistortion { + Affine2DWithDistortion(const double x_in[2], const double y_in[2]) { + x[0] = x_in[0]; + x[1] = x_in[1]; + y[0] = y_in[0]; + y[1] = y_in[1]; + compute_distortion.reset( + new ceres::CostFunctionToFunctor<1, 1>(new ComputeDistortionFunction)); + } + + template <typename T> + bool operator()(const T* theta, + const T* t, + T* residuals) const { + const T q_0 = cos(theta[0]) * x[0] - sin(theta[0]) * x[1] + t[0]; + const T q_1 = -sin(theta[0]) * x[0] + cos(theta[0]) * x[1] + t[1]; + const T r2 = q_0 * q_0 + q_1 * q_1; + T f; + (*compute_distortion)(&r2, &f); + residuals[0] = y[0] - f * q_0; + residuals[1] = y[1] - f * q_1; + return true; + } + + double x[2]; + double y[2]; + std::unique_ptr<ceres::CostFunctionToFunctor<1, 1> > compute_distortion; + }; + + +A function that is defined as a table of values +----------------------------------------------- + +The third and final case we will consider is where the function +:math:`f` is defined as a table of values on the interval :math:`[0, +100)`, with a value for each integer. + +.. code-block:: c++ + + vector<double> distortion_values; + +There are many ways of interpolating a table of values. Perhaps the +simplest and most common method is linear interpolation. But it is not +a great idea to use linear interpolation because the interpolating +function is not differentiable at the sample points. + +A simple (well behaved) differentiable interpolation is the `Cubic +Hermite Spline +<http://en.wikipedia.org/wiki/Cubic_Hermite_spline>`_. Ceres Solver +ships with routines to perform Cubic & Bi-Cubic interpolation that is +automatic differentiation friendly. + +Using Cubic interpolation requires first constructing a +:class:`Grid1D` object to wrap the table of values and then +constructing a :class:`CubicInterpolator` object using it. + +The resulting code will look as follows: + +.. code-block:: c++ + :emphasize-lines: 10,11,12,13, 24, 32,33 + + struct Affine2DWithDistortion { + Affine2DWithDistortion(const double x_in[2], + const double y_in[2], + const std::vector<double>& distortion_values) { + x[0] = x_in[0]; + x[1] = x_in[1]; + y[0] = y_in[0]; + y[1] = y_in[1]; + + grid.reset(new ceres::Grid1D<double, 1>( + &distortion_values[0], 0, distortion_values.size())); + compute_distortion.reset( + new ceres::CubicInterpolator<ceres::Grid1D<double, 1> >(*grid)); + } + + template <typename T> + bool operator()(const T* theta, + const T* t, + T* residuals) const { + const T q_0 = cos(theta[0]) * x[0] - sin(theta[0]) * x[1] + t[0]; + const T q_1 = -sin(theta[0]) * x[0] + cos(theta[0]) * x[1] + t[1]; + const T r2 = q_0 * q_0 + q_1 * q_1; + T f; + compute_distortion->Evaluate(r2, &f); + residuals[0] = y[0] - f * q_0; + residuals[1] = y[1] - f * q_1; + return true; + } + + double x[2]; + double y[2]; + std::unique_ptr<ceres::Grid1D<double, 1>> grid; + std::unique_ptr<ceres::CubicInterpolator<ceres::Grid1D<double, 1> >> compute_distortion; + }; + +In the above example we used :class:`Grid1D` and +:class:`CubicInterpolator` to interpolate a one dimensional table of +values. :class:`Grid2D` combined with :class:`CubicInterpolator` lets +the user to interpolate two dimensional tables of values. Note that +neither :class:`Grid1D` or :class:`Grid2D` are limited to scalar +valued functions, they also work with vector valued functions.
diff --git a/docs/source/numerical_derivatives.rst b/docs/source/numerical_derivatives.rst new file mode 100644 index 0000000..c52c039 --- /dev/null +++ b/docs/source/numerical_derivatives.rst
@@ -0,0 +1,403 @@ +.. default-domain:: cpp + +.. cpp:namespace:: ceres + +.. _chapter-numerical_derivatives: + +=================== +Numeric derivatives +=================== + +The other extreme from using analytic derivatives is to use numeric +derivatives. The key observation here is that the process of +differentiating a function :math:`f(x)` w.r.t :math:`x` can be written +as the limiting process: + +.. math:: + Df(x) = \lim_{h \rightarrow 0} \frac{f(x + h) - f(x)}{h} + + +Forward Differences +=================== + +Now of course one cannot perform the limiting operation numerically on +a computer so we do the next best thing, which is to choose a small +value of :math:`h` and approximate the derivative as + +.. math:: + Df(x) \approx \frac{f(x + h) - f(x)}{h} + + +The above formula is the simplest most basic form of numeric +differentiation. It is known as the *Forward Difference* formula. + +So how would one go about constructing a numerically differentiated +version of ``Rat43Analytic`` (`Rat43 +<http://www.itl.nist.gov/div898/strd/nls/data/ratkowsky3.shtml>`_) in +Ceres Solver. This is done in two steps: + + 1. Define *Functor* that given the parameter values will evaluate the + residual for a given :math:`(x,y)`. + 2. Construct a :class:`CostFunction` by using + :class:`NumericDiffCostFunction` to wrap an instance of + ``Rat43CostFunctor``. + +.. code-block:: c++ + + struct Rat43CostFunctor { + Rat43CostFunctor(const double x, const double y) : x_(x), y_(y) {} + + bool operator()(const double* parameters, double* residuals) const { + const double b1 = parameters[0]; + const double b2 = parameters[1]; + const double b3 = parameters[2]; + const double b4 = parameters[3]; + residuals[0] = b1 * pow(1.0 + exp(b2 - b3 * x_), -1.0 / b4) - y_; + return true; + } + + const double x_; + const double y_; + } + + CostFunction* cost_function = + new NumericDiffCostFunction<Rat43CostFunctor, FORWARD, 1, 4>( + new Rat43CostFunctor(x, y)); + +This is about the minimum amount of work one can expect to do to +define the cost function. The only thing that the user needs to do is +to make sure that the evaluation of the residual is implemented +correctly and efficiently. + +Before going further, it is instructive to get an estimate of the +error in the forward difference formula. We do this by considering the +`Taylor expansion <https://en.wikipedia.org/wiki/Taylor_series>`_ of +:math:`f` near :math:`x`. + +.. math:: + \begin{align} + f(x+h) &= f(x) + h Df(x) + \frac{h^2}{2!} D^2f(x) + + \frac{h^3}{3!}D^3f(x) + \cdots \\ + Df(x) &= \frac{f(x + h) - f(x)}{h} - \left [\frac{h}{2!}D^2f(x) + + \frac{h^2}{3!}D^3f(x) + \cdots \right]\\ + Df(x) &= \frac{f(x + h) - f(x)}{h} + O(h) + \end{align} + +i.e., the error in the forward difference formula is +:math:`O(h)` [#f4]_. + + +Implementation Details +---------------------- + +:class:`NumericDiffCostFunction` implements a generic algorithm to +numerically differentiate a given functor. While the actual +implementation of :class:`NumericDiffCostFunction` is complicated, the +net result is a :class:`CostFunction` that roughly looks something +like the following: + +.. code-block:: c++ + + class Rat43NumericDiffForward : public SizedCostFunction<1,4> { + public: + Rat43NumericDiffForward(const Rat43Functor* functor) : functor_(functor) {} + virtual ~Rat43NumericDiffForward() {} + virtual bool Evaluate(double const* const* parameters, + double* residuals, + double** jacobians) const { + functor_(parameters[0], residuals); + if (!jacobians) return true; + double* jacobian = jacobians[0]; + if (!jacobian) return true; + + const double f = residuals[0]; + double parameters_plus_h[4]; + for (int i = 0; i < 4; ++i) { + std::copy(parameters, parameters + 4, parameters_plus_h); + const double kRelativeStepSize = 1e-6; + const double h = std::abs(parameters[i]) * kRelativeStepSize; + parameters_plus_h[i] += h; + double f_plus; + functor_(parameters_plus_h, &f_plus); + jacobian[i] = (f_plus - f) / h; + } + return true; + } + + private: + scoped_ptr<Rat43Functor> functor_; + }; + + +Note the choice of step size :math:`h` in the above code, instead of +an absolute step size which is the same for all parameters, we use a +relative step size of :math:`\text{kRelativeStepSize} = 10^{-6}`. This +gives better derivative estimates than an absolute step size [#f2]_ +[#f3]_. This choice of step size only works for parameter values that +are not close to zero. So the actual implementation of +:class:`NumericDiffCostFunction`, uses a more complex step size +selection logic, where close to zero, it switches to a fixed step +size. + + +Central Differences +=================== + +:math:`O(h)` error in the Forward Difference formula is okay but not +great. A better method is to use the *Central Difference* formula: + +.. math:: + Df(x) \approx \frac{f(x + h) - f(x - h)}{2h} + +Notice that if the value of :math:`f(x)` is known, the Forward +Difference formula only requires one extra evaluation, but the Central +Difference formula requires two evaluations, making it twice as +expensive. So is the extra evaluation worth it? + +To answer this question, we again compute the error of approximation +in the central difference formula: + +.. math:: + \begin{align} + f(x + h) &= f(x) + h Df(x) + \frac{h^2}{2!} + D^2f(x) + \frac{h^3}{3!} D^3f(x) + \frac{h^4}{4!} D^4f(x) + \cdots\\ + f(x - h) &= f(x) - h Df(x) + \frac{h^2}{2!} + D^2f(x) - \frac{h^3}{3!} D^3f(c_2) + \frac{h^4}{4!} D^4f(x) + + \cdots\\ + Df(x) & = \frac{f(x + h) - f(x - h)}{2h} + \frac{h^2}{3!} + D^3f(x) + \frac{h^4}{5!} + D^5f(x) + \cdots \\ + Df(x) & = \frac{f(x + h) - f(x - h)}{2h} + O(h^2) + \end{align} + +The error of the Central Difference formula is :math:`O(h^2)`, i.e., +the error goes down quadratically whereas the error in the Forward +Difference formula only goes down linearly. + +Using central differences instead of forward differences in Ceres +Solver is a simple matter of changing a template argument to +:class:`NumericDiffCostFunction` as follows: + +.. code-block:: c++ + + CostFunction* cost_function = + new NumericDiffCostFunction<Rat43CostFunctor, CENTRAL, 1, 4>( + new Rat43CostFunctor(x, y)); + +But what do these differences in the error mean in practice? To see +this, consider the problem of evaluating the derivative of the +univariate function + +.. math:: + f(x) = \frac{e^x}{\sin x - x^2}, + +at :math:`x = 1.0`. + +It is easy to determine that :math:`Df(1.0) = +140.73773557129658`. Using this value as reference, we can now compute +the relative error in the forward and central difference formulae as a +function of the absolute step size and plot them. + +.. figure:: forward_central_error.png + :figwidth: 100% + :align: center + +Reading the graph from right to left, a number of things stand out in +the above graph: + + 1. The graph for both formulae have two distinct regions. At first, + starting from a large value of :math:`h` the error goes down as + the effect of truncating the Taylor series dominates, but as the + value of :math:`h` continues to decrease, the error starts + increasing again as roundoff error starts to dominate the + computation. So we cannot just keep on reducing the value of + :math:`h` to get better estimates of :math:`Df`. The fact that we + are using finite precision arithmetic becomes a limiting factor. + 2. Forward Difference formula is not a great method for evaluating + derivatives. Central Difference formula converges much more + quickly to a more accurate estimate of the derivative with + decreasing step size. So unless the evaluation of :math:`f(x)` is + so expensive that you absolutely cannot afford the extra + evaluation required by central differences, **do not use the + Forward Difference formula**. + 3. Neither formula works well for a poorly chosen value of :math:`h`. + + +Ridders' Method +=============== + +So, can we get better estimates of :math:`Df` without requiring such +small values of :math:`h` that we start hitting floating point +roundoff errors? + +One possible approach is to find a method whose error goes down faster +than :math:`O(h^2)`. This can be done by applying `Richardson +Extrapolation +<https://en.wikipedia.org/wiki/Richardson_extrapolation>`_ to the +problem of differentiation. This is also known as *Ridders' Method* +[Ridders]_. + +Let us recall, the error in the central differences formula. + +.. math:: + \begin{align} + Df(x) & = \frac{f(x + h) - f(x - h)}{2h} + \frac{h^2}{3!} + D^3f(x) + \frac{h^4}{5!} + D^5f(x) + \cdots\\ + & = \frac{f(x + h) - f(x - h)}{2h} + K_2 h^2 + K_4 h^4 + \cdots + \end{align} + +The key thing to note here is that the terms :math:`K_2, K_4, ...` +are indepdendent of :math:`h` and only depend on :math:`x`. + +Let us now define: + +.. math:: + + A(1, m) = \frac{f(x + h/2^{m-1}) - f(x - h/2^{m-1})}{2h/2^{m-1}}. + +Then observe that + +.. math:: + + Df(x) = A(1,1) + K_2 h^2 + K_4 h^4 + \cdots + +and + +.. math:: + + Df(x) = A(1, 2) + K_2 (h/2)^2 + K_4 (h/2)^4 + \cdots + +Here we have halved the step size to obtain a second central +differences estimate of :math:`Df(x)`. Combining these two estimates, +we get: + +.. math:: + + Df(x) = \frac{4 A(1, 2) - A(1,1)}{4 - 1} + O(h^4) + +which is an approximation of :math:`Df(x)` with truncation error that +goes down as :math:`O(h^4)`. But we do not have to stop here. We can +iterate this process to obtain even more accurate estimates as +follows: + +.. math:: + + A(n, m) = \begin{cases} + \frac{\displaystyle f(x + h/2^{m-1}) - f(x - + h/2^{m-1})}{\displaystyle 2h/2^{m-1}} & n = 1 \\ + \frac{\displaystyle 4^{n-1} A(n - 1, m + 1) - A(n - 1, m)}{\displaystyle 4^{n-1} - 1} & n > 1 + \end{cases} + +It is straightforward to show that the approximation error in +:math:`A(n, 1)` is :math:`O(h^{2n})`. To see how the above formula can +be implemented in practice to compute :math:`A(n,1)` it is helpful to +structure the computation as the following tableau: + +.. math:: + \begin{array}{ccccc} + A(1,1) & A(1, 2) & A(1, 3) & A(1, 4) & \cdots\\ + & A(2, 1) & A(2, 2) & A(2, 3) & \cdots\\ + & & A(3, 1) & A(3, 2) & \cdots\\ + & & & A(4, 1) & \cdots \\ + & & & & \ddots + \end{array} + +So, to compute :math:`A(n, 1)` for increasing values of :math:`n` we +move from the left to the right, computing one column at a +time. Assuming that the primary cost here is the evaluation of the +function :math:`f(x)`, the cost of computing a new column of the above +tableau is two function evaluations. Since the cost of evaluating +:math:`A(1, n)`, requires evaluating the central difference formula +for step size of :math:`2^{1-n}h` + +Applying this method to :math:`f(x) = \frac{e^x}{\sin x - x^2}` +starting with a fairly large step size :math:`h = 0.01`, we get: + +.. math:: + \begin{array}{rrrrr} + 141.678097131 &140.971663667 &140.796145400 &140.752333523 &140.741384778\\ + &140.736185846 &140.737639311 &140.737729564 &140.737735196\\ + & &140.737736209 &140.737735581 &140.737735571\\ + & & &140.737735571 &140.737735571\\ + & & & &140.737735571\\ + \end{array} + +Compared to the *correct* value :math:`Df(1.0) = 140.73773557129658`, +:math:`A(5, 1)` has a relative error of :math:`10^{-13}`. For +comparison, the relative error for the central difference formula with +the same stepsize (:math:`0.01/2^4 = 0.000625`) is :math:`10^{-5}`. + +The above tableau is the basis of Ridders' method for numeric +differentiation. The full implementation is an adaptive scheme that +tracks its own estimation error and stops automatically when the +desired precision is reached. Of course it is more expensive than the +forward and central difference formulae, but is also significantly +more robust and accurate. + +Using Ridder's method instead of forward or central differences in +Ceres is again a simple matter of changing a template argument to +:class:`NumericDiffCostFunction` as follows: + +.. code-block:: c++ + + CostFunction* cost_function = + new NumericDiffCostFunction<Rat43CostFunctor, RIDDERS, 1, 4>( + new Rat43CostFunctor(x, y)); + +The following graph shows the relative error of the three methods as a +function of the absolute step size. For Ridders's method we assume +that the step size for evaluating :math:`A(n,1)` is :math:`2^{1-n}h`. + +.. figure:: forward_central_ridders_error.png + :figwidth: 100% + :align: center + +Using the 10 function evaluations that are needed to compute +:math:`A(5,1)` we are able to approximate :math:`Df(1.0)` about a 1000 +times better than the best central differences estimate. To put these +numbers in perspective, machine epsilon for double precision +arithmetic is :math:`\approx 2.22 \times 10^{-16}`. + +Going back to ``Rat43``, let us also look at the runtime cost of the +various methods for computing numeric derivatives. + +========================== ========= +CostFunction Time (ns) +========================== ========= +Rat43Analytic 255 +Rat43AnalyticOptimized 92 +Rat43NumericDiffForward 262 +Rat43NumericDiffCentral 517 +Rat43NumericDiffRidders 3760 +========================== ========= + +As expected, Central Differences is about twice as expensive as +Forward Differences and the remarkable accuracy improvements of +Ridders' method cost an order of magnitude more runtime. + +Recommendations +=============== + +Numeric differentiation should be used when you cannot compute the +derivatives either analytically or using automatic differention. This +is usually the case when you are calling an external library or +function whose analytic form you do not know or even if you do, you +are not in a position to re-write it in a manner required to use +:ref:`chapter-automatic_derivatives`. + + +When using numeric differentiation, use at least Central Differences, +and if execution time is not a concern or the objective function is +such that determining a good static relative step size is hard, +Ridders' method is recommended. + +.. rubric:: Footnotes + +.. [#f2] `Numerical Differentiation + <https://en.wikipedia.org/wiki/Numerical_differentiation#Practical_considerations_using_floating_point_arithmetic>`_ +.. [#f3] [Press]_ Numerical Recipes, Section 5.7 +.. [#f4] In asymptotic error analysis, an error of :math:`O(h^k)` + means that the absolute-value of the error is at most some + constant times :math:`h^k` when :math:`h` is close enough to + :math:`0`.
diff --git a/docs/source/spivak_notation.rst b/docs/source/spivak_notation.rst new file mode 100644 index 0000000..3ac56ba --- /dev/null +++ b/docs/source/spivak_notation.rst
@@ -0,0 +1,53 @@ +.. default-domain:: cpp + +.. cpp:namespace:: ceres + +.. _chapter-spivak_notation: + +=============== +Spivak Notation +=============== + +To preserve our collective sanities, we will use Spivak's notation for +derivatives. It is a functional notation that makes reading and +reasoning about expressions involving derivatives simple. + +For a univariate function :math:`f`, :math:`f(a)` denotes its value at +:math:`a`. :math:`Df` denotes its first derivative, and +:math:`Df(a)` is the derivative evaluated at :math:`a`, i.e + +.. math:: + Df(a) = \left . \frac{d}{dx} f(x) \right |_{x = a} + +:math:`D^kf` denotes the :math:`k^{\text{th}}` derivative of :math:`f`. + +For a bi-variate function :math:`g(x,y)`. :math:`D_1g` and +:math:`D_2g` denote the partial derivatives of :math:`g` w.r.t the +first and second variable respectively. In the classical notation this +is equivalent to saying: + +.. math:: + + D_1 g = \frac{\partial}{\partial x}g(x,y) \text{ and } D_2 g = \frac{\partial}{\partial y}g(x,y). + + +:math:`Dg` denotes the Jacobian of `g`, i.e., + +.. math:: + + Dg = \begin{bmatrix} D_1g & D_2g \end{bmatrix} + +More generally for a multivariate function :math:`g:\mathbb{R}^n +\longrightarrow \mathbb{R}^m`, :math:`Dg` denotes the :math:`m\times +n` Jacobian matrix. :math:`D_i g` is the partial derivative of +:math:`g` w.r.t the :math:`i^{\text{th}}` coordinate and the +:math:`i^{\text{th}}` column of :math:`Dg`. + +Finally, :math:`D^2_1g` and :math:`D_1D_2g` have the obvious meaning +as higher order partial derivatives. + +For more see Michael Spivak's book `Calculus on Manifolds +<https://www.amazon.com/Calculus-Manifolds-Approach-Classical-Theorems/dp/0805390219>`_ +or a brief discussion of the `merits of this notation +<http://www.vendian.org/mncharity/dir3/dxdoc/>`_ by +Mitchell N. Charity.