Add a section on implicit and inverse function theorems

Change-Id: I0e6c7d2850a33d03aa629579f049ad44a7618621
diff --git a/docs/source/derivatives.rst b/docs/source/derivatives.rst
index bff6a29..d9a52b0 100644
--- a/docs/source/derivatives.rst
+++ b/docs/source/derivatives.rst
@@ -58,3 +58,4 @@
    numerical_derivatives
    automatic_derivatives
    interfacing_with_autodiff
+   inverse_and_implicit_function_theorems
diff --git a/docs/source/inverse_and_implicit_function_theorems.rst b/docs/source/inverse_and_implicit_function_theorems.rst
new file mode 100644
index 0000000..7d8f7fa
--- /dev/null
+++ b/docs/source/inverse_and_implicit_function_theorems.rst
@@ -0,0 +1,214 @@
+.. default-domain:: cpp
+
+.. cpp:namespace:: ceres
+
+.. _chapter-inverse_function_theorem:
+
+==========================================
+Using Inverse & Implicit Function Theorems
+==========================================
+
+Until now we have considered methods for computing derivatives that
+work directly on the function being differentiated. However, this is
+not always possible. For example, if the function can only be computed
+via an iterative algorithm, or there is no explicit definition of the
+function available.  In this section we will see how we can use two
+basic results from calculus to get around these difficulties.
+
+
+Inverse Function Theorem
+========================
+
+Suppose we wish to evaluate the derivative of a function :math:`f(x)`,
+but evaluating :math:`f(x)` is not easy. Say it involves running an
+iterative algorithm. You could try automatically differentiating the
+iterative algorithm, but even if that is possible, it can become quite
+expensive.
+
+In some cases we get lucky, and computing the inverse of :math:`f(x)`
+is an easy operation. In these cases, we can use the `Inverse Function
+Theorem <http://en.wikipedia.org/wiki/Inverse_function_theorem>`_ to
+compute the derivative exactly. Here is the key idea:
+
+Assuming that :math:`y=f(x)` is continuously differentiable in a
+neighborhood of a point :math:`x` and :math:`Df(x)` is the invertible
+Jacobian of :math:`f` at :math:`x`, then by applying the chain rule to
+the identity :math:`f^{-1}(f(x)) = x`, we have
+:math:`Df^{-1}(f(x))Df(x) = I`, or :math:`Df^{-1}(y) = (Df(x))^{-1}`,
+i.e., the Jacobian of :math:`f^{-1}` is the inverse of the Jacobian of
+:math:`f`, or :math:`Df(x) = (Df^{-1}(y))^{-1}`.
+
+For example, let :math:`f(x) = e^x`. Now of course we know that
+:math:`Df(x) = e^x`, but let's try and compute it via the Inverse
+Function Theorem. For :math:`x > 0`, we have :math:`f^{-1}(y) = \log
+y`, so :math:`Df^{-1}(y) = \frac{1}{y}`, so :math:`Df(x) =
+(Df^{-1}(y))^{-1} = y = e^x`.
+
+You maybe wondering why the above is true. A smoothly differentiable
+function in a small neighborhood is well approximated by a linear
+function. Indeed this is a good way to think about the Jacobian, it is
+the matrix that best approximates the function linearly. Once you do
+that, it is straightforward to see that *locally* :math:`f^{-1}(y)` is
+best approximated linearly by the inverse of the Jacobian of
+:math:`f(x)`.
+
+Let us now consider a more practical example.
+
+Geodetic Coordinate System Conversion
+-------------------------------------
+
+When working with data related to the Earth, one can use two different
+coordinate systems. The familiar (latitude, longitude, height)
+Latitude-Longitude-Altitude coordinate system or the `ECEF
+<http://en.wikipedia.org/wiki/ECEF>`_ coordinate systems. The former
+is familiar but is not terribly convenient analytically. The latter is
+a Cartesian system but not particularly intuitive. So systems that
+process earth related data have to go back and forth between these
+coordinate systems.
+
+The conversion between the LLA and the ECEF coordinate system requires
+a model of the Earth, the most commonly used one being `WGS84
+<https://en.wikipedia.org/wiki/World_Geodetic_System#1984_version>`_.
+
+Going from the spherical :math:`(\phi,\lambda,h)` to the ECEF
+:math:`(x,y,z)` coordinates is easy.
+
+.. math::
+
+   \chi &= \sqrt{1 - e^2 \sin^2 \phi}
+
+   X &= \left( \frac{a}{\chi} + h \right) \cos \phi \cos \lambda
+
+   Y &= \left( \frac{a}{\chi} + h \right) \cos \phi \sin \lambda
+
+   Z &= \left(\frac{a(1-e^2)}{\chi}  +h \right) \sin \phi
+
+Here :math:`a` and :math:`e^2` are constants defined by `WGS84
+<https://en.wikipedia.org/wiki/World_Geodetic_System#1984_version>`_.
+
+Going from ECEF to LLA coordinates requires an iterative algorithm. So
+to compute the derivative of the this transformation we invoke the
+Inverse Function Theorem as follows:
+
+.. code-block:: c++
+
+   Eigen::Vector3d ecef; // Fill some values
+   // Iterative computation.
+   Eigen::Vector3d lla = ECEFToLLA(ecef);
+   // Analytic derivatives
+   Eigen::Matrix3d lla_to_ecef_jacobian = LLAToECEFJacobian(lla);
+   bool invertible;
+   Eigen::Matrix3d ecef_to_lla_jacobian;
+   lla_to_ecef_jacobian.computeInverseWithCheck(ecef_to_lla_jacobian, invertible);
+
+
+Implicit Function Theorem
+=========================
+
+Consider now the problem where we have two variables :math:`x \in
+\mathbb{R}^m` and :math:`y \in \mathbb{R}^n` and a function
+:math:`F:\mathbb{R}^m \times \mathbb{R}^n \rightarrow \mathbb{R}^n`
+such that :math:`F(x,y) = 0` and we wish to calculate the Jacobian of
+:math:`y` with respect to `x`. How do we do this?
+
+If for a given value of :math:`(x,y)`, the partial Jacobian
+:math:`D_2F(x,y)` is full rank, then the `Implicit Function Theorem
+<https://en.wikipedia.org/wiki/Implicit_function_theorem>`_ tells us
+that there exists a neighborhood of :math:`x` and a function :math:`G`
+such :math:`y = G(x)` in this neighborhood. Differentiating
+:math:`F(x,G(x)) = 0` gives us
+
+.. math::
+
+   D_1F(x,y) + D_2F(x,y)DG(x) &= 0
+
+                        DG(x) &= -(D_2F(x,y))^{-1} D_1 F(x,y)
+
+                        D y(x) &= -(D_2F(x,y))^{-1} D_1 F(x,y)
+
+This means that we can compute the derivative of :math:`y` with
+respect to :math:`x` by multiplying the Jacobian of :math:`F` w.r.t
+:math:`x` by the inverse of the Jacobian of :math:`F` w.r.t :math:`y`.
+
+Let's consider two examples.
+
+Roots of a Polynomial
+---------------------
+
+The first example we consider is a classic. Let :math:`p(x) = a_0 +
+a_1 x + \dots + a_n x^n` be a degree :math:`n` polynomial, and we wish
+to compute the derivative of its roots with respect to its
+coefficients. There is no closed form formula for computing the roots
+of a general degree :math:`n` polynomial. `Galois
+<https://en.wikipedia.org/wiki/%C3%89variste_Galois>`_ and `Abel
+<https://en.wikipedia.org/wiki/Niels_Henrik_Abel>`_ proved that. There
+are numerical algorithms like computing the eigenvalues of the
+`Companion Matrix
+<https://nhigham.com/2021/03/23/what-is-a-companion-matrix/>`_, but
+differentiating an eigenvalue solver does not seem like fun. But the
+Implicit Function Theorem offers us a simple path.
+
+If :math:`x` is a root of :math:`p(x)`, then :math:`F(\mathbf{a}, x) =
+a_0 + a_1 x + \dots + a_n x^n = 0`. So,
+
+.. math::
+
+   D_1 F(\mathbf{a}, x) &= [1, x, x^2, \dots, x^n]
+
+   D_2 F(\mathbf{a}, x) &= \sum_{k=1}^n k a_k x^{k-1} = Dp(x)
+
+        Dx(a) &= \frac{-1}{Dp(x)} [1, x, x^2, \dots, x^n]
+
+Differentiating the Solution to an Optimization Problem
+-------------------------------------------------------
+
+Sometimes we are required to solve optimization problems inside
+optimization problems, and this requires computing the derivative of
+the optimal solution (or a fixed point) of an optimization problem
+w.r.t its parameters.
+
+Let :math:`\theta \in \mathbb{R}^m` be a vector, :math:`A(\theta) \in
+\mathbb{R}^{k\times n}` be a matrix whose entries are a function of
+:math:`\theta` with :math:`k \ge n` and let :math:`b \in \mathbb{R}^k`
+be a constant vector, then consider the linear least squares problem:
+
+.. math::
+
+   x^* = \arg \min_x \|A(\theta) x - b\|_2^2
+
+How do we compute :math:`D_\theta x^*(\theta)`?
+
+One approach would be to observe that :math:`x^*(\theta) =
+(A^\top(\theta)A(\theta))^{-1}A^\top(\theta)b` and then differentiate
+this w.r.t :math:`\theta`. But this would require differentiating
+through the inverse of the matrix
+:math:`(A^\top(\theta)A(\theta))^{-1}`. Not exactly easy. Let's use
+the Implicit Function Theorem instead.
+
+The first step is to observe that :math:`x^*` satisfies the so called
+*normal equations*.
+
+.. math::
+
+   A^\top(\theta)A(\theta)x^* - A^\top(\theta)b = 0
+
+We will compute :math:`D_\theta x^*` column-wise, treating
+:math:`A(\theta)` as a function of one coordinate (:math:`\theta_i`)
+of :math:`\theta` at a time. So using the normal equations, let's
+define :math:`F(\theta_i, x^*) = A^\top(\theta_i)A(\theta_i)x^* -
+A^\top(\theta_i)b = 0`. Using which can now compute:
+
+.. math::
+
+   D_1F(\theta_i, x^*) &= D_{\theta_i}A^\top A + A^\top
+   D_{\theta_i}Ax^* - D_{\theta_i} A^\top b = g_i
+
+   D_2F(\theta_i, x^*) &= A^\top A
+
+   Dx^*(\theta_i) & = -(A^\top A)^{-1} g_i
+
+   Dx^*(\theta) & = -(A^\top A )^{-1} \left[g_1, \dots, g_m\right]
+
+Observe that we only need to compute the inverse of :math:`A^\top A`,
+to compute :math:`D x^*(\theta)`, which we needed anyways to compute
+:math:`x^*`.
diff --git a/docs/source/modeling_faqs.rst b/docs/source/modeling_faqs.rst
index a0c8f2f..a2704d8 100644
--- a/docs/source/modeling_faqs.rst
+++ b/docs/source/modeling_faqs.rst
@@ -86,49 +86,3 @@
 #. How do I set one or more components of a parameter block constant?
 
    Using :class:`SubsetParameterization`.
-
-#. Putting `Inverse Function Theorem
-   <http://en.wikipedia.org/wiki/Inverse_function_theorem>`_ to use.
-
-   Every now and then we have to deal with functions which cannot be
-   evaluated analytically. Computing the Jacobian in such cases is
-   tricky. A particularly interesting case is where the inverse of the
-   function is easy to compute analytically. An example of such a
-   function is the Coordinate transformation between the `ECEF
-   <http://en.wikipedia.org/wiki/ECEF>`_ and the `WGS84
-   <http://en.wikipedia.org/wiki/World_Geodetic_System>`_ where the
-   conversion from WGS84 to ECEF is analytic, but the conversion
-   back to WGS84 uses an iterative algorithm. So how do you compute the
-   derivative of the ECEF to WGS84 transformation?
-
-   One obvious approach would be to numerically
-   differentiate the conversion function. This is not a good idea. For
-   one, it will be slow, but it will also be numerically quite
-   bad.
-
-   Turns out you can use the `Inverse Function Theorem
-   <http://en.wikipedia.org/wiki/Inverse_function_theorem>`_ in this
-   case to compute the derivatives more or less analytically.
-
-   The key result here is. If :math:`x = f^{-1}(y)`, and :math:`Df(x)`
-   is the invertible Jacobian of :math:`f` at :math:`x`. Then the
-   Jacobian :math:`Df^{-1}(y) = [Df(x)]^{-1}`, i.e., the Jacobian of
-   the :math:`f^{-1}` is the inverse of the Jacobian of :math:`f`.
-
-   Algorithmically this means that given :math:`y`, compute :math:`x =
-   f^{-1}(y)` by whatever means you can. Evaluate the Jacobian of
-   :math:`f` at :math:`x`. If the Jacobian matrix is invertible, then
-   its inverse is the Jacobian of :math:`f^{-1}(y)` at  :math:`y`.
-
-   One can put this into practice with the following code fragment.
-
-   .. code-block:: c++
-
-      Eigen::Vector3d ecef; // Fill some values
-      // Iterative computation.
-      Eigen::Vector3d lla = ECEFToLLA(ecef);
-      // Analytic derivatives
-      Eigen::Matrix3d lla_to_ecef_jacobian = LLAToECEFJacobian(lla);
-      bool invertible;
-      Eigen::Matrix3d ecef_to_lla_jacobian;
-      lla_to_ecef_jacobian.computeInverseWithCheck(ecef_to_lla_jacobian, invertible);