Improve threading in covariance.

Covariance computation needs to iterate over a triangle, but as a
single loop so that it can be parallelized. Right now it iterates over
the full square and skips roughly half of the iterations, which wastes
work and gives poor worst-case load balance across threads. This adds a
utility that allows linear iteration over a triangle with no wasted
iterations.
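
As a rough illustration of the idea (not necessarily the code in this
change), a linear index k can be mapped to a position (i, j) in the
upper triangle. The name LinearToUpperTriangular, the row-major
ordering, and the inclusion of the diagonal are all assumptions made
for this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>

// Linear index of the first element of row i in an n x n upper triangle
// (diagonal included) stored row by row. Row i holds n - i elements.
inline std::int64_t RowStart(std::int64_t i, std::int64_t n) {
  return i * (2 * n - i + 1) / 2;
}

// Hypothetical helper: maps k in [0, n * (n + 1) / 2) to (i, j), i <= j.
inline std::pair<int, int> LinearToUpperTriangular(std::int64_t k, int n) {
  assert(n > 0 && k >= 0 && k < RowStart(n, n));
  // Closed-form estimate of the row from the quadratic RowStart(i) <= k.
  const double b = 2.0 * n + 1.0;
  std::int64_t i =
      static_cast<std::int64_t>((b - std::sqrt(b * b - 8.0 * k)) / 2.0);
  // Correct any off-by-one caused by floating-point rounding.
  while (i > 0 && RowStart(i, n) > k) --i;
  while (RowStart(i + 1, n) <= k) ++i;
  const std::int64_t j = i + (k - RowStart(i, n));
  return {static_cast<int>(i), static_cast<int>(j)};
}
```

With a mapping like this, a single loop over k from 0 to
n * (n + 1) / 2 - 1 visits each triangle entry exactly once, so the
range can be split evenly across threads.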

Change-Id: I881d5683c65882f87dc2b5f8449a855d22ace755
diff --git a/BUILD b/BUILD
index 25949c1..4018a68 100644
--- a/BUILD
+++ b/BUILD
@@ -123,6 +123,7 @@
     "numeric_diff_cost_function",
     "ordered_groups",
     "parallel_for",
+    "parallel_utils",
     "parameter_block_ordering",
     "parameter_block",
     "partitioned_matrix_view",