Sunday, January 31, 2016

On calling Ceres from Scala - 5.0: Autodiffin'it with spire

This is the sixth installment of a series documenting my progress on porting the Ceres solver API to Scala. All my sources quoted below are available on github.

Autodiffin'it with spire


One of the most useful features of the Ceres solver is its support for the automatic differentiation (in forward-mode) of vector cost functions of multiple vector arguments. This is entirely implemented caller-side with respect to the optimizer, through the use of a pair of templated adaptors (high and low-level ones). They work together to "convert" a Ceres cost function into a templated functor, written by the user, that the can be called with one of two different types of arguments: ordinary double-valued parameter blocks for just computing the values of the residuals, and parameter blocks whose elements are Jet-valued for computing both the residuals and their partial derivatives with respect to the parameters. 

So, the optimizer only "sees" an ordinary cost function, but:
  • When it calls it with a NULL jacobian matrix, the adapter chooses the double-only specialization of the user-provided cost functor, and returns the straight residuals to the optimizer. 
  • Otherwise, it packs the parameter values into Jet's, calls the Jet-typed specialization of the functor, and then unpacks the result into residuals and jacobians.

Unfortunately none of these caller-side goodies can directly be used in our SWIG-based port, since SWIG cannot specialize the C++ templates for as-yet undefined implementations of the cost functors, to be provided by the users (or, to repeat the usual mantra, Java generics are not the same as C++ templates). Sad!

However, we can take advantage of Ceres's clean separation of the autodiff implementation from the core optimizer code. Autodiff is math, and there are plenty of excellent numerical math libraries for the JVM. A particularly good one in the Scala world is spire, whose authors have gone through extreme lengths toward providing powerful type-safe mathematical abstractions while sacrificing little or nothing in the way of execution performance. This article and this YouTube video provide good introductions to the library.

Some time ago, in preparation for this port, I contributed to spire an implementation of the Jet type that seamlessly integrates with the rest of spire's algebraic types as well as its generic implementations of the standard math libraries. Now we get to use it. 

The basic idea is to replicate in a bit of Scala code the with/without jacobian code switch described above. Let's first define a CostFunctor base class, parametrized on the argument type, which the user can later override with their particular implementation of a residual.


This defines a function representing a cost term of dimension "kNumResiduals", dependent on a variadic number of parameter blocks, each of size N(0), N(1), etc., so that the parameter space is composed of N.length parameter blocks. The parameter blocks themselves are passed to the cost evaluation method as Array's parametrized on a type T which, through the given typeclasses, allow for algebraic calculations on their components.

The conversion method "toAutodiffCostFunction" produces the actual auto-differentiable cost function. It is defined as follows:


There is quite a bit of code there, but it should be fairly easy to parse.

First I define, following Ceres, a SizedCostFunction that makes explicit the input and output dimensions of a CostFunction, then I extend it into an AutodiffCostFunction, which is constructed from a given CostFunctor (hence with known such dimensions).

The evaluate method of AutodiffCostFunction implements the CostFunction interface. Note that it consists of two main branches, respectively for the case of pure cost evaluation (no jacobians), and cost with partial derivatives. In the second case I "convert" the input parameter blocks into Jet's set up for computing the associated derivatives, and then map back the functor's output to residuals and jacobians.

A bit of extra complication, present in the Ceres C++ code as well, is due to the fact that we want to differentiate a function of several separate blocks, whereas the Jet framework is designed for computing partial derivatives with respect to variables represented in a single container. Therefore we must "unroll" the blocks prior to calling the cost functor, and manage the variable indices accordingly.

Testing it


We can test all the above by re-implementing in Scala the Powell function example from the Ceres distribution:


Running it we get the same results as the C++ version:

As before, the timing does not reflect JIT optimizations.

Next steps


Getting auto-differentiation was the last nontrivial step of this port. From now on it's just "plumbing". My next post will point to a first version of the entire thing.

Monday, January 18, 2016

On calling Ceres from Scala - 4.2: Hello World, a brief note on speed

This is the sixth installment of a series documenting my progress on porting the Ceres solver API to Scala. All my sources quoted below are available on github.

Hello World, A Brief Note on Speed


We all know how, in computer science, speed is often reported in terms of lies, big lies and benchmarks. However, silly benchmarks are fun, so let's write one.

On my 2.3 GHz i7 mac, the ceres-reported timings for a run of the native C++ implementation of the "helloworld" example are:


My Scala implementation based on the SWIG-generated wrap (HelloWorld.scala) instead clocks at:

This is not an apple-to-apple comparison, as the C++ version's cost function uses automatic differentiation, but the AD overhead should be in the noise.

So Scala is about 3x slower in terms of iteration time, and about 53x slower in total time, and that's OMG horrible right?

Well, not quite. The native version is statically optimized ("c++ -O3") in its entirety, whereas the port side calls JVM routines at every cost function evaluation. So, to be fair, we ought to give the JIT a chance to work its magic.

Running the optimization problem Scala-side 10 times in a row in the same process, but reinitializing all data structures at every run, should give us a better idea of the JIT-induced speedups.


Note the rather dramatic speedup after the first iteration, and the eventual convergence to essentially the same runtime as the native implementation.

This is, of course, just a toy benchmark, but the short answer to "Yo, how much slower is this port going to be compared to native?" should be (a) there is no short answer and (b) tell me more about your use case.

Sunday, January 17, 2016

On calling Ceres from Scala - 4.1: Hello World explained

This is the fifth installment of a series documenting my progress on porting the Ceres solver API to Scala. All my sources quoted below are available on github.

Hello World Explained


In this post I'll go through the nontrivial parts of my first SWIG wrap of the ceres-solver library. The code is in the sandbox section of the skeres github repo, and includes a port to pure Scala of Ceres's helloworld.cc example.

Build

The library and example are built using a simple Bourne shell script (we'll tackle sbt configuration later). The ceres-library headers of interest are swigged into a com.google.ceres Java package, and the JNI auto-generated .cxx source file is produced in the toplevel directory, as is the shared library (.dylib file in Mac OSX) they are eventually linked into. The ported example itself belongs to a separate org.somelightprojections.skeres package.

Code

The example's code closely follows the C++ original, with one basic difference: the cost function derivative is computed analytically, rather than through automatic differentiation, as the latter will be the subject of the next step in this port. Therefore the residual term code looks like this:

As you can see, it follows closely the C++ implementation (i.e. what the C++ codes would look like if it didn't use autodiff): it describes the size of its parameter and residual blocks, and then overrides the Evaluate method to actually compute the cost. Note the use of camelCase-style identifiers, in keeping with the usual Scala style conventions.

Probably the only part deserving an explanation is the manipulation of the residual array and of the parameter and jacobian matrices. This takes advantage of a very simple C++ helper declared inline in the ceres.i SWIG configuration file:

Remember the discussion in a previous post about the use of the carrays.i SWIG library. The %array_class stanza declares DoubleArray to be a class wrapping an array of double's. The inline DoubleMatrix helper exposes a null test and access to the matrix by rows. Thanks to the magic of implicit conversions, this is all we need to make these low-level classes behave in a more idiomatic Scala way.

First, for one dimensional arrays of doubles we make use of the enrichment pattern to add the usual get/set accessors, as well as conversions of DoubleArray objects to pointers and Scala arrays:


We use a similar enrichment for matrices:

Note the call to the (swig-wrapped) DoubleMatrix helper to access a matrix row. Note also that these calls involve only tiny JVM object creations, with no copies of the data arrays themselves.

Finally, we define implicit conversions to instantiate these enrichments from the swig-wrapped low-level pointers:

Odds and ends

The ceres.i configuration file only includes as many ceres headers as are needed to compile the example. In addition, however, it also exports all the ceres-predefined "loss function" classes that are used to robustify the residuals. I mention this here to highlight another feature provided by SWIG, namely ownership transfer (i.e. memory leak avoidance). The relevant stanza looks like this:

The inline-declared struct allows us to create loss function instances from Scala using calls like PredefinedLossFunctions.huberLoss(2.0).  The %newobject declarations cause the destructors for the newly created-and-wrapped C++ objects to be called whenever the wrapping Java object is garbage-collected.

A similar issue is addressed in the HelloWorld.scala source where the Problem object is declared:


Here we use the (wrapped) ProblemOptions to instruct Ceres not to take ownership of the cost and loss function objects. This is necessary because they are JVM-owned objects, and the default behavior in which the ceres Problem instance takes ownership and deletes them at in its destructor would likely cause a double-free heap corruption.

Saturday, January 9, 2016

On calling Ceres from Scala - 4.0: Hello World

This is the fourth installment of a series documenting my progress on porting the Ceres solver API to Scala. All my sources quoted below are available on github.


Hello World


We are now ready for our first-cut SWIG wrap of the ceres-solver library. The code is in the sandbox section of the skeres github repo, and includes a port to pure Scala of Ceres's helloworld.cc example.

Take a look at the sources, if you wish. I am pressed for time tonight, so will comment on them in a subsequent post. However, I want to post the output of a run just to show it's real:

On calling Ceres from Scala - 3.0: Who owns what?

This is the third installment of a series documenting my progress on porting the Ceres solver API to Scala. All my sources quoted below are available on github.


Who owns what?


The pesky issue of memory management rears its head whenever code must cross the boundary between JVM  and native code. Items to be dealt with include:
  • Are copies of the same data kept on both sides? If yes, are we wasting valuable RAM (thus handicapping our ability to scale to large problems)? If no, are we wasting computing time making repeated expensive calls across the JNI boundary?
  • If memory is heap-allocated on the native side to hold the data, is it properly reclaimed to avoid leaks - including in the case of failures on the native side that leave the JVM side running?
  • If the native side accesses memory allocated by the JVM, are the Java-side references reachable throughout the runtime of the native code? Otherwise the latter may try to access memory that's already been garbage-collected.
  • Is the JVM properly initialized? Depending on the expected data split between Java and native size of the programs, the max heap size will need to be properly sized with respect to the overall max process size.
SWIG is actually pretty good at keeping track of C++ allocations caused by Java, in that Java classes wrapping C++ ones are generated with a finalize method that calls operator delete on the pointer to the wrapped C++ instance. Therefore garbage collection will help avoid leaks (with the usual caveat that there are no guarantees on when finalizers are run). There are some subtleties to take into account, see the link above for more.

In our case, the chunks of memory of interest are:
  1. The parameter blocks, altogether comprising the state vector optimized by Ceres.
  2. Data used to compute the cost function for each residual block.
  3. The Jacobian matrices.
We really have no control on (3), short of rewriting big chunks of Ceres code, so storage for the Jacobians will be C++ owned, and we'll need to copy its content when required for the computation of the cost function derivatives JVM-side.

For (2) the design choice is a no brainer: these data are only "felt" by Ceres indirectly through the CostFunction interface, but they are not actually accessed by the library itself. Therefore there is really no reason for shifting their ownership to C++: the JVM can keep them just fine, thank you very much and see y'all later. 

We have some design freedom for (1), but it is rather easy to see that having that memory held by Java leads to significant code complexity for little gain. The parameter blocks need to be manipulated at a low level by the Ceres solver directly through pointers. If their memory were held by Java we'd need a shim of C code, including JNI env calls to GetArrayLength, GetDoubleArrayElements, to get in C++ the pointers to the Java arrays. In some JVM implementations these actually cause memory to be duplicated, but the saner ones prefer to "pin" the memory buffers to prevent the garbage collector from touching them. The logical place for such shims would be in the wrappers for Ceres's Problem::AddParameterBlock and Problem::AddResidualBlock. So we'd need to either explicit subclass Problem in C++ for the purposes of this port, or wrap it in a Decorator, or take advantage of SWIG's  %extend mechanism (which amounts to the same thing). However, we'd also need to release the Java memory from the C side every time we call the cost function, as well as when the solver is done, through a JNI env call to ReleaseDoubleArrayElements. If we don't do it, Java would either never "see" the result of the solver updates to the state (if the JVM copied the buffers), or never be able to GC the parameter blocks if they were pinned (thus leaking memory). Therefore we'd need another C shim, to be called during the optimization and at its end. The shim would need to be wrapped in Java, and there really isn't an obvious place for it. For the optimization end part, we could perhaps place it in the destructor of our Problem subclass (to be called by the finalizer of its Java wrap), but for this to work the subclass instance would need to keep track of all the parameter blocks it has handled, and this bookkeeping would further complicate the implementation of the AddParameterBloc/AddResidualBlock overloads. 

All in all, it seems much saner to let C++ own the parameter block memory, and just copy it out to Java when we are done. SWIG offers a nice helper for this approach in its carrays.i library: it allows one to allocate an array C-side and get an opaque pointer-like Java object for it, plus methods for reading from and writing to it. These methods cross the JNI boundary, so they are a not-so-cheap addition to the cost function evaluation, but this seems an acceptable price to pay at least for now.

So our design will have:
  1. Parameter blocks: C++ owned, copied to JVM upon cost function evaluation and after a solver's run.
  2. Cost function data: JVM-owned.
  3. Jacobians: C++ owned.
That's it for today. With this decision we are ready to attempt our first Ceres wrap.