Monday, January 18, 2016

On calling Ceres from Scala - 4.2: Hello World, a brief note on speed

This is the sixth installment of a series documenting my progress on porting the Ceres solver API to Scala. All my sources quoted below are available on github.

Hello World, A Brief Note on Speed


We all know how, in computer science, speed is often reported in terms of lies, big lies and benchmarks. However, silly benchmarks are fun, so let's write one.

On my 2.3 GHz i7 mac, the ceres-reported timings for a run of the native C++ implementation of the "helloworld" example are:

$ ./bin/helloworld
iter cost cost_change |gradient| |step| tr_ratio tr_radius ls_iter iter_time total_time
0 4.512500e+01 0.00e+00 9.50e+00 0.00e+00 0.00e+00 1.00e+04 0 2.00e-05 1.76e-04
1 4.511598e-07 4.51e+01 9.50e-04 9.50e+00 1.00e+00 3.00e+04 1 7.89e-05 3.20e-04
2 5.012552e-16 4.51e-07 3.17e-08 9.50e-04 1.00e+00 9.00e+04 1 1.10e-05 3.49e-04
Ceres Solver Report: Iterations: 2, Initial cost: 4.512500e+01, Final cost: 5.012552e-16, Termination: CONVERGENCE
x : 0.5 -> 10

My Scala implementation based on the SWIG-generated wrap (HelloWorld.scala) instead clocks at:

+ export LD_LIBRARY_PATH=.:/Users/francesco/src/ceres-bin/lib:/opt/twitter/Cellar//glog/0.3.4/lib:/opt/twitter/Cellar//gflags/2.1.2/lib
+ LD_LIBRARY_PATH=.:/Users/francesco/src/ceres-bin/lib:/opt/twitter/Cellar//glog/0.3.4/lib:/opt/twitter/Cellar//gflags/2.1.2/lib
+ scala -classpath classes org.somelightprojections.skeres.HelloWorld
iter cost cost_change |gradient| |step| tr_ratio tr_radius ls_iter iter_time total_time
0 4.512500e+01 0.00e+00 9.50e+00 0.00e+00 0.00e+00 1.00e+04 0 2.23e-03 1.66e-02
1 4.511598e-07 4.51e+01 9.50e-04 9.50e+00 1.00e+00 3.00e+04 1 1.27e-03 1.84e-02
2 5.012552e-16 4.51e-07 3.17e-08 9.50e-04 1.00e+00 9.00e+04 1 3.10e-05 1.85e-02
Ceres Solver Report: Iterations: 2, Initial cost: 4.512500e+01, Final cost: 5.012552e-16, Termination: CONVERGENCE
x : 0.5 -> 9.999999968337555
view raw HelloWorld run hosted with ❤ by GitHub
This is not an apple-to-apple comparison, as the C++ version's cost function uses automatic differentiation, but the AD overhead should be in the noise.

So Scala is about 3x slower in terms of iteration time, and about 53x slower in total time, and that's OMG horrible right?

Well, not quite. The native version is statically optimized ("c++ -O3") in its entirety, whereas the port side calls JVM routines at every cost function evaluation. So, to be fair, we ought to give the JIT a chance to work its magic.

Running the optimization problem Scala-side 10 times in a row in the same process, but reinitializing all data structures at every run, should give us a better idea of the JIT-induced speedups.

+ export LD_LIBRARY_PATH=.:/Users/francesco/src/ceres-bin/lib:/opt/twitter/Cellar//glog/0.3.4/lib:/opt/twitter/Cellar//gflags/2.1.2/lib
+ LD_LIBRARY_PATH=.:/Users/francesco/src/ceres-bin/lib:/opt/twitter/Cellar//glog/0.3.4/lib:/opt/twitter/Cellar//gflags/2.1.2/lib
+ scala -classpath classes org.somelightprojections.skeres.HelloWorld
iter cost cost_change |gradient| |step| tr_ratio tr_radius ls_iter iter_time total_time
0 4.512500e+01 0.00e+00 9.50e+00 0.00e+00 0.00e+00 1.00e+04 0 1.03e-03 1.14e-03
1 4.511598e-07 4.51e+01 9.50e-04 9.50e+00 1.00e+00 3.00e+04 1 7.80e-05 1.28e-03
2 5.012552e-16 4.51e-07 3.17e-08 9.50e-04 1.00e+00 9.00e+04 1 2.19e-05 1.32e-03
Ceres Solver Report: Iterations: 2, Initial cost: 4.512500e+01, Final cost: 5.012552e-16, Termination: CONVERGENCE
x : 0.5 -> 9.999999968337555
iter cost cost_change |gradient| |step| tr_ratio tr_radius ls_iter iter_time total_time
0 4.512500e+01 0.00e+00 9.50e+00 0.00e+00 0.00e+00 1.00e+04 0 2.72e-05 6.51e-05
1 4.511598e-07 4.51e+01 9.50e-04 9.50e+00 1.00e+00 3.00e+04 1 5.10e-05 1.48e-04
2 5.012552e-16 4.51e-07 3.17e-08 9.50e-04 1.00e+00 9.00e+04 1 4.41e-05 2.17e-04
Ceres Solver Report: Iterations: 2, Initial cost: 4.512500e+01, Final cost: 5.012552e-16, Termination: CONVERGENCE
x : 0.5 -> 9.999999968337555
[...]
iter cost cost_change |gradient| |step| tr_ratio tr_radius ls_iter iter_time total_time
0 4.512500e+01 0.00e+00 9.50e+00 0.00e+00 0.00e+00 1.00e+04 0 1.10e-05 2.41e-05
1 4.511598e-07 4.51e+01 9.50e-04 9.50e+00 1.00e+00 3.00e+04 1 2.69e-05 7.41e-05
2 5.012552e-16 4.51e-07 3.17e-08 9.50e-04 1.00e+00 9.00e+04 1 1.91e-05 1.05e-04
Ceres Solver Report: Iterations: 2, Initial cost: 4.512500e+01, Final cost: 5.012552e-16, Termination: CONVERGENCE
x : 0.5 -> 9.999999968337555

Note the rather dramatic speedup after the first iteration, and the eventual convergence to essentially the same runtime as the native implementation.

This is, of course, just a toy benchmark, but the short answer to "Yo, how much slower is this port going to be compared to native?" should be (a) there is no short answer and (b) tell me more about your use case.

No comments:

Post a Comment