Code Walkthrough: Inverse Modeling

In previous articles (here and earlier here) I covered, in a cursory way, my current research into the California Current and the technique of inverse modeling, so today I want to delve into the actual tools and code that I've developed. To review:

An inverse model seeks to solve a series of linear equations (Soetaert and Van Oevelen 2014) which state our understanding of the nutrient flows involved in the ecosystem under study. For example, in our study we will be looking at carbon flows between plankton, fish, and various organic carbon pools (e.g. microzoa, sardines and detritus). This series of linear equations is then codified as a set of matrix equations.

A_e \vec{x} = \vec{b_e}   (1)
A_a \vec{x} \simeq \vec{b_a}   (2)
G \vec{x} \geq \vec{h}   (3)

Without delving too far into the linear algebra, which is beyond the scope of this article, these sets of equations represent, in the simplest case, the mass balance equations (1); the various field measurements (2); and a wide variety of physiological factors and observations, including growth rates and efficiencies (3).

From here on out the focus will be on the code, and any discussion of the theory, mathematics, or oceanography involved in the research will be limited to what is directly applicable to the code at hand. Links and notes for relevant aspects will be provided wherever possible. Let's get into it then.

To provide a little bit of an overview, which I think is important whenever details might obscure larger scale structure, here is a diagram of the code hierarchy.

Basic structure of the code base shown without libraries or interconnectedness.

The primary entry point to the code is a file called 'AnalysisHead.r', which lets us run the model, analyze the results, and perform other large-scale tasks through its subordinate scripts. It also handles the primary UI (command line) and the help functionality, so its code is rather unpleasant to step through. I've opted to include the entire source code of all the files in the Appendix at the end of the article, and throughout I will use code snippets and representations to illustrate the process.

To begin, we'll start with the first commands to be executed and move on from there.

The first calls all have to do with optimization[1] and are interesting mainly to those esoteric few who find that stuff fascinating (among whom I count myself), but after those comes the real start. We take the args from the command line and pass them to the head() function, which processes our input and starts the whole machine running. Using command-line arguments saves us from hard-coding everything and keeps trial-and-error time much lower.
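In code, that hand-off looks roughly like the sketch below; head() here is the project's own driver function described above (not utils::head), and the exact arguments it expects are not shown in this article.

```r
## Minimal sketch of the hand-off from the command line to the driver.
## head() is the project's own top-level function, not utils::head().
args <- commandArgs(trailingOnly = TRUE)   # e.g. c("-i", "3", "001")
head(args)
```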

Here is a sample command-line call that initiates (-i) a new model run for data from cycle 3 and saves the results as 001:
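(The Rscript invocation and the exact order of the flag and arguments below are my assumptions from that description.)

```
Rscript AnalysisHead.r -i 3 001
```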

This script calls MCMC.r, which does all the setup for the model by reading in spreadsheets, organizing data, and ultimately doing everything up until the simulation is actually running. The majority of the code here is simply picking values off the spreadsheet and saving them in an organized vector or matrix; for example, take a look at the code below.
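(The snippet below is only a simplified sketch of that kind of bookkeeping; the file contents, column names, and flow names are all made up, not the real MCMC.r code.)

```r
## Simplified sketch: read a table of flow definitions and collect the
## values into a named vector and a constraint matrix.
model <- read.csv(text = "flow,value
gpp_to_phy,100
phy_to_det,40
phy_to_zoo,35", stringsAsFactors = FALSE)

## Pick the values off the table into an organized, named vector
flow_values        <- model$value
names(flow_values) <- model$flow

## Build a constraint matrix with one column per flow, to be filled in
## one constraint (row) at a time
constraints <- matrix(0, nrow = 2, ncol = nrow(model),
                      dimnames = list(NULL, model$flow))
constraints[1, "gpp_to_phy"] <- 1
constraints
```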

Once MCMC.r has read in the model's structure, the matrices Aa, Ae and G are built along with their respective vectors \vec{ba}, \vec{be} and \vec{h}, which together codify all the constraints of the inverse model. The matrices and vectors are then passed off to Xsample, which is the workhorse of the Monte Carlo method employed in my work.
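That hand-off uses the xsample() interface from limSolve; the two-flow toy problem below is only a stand-in for the real Aa, Ae, G and their vectors, to show the shape of the call (my version of the script is modified, so treat this as an illustration of the published interface rather than the actual code).

```r
## Toy illustration of handing constraint matrices to limSolve::xsample().
library(limSolve)

Ae <- matrix(c(1, 1), nrow = 1); be <- 1        # equality:    x1 + x2 = 1
Aa <- matrix(c(1, 0), nrow = 1); ba <- 0.6      # measurement: x1 ~ 0.6
G  <- diag(2);                   h  <- c(0, 0)  # both flows non-negative

xs <- xsample(A = Aa, B = ba, E = Ae, F = be, G = G, H = h,
              iter = 3000, type = "mirror")

solutions <- xs$X          # one row per sampled solution vector
colMeans(solutions)        # mean flow estimates
apply(solutions, 2, sd)    # and their standard deviations
```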

After a matter of minutes or hours, the Xsample script returns a data frame[2] of possible model solutions along with the mean and standard deviation of all the solutions. The quantity of output, typically thousands of vectors, is the main asset of using a Monte Carlo method to explore the solution space of the model.

Let's make sure we're all on the same page: below is a short script that should give a pretty good idea of what a Monte Carlo method is.

There are numerous formulas[3] for calculating \pi, but don't think that makes it easy. Here is a personal favorite[4]:

\frac{1}{\pi} = 12 \sum_{k=0}^{\infty} \frac{(-1)^k (6k)! (13591409 + 545140134k)}{(3k)! (k!)^3 640320^{3k+3/2}}

I'm pretty sure you need a master's in theoretical mathematics just to make sense of formulas like that, so I'm going to calculate \pi with some brute force (i.e. a Monte Carlo simulation). To do this, all you need is to check whether a randomly generated point lies inside or outside of a circle. If my high school geometry class was right[5], a point is inside the unit circle if x^2+y^2 < 1.
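Here is a minimal version of such a script (my own sketch of the approach, not necessarily the exact code behind the figures below):

```r
## Estimate pi by scattering random points on the unit square and counting
## how many land inside the quarter circle of radius one.
estimate_pi <- function(n) {
  x <- runif(n)
  y <- runif(n)
  inside <- x^2 + y^2 < 1    # the high-school geometry test
  4 * sum(inside) / n        # inside/total approximates pi/4
}

set.seed(1)
estimate_pi(10)       # very rough with only 10 points
estimate_pi(100000)   # much better with 1e5 points
```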

The estimate with 100 points is 3.36.

By running the script with n = 10 we are testing it with 10 points, and by taking the ratio of the points inside the circle to the total number of points (and multiplying by four) we can estimate the value of \pi. When only using 10 points the estimate comes out to \pi = 3.6, but by taking more samples we improve it. With 100,000 samples the script finds \pi = 3.1416, which is accurate to about 0.0002 percent. Not bad for a method that uses random numbers and some simple geometry.

For my work there is no simple geometric analog, since instead of working in 2-D we are working in 105-D; but nevertheless it works in much the same way. Xsample also uses some fancy algorithms to negotiate the region and does its best to give us an idea of the solution space's shape and size[6].

The majority of the Xsample script was written by Karel Van den Meersche et al.[7] and can be found in the limSolve package on CRAN[8], so the version I present here is optimized for my usage. Once the Xsample script returns, the MCMC.r function simply saves the data and generates some human-readable spreadsheets before closing.

Thinking through and figuring out ways to make accessing the output as painless as possible can be daunting, but once done the payoff is tenfold.

Just as important as saving the output of the simulation is analyzing it. This task falls to Analyze.r, which generates nearly a dozen graphs and charts of near publication quality, and all it requires is a simple command:
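(A sketch only: the Rscript invocation and the -a flag are my hypothetical stand-ins for the analysis mode, which isn't spelled out here; the cycle number and run label follow the earlier example.)

```
Rscript AnalysisHead.r -a 3 001
```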

So where does this leave us? We can now run the model, save the output, and generate the analyses, but we still need to know what questions to ask. A good question is the basis for any productive scientific endeavor; even Lonergan knew that[9].


Notes

  1. The compiler package in R permits Just-In-Time (JIT) compilation of code. This is immensely useful for looping when vectorization is not practical; a minimal example follows these notes. For a complete discussion of the optimization of my code, please see here.
  2. A data-frame is a type of data structure in R. See this page for description.
  3. Wikipedia has an excellent description of the history of the calculation of \pi.
  4. This formula, the Chudnovsky Algorithm, still amazes me every time I see it (wiki).
  5. Looks like geometry class was right after all (wikipedia).
  6. The algorithm is actually a random walk Monte Carlo simulation where the next solution is correlated with the previous one with a known probability. This more advanced form of MC is useful when a bias towards more realistic solutions is required.
  7. Van den Meersche, K., Soetaert, K., and van Oevelen, D. (2009). xsample(): An R Function for Sampling Linear Inverse Problems. Journal of Statistical Software, Code Snippets, 30(1), 1-15.
  8. CRAN is a central repository of R packages, and the limSolve package can be found here.
  9. Lonergan’s philosophy is quite apposite (link).
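As mentioned in note 1, enabling JIT compilation is a one-liner with the base compiler package:

```r
## Enable Just-In-Time compilation of R functions; level 3 compiles all
## closures before their first use (part of the base 'compiler' package).
library(compiler)
enableJIT(3)
```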



Appendix

Here is the entire source code of all the files, for clarity.