In previous articles (here and earlier here) I covered in some cursory way my current research into the California Current and about the technique of Inverse Modeling, so today I wanted to delve into the actual tools and code that I’ve developed. To review, an inverse model is

An inverse model seeks to solve a series of linear equations (Soetaert and Van Oevelen 2014) which state our understanding of the nutrient flows involved in the ecosystem under study. For example, in our study we will be looking at carbon flows between plankton, fish, and various organic carbon pools (e.g. microzoa, sardines and detritus). This series of linear equations is then codified as a set of matrix equations.

Without delving too far into the linear algebra, which is beyond the scope of this article, these sets of equations represent–in the simplest case–the mass balance equations (1); the various field measurements (2); and a wide variety of physiological factors including growth rates and efficiencies, and observations (3).

The focus here on out will be on the code and any discussion of the theory, mathematics or oceanography involved in the research will be limited to that which is directly applicable to the code at hand. Links and notes for relevant aspects will be provided wherever possible. Let’s get into it then.

To provide a little bit of an overview, which I think is important whenever details might obscure larger scale structure, here is a diagram of the code hierarchy.

The primary entrance point to the code is through a file called ‘AnalysisHead.r’ and it allows us to run the model, analyze the results and other large scale tasks through it’s subordinate scripts. It also handles the primary UI (command line) and help functionality so it’s code is rather unpleasant to step through. I’ve opted to include the entire source code of the all the files in the Appendix at the end of the article, and throughout I will use code snippets and representations to illustrate the process.

To begin off we’ll start at the first commands to be executed and move on from there.

require(compiler) enableJIT(3) compilePKGS(enable=TRUE) args <- commandArgs(trailingOnly = TRUE) ## Argumens of the form: <id> <options> head(args)

The first calls all have to do with optimization[1] and are merely interesting to those esoteric few which find that stuff interesting (whom I count myself as one), but after those comes the real start. We take the **args** from the command line and pass them to the **head()** function which processes our input and starts the whole machine running. By using command line arguments we save ourselves from hard coding everything and makes trial-and-error time much lower.

Here is sample command-line call that initiates, **-i**, a new model run for data from cycle **3** and saves the results as **001**.

Rscript AnalysisHead.r -i 3 001

This script calls MCMC.r which does all the setup for the model by reading in spreadsheets, organizing data and ultimately doing everything up until the simulation is actually running. The majority of the code here is simply picking values off of the spreadsheet and saving them in an organized vector or matrix, for example take a look at the code below.

## Read in the Ae matrix from spreadsheet Ae = t(data.matrix(sheetA)[,1:exact_eq]) ## Pull specific values for h vector h[[12]] = sheetB[58] #Smz-mn2 = Small Gr Min Res (100-surf) h[[13]] = sheetB[60] #Lmz-mn2 = Lg Gr Min Res (100-surf) h[[14]] = sheetB[62] #Gel-mn2 = Gel pred Min Res (100-surf) h[[15]] = sheetB[57] #DeepSmz-mn2 = Sm Gr Min Res (450-100m) h[[16]] = sheetB[59] #DeepLmz-mn2 = Lg Gr Min Res (450-100m) h[[17]] = sheetB[61] #DeepGel-mn2 = GelPred Min Res (450-100m) h[[18]] = sheetB[56] #Sar = Sardine Min Res

Once MCMC.r has read in the model’s structure, the matrices , and are built along with their respective vectors , and which together codify all the constraints of the inverse model. The matrices and vectors are then passed off to Xsample who is the workhorse of the Monte Carlo method employed in my work.

sol = xsample(A=Aa,B=ba, E=Ae, F=be, G=Gg, H=h, sdB=sdb, iter=iter_count, outputlength=iter_out, jmp=jmp, burninlength=iter_burn, type="mirror")

After a mater of minutes or hours, the Xsample script returns a data-frame[2] of possible model solutions along with the mean and standard deviation of all the solutions. The quantity of output, typically 1000’s of vectors, is the main asset of using a monte carlo method to explore the solution space of the model.

Let’s make sure we’re all on the same page, here is a script that should give a pretty good idea of what a Monte Carlo method is.

MC = function(n) { ## A = pi*r^2 ## First, let's plot a cicle x = seq(0,1,0.01) plot(x=x, y = sqrt(1-x^2), xlim=c(0,1), ylim=c(0,1), type='l', ylab="y") ## Then generate points and plot them naccept = 0 for (i in 1:n) { x = runif(1) ## Random x y = runif(1) ## Random y if (x^2+y^2 < 1) { ## Does point lie inside circle? naccept = naccept + 1 points(x,y, pch=3) } else points(x,y,col="red",pch=3) } text(x=0,y=0.1,paste(naccept*4.0/n)) }

There are numerous formulas[3] to calculate pi, but don’t think that makes it easy. Here is a personal favorite[4]:I’m pretty sure you need a masters in theoretical mathematics just to make sense of the formulas, so I’m going to calculate it with some brute force(i.e. a Monte Carlo simulation). To do this all you need is to check if a randomly generated point lies inside or outside of a circle. If my high school geometry class was right[5], a point is inside a unit circle if .

By running the script for , we are testing it for 10 points, and then by taking the ratio of the points inside the circle to outside the circle we can find the value of . When only using 10 points, the estimated value of , but by taking more samples we improve our estimate. With 100,000 samples the script finds which is accurate to 0.0002 percent. Not bad for a method that uses random numbers and some simple geometry.

For my work, there is no simple geometric analog, since instead of working in 2-D it is in 105-D; but nevertheless it works in much the same way. Xsample also uses some fancy algorithms to negotiate the region and does it’s best to give us an idea of the solution’s shape and size[6].

The majority of the Xsample script was written by Karel Van den Meersche et al.[7] and can be found in the *limSolve* package on CRAN[8], so the version I present here is optimized for my usage. Once the Xsample script returns, the MCMC.r function simply saves the data and generates some human-readable spreadsheets before closing.

save(sol, Aa,Ae, ba, be,Gg,h,sdb,cond, file=paste("./data/", cond\pihash) } ## Continue Runs for elements of <l> ## continue = function(l) { cond = data.frame(cycle=4, iter_count=2500000, iter_out=1000, iter_burn=1000000, jmp=4, hash="00000000") for (i in c(1:length(l))){ continueRun(l[i],cond) } } ## Continue Runs for elements of <l> based on original cond ## continue2 = function(l) { for (i in c(1:length(l))){ print(paste("Running continue run 1 for", i)) continueRun2(l[i]) } } ## Run Analysis for elements of <l> ## for (i in c(1:length(l))) { # analyze(l[i], c(1,1,1,1,1,1,1)) #} ## Analyze set for elements in <l> anset = function(l) { analyzeSet(l,plots=c(1,1,1,1,1,1,1)) } ## Analyze run data ana = function(l) { for ( i in 1:length(l)) { analyze(hash, c(1,1,1,1,1,1,1)) } } ####### The Command structure print(args[1]=='-h') if(length(args) < 1) { print("Please specify an option") help() } if (args[1] == '-h') { help() } if (args[1] == '-i') { if (length(args) == 3) { init(args[2], args[3]) } else { print("Insufficient argumetns") help() } } if (args[1] == '-a') { if (length(args) == 2) ana(c(args[2])) else { print("Insufficient argumetns") help() } } } head(args)

#' MCMC2.r - 2015 Thomas Bryce Kelly #' #' This script is the first stage of the inverse modeling workflow. Here the excel model is imported and the matricies and arrays are generated. Finally xsample() is called using <cond>. The important function calls are: #' * XLConnect:loadWorkbook(path) : Loads the workbook #' * XLConnect:readWorksheet(workbook, sheet, ...) : Reads data from workbook. #' * misc:log(string, cond) : This uses the method specified in misc.r to save to the ./misc.csv file the <string> followed by a list of values within <cond>. Useful for later reference. #' * Twitter.r:up(title, cond, name) : This function uploads the file <name>, logs the upload, and notifies via twitter that <title> was uploaded. #' * Twitter.r:notify(title, cond, name) : This function notifies via twitter that <name> has been uploaded (must be done separately) and logs the event. #' * save() : Standard system save(). By default model data is saved in ./data/<condX to initalize x0. This can be changed. #' #' <cond> : data.frame(cycle=NULL, iter_count=NULL, iter_out=NULL, iter_burn=NULL, jmp=NULL, hash=NULL) mcmc = function(cond=data.frame(cycle=1, iter_count=100000, iter_out=100, iter_burn=10000, jmp=1, hash=NULL) { library(XLConnect) library(Matrix) library(MASS) library(ggplot2) library(Hmisc) library(limSolve) library(digest) library(diagram) source("xsample.r") source("Twitter.r") source("misc.r") ## Setup environment WB = loadWorkbook("./Spreadsheets/EndToEnd2Layer_TK105.xls") log(paste("Loaded Workbook",WB@filename),cond) start_cell = c(5,6) exact_eq = 23 approx_eq = 14 inequal = 79 flows = 105 if(is.null(condcycle])/2.0 sheetB = sheetB[,2*cond