Computer Science FSU Science

Code Walkthrough: Inverse Modeling

In previous articles (here and earlier here) I covered in some cursory way my current research into the California Current and about the technique of Inverse Modeling, so today I wanted to delve into the actual tools and code that I’ve developed.  To review, an inverse model is

An inverse model seeks to solve a series of linear equations (Soetaert and Van Oevelen 2014) which state our understanding of the nutrient flows involved in the ecosystem under study. For example, in our study we will be looking at carbon flows between plankton, fish, and various organic carbon pools (e.g. microzoa, sardines and detritus). This series of linear equations is then codified as a set of matrix equations.

Screen Shot 2015-04-19 at 20.52.16

Without delving too far into the linear algebra, which is beyond the scope of this article, these sets of equations represent–in the simplest case–the mass balance equations (1); the various field measurements (2); and a wide variety of physiological factors including growth rates and efficiencies, and observations (3).

The focus here on out will be on the code and any discussion of the theory, mathematics or oceanography involved in the research will be limited to that which is directly applicable to the code at hand. Links and notes for relevant aspects will be provided wherever possible. Let’s get into it then.

To provide a little bit of an overview, which I think is important whenever details might obscure larger scale structure, here is a diagram of the code hierarchy.

Basic structure of the code base shown without libraries or interconnectedness.
Basic structure of the code base shown without libraries or interconnectedness.

The primary entrance point to the code is through a file called ‘AnalysisHead.r’ and it allows us to run the model, analyze the results and other large scale tasks through it’s subordinate scripts. It also handles the primary UI (command line) and help functionality so it’s code is rather unpleasant to step through. I’ve opted to include the entire source code of the all the files in the Appendix at the end of the article, and throughout I will use code snippets and representations to illustrate the process.

To begin off we’ll start at the first commands to be executed and move on from there.

require(compiler)
enableJIT(3)
compilePKGS(enable=TRUE)

args <- commandArgs(trailingOnly = TRUE)			## Argumens of the form: <id> <options>

head(args)

The first calls all have to do with optimization[1] and are merely interesting to those esoteric few which find that stuff interesting (whom I count myself as one), but after those comes the real start. We take the args from the command line and pass them to the head() function which processes our input and starts the whole machine running. By using command line arguments we save ourselves from hard coding everything and makes trial-and-error time much lower.

Here is sample command-line call that initiates, -i, a new model run for data from cycle 3 and saves the results as 001.

Rscript AnalysisHead.r -i 3 001

This script calls MCMC.r which does all the setup for the model by reading in spreadsheets, organizing data and ultimately doing everything up until the simulation is actually running. The majority of the code here is simply picking values off of the spreadsheet and saving them in an organized vector or matrix, for example take a look at the code below.

  ## Read in the Ae matrix from spreadsheet
  Ae = t(data.matrix(sheetA)[,1:exact_eq])

  ## Pull specific values for h vector
  h[[12]] = sheetB[58]		#Smz-mn2 = Small Gr Min Res (100-surf)
  h[[13]] = sheetB[60]		#Lmz-mn2 = Lg Gr Min Res (100-surf)
  h[[14]] = sheetB[62]		#Gel-mn2 = Gel pred Min Res (100-surf)
  h[[15]] = sheetB[57]		#DeepSmz-mn2 = Sm Gr Min Res (450-100m)
  h[[16]] = sheetB[59]		#DeepLmz-mn2 = Lg Gr Min Res (450-100m)
  h[[17]] = sheetB[61]		#DeepGel-mn2 = GelPred Min Res (450-100m)
  h[[18]] = sheetB[56]		#Sar = Sardine Min Res

Once MCMC.r has read in the model’s structure, the matrices Aa, Ae and G are built along with their respective vectors \vec{ba}, \vec{be} and \vec{h} which together codify all the constraints of the inverse model. The matrices and vectors are then passed off to Xsample who is the workhorse of the Monte Carlo method employed in my work.

sol = xsample(A=Aa,B=ba, E=Ae, F=be, G=Gg, H=h, sdB=sdb, iter=iter_count, outputlength=iter_out,
              jmp=jmp, burninlength=iter_burn, type="mirror")

After a mater of minutes or hours, the Xsample script returns a data-frame[2] of possible model solutions along with the mean and standard deviation of all the solutions. The quantity of output, typically 1000’s of vectors, is the main asset of using a monte carlo method to explore the solution space of the model.

Let’s make sure we’re all on the same page, here is a script that should give a pretty good idea of what a Monte Carlo method is.

MC = function(n) {
    ## A = pi*r^2
    ## First, let's plot a cicle
    x = seq(0,1,0.01)
    plot(x=x, y = sqrt(1-x^2), xlim=c(0,1), ylim=c(0,1), type='l', ylab="y")
    
    ## Then generate points and plot them
    naccept = 0
    for (i in 1:n) {
        x = runif(1) ## Random x
        y = runif(1) ## Random y
        if (x^2+y^2 < 1) { ## Does point lie inside circle?
            naccept = naccept + 1
            points(x,y, pch=3)
        }
        else
            points(x,y,col="red",pch=3)
    }
    text(x=0,y=0.1,paste(naccept*4.0/n))
}

There are numerous formulas[3] to calculate pi, but don’t think that makes it easy. Here is a personal favorite[4]:30c013b91f90b075196b6b964c2c5010I’m pretty sure you need a masters in theoretical mathematics just to make sense of the formulas, so I’m going to calculate it with some brute force(i.e. a Monte Carlo simulation). To do this all you need is to check if a randomly generated point lies inside or outside of a circle. If my high school geometry class was right[5], a point is inside a unit circle if x^2+y^2 < 1.

The estimate with 100 points is 3.36.
The estimate with 100 points is 3.36.

By running the script for n=10, we are testing it for 10 points, and then by taking the ratio of the points inside the circle to outside the circle we can find the value of \pi. When only using 10 points, the estimated value of \pi = 3.6, but by taking more samples we improve our estimate. With 100,000 samples the script finds \pi=3.1416 which is accurate to 0.0002 percent. Not bad for a method that uses random numbers and some simple geometry.

For my work, there is no simple geometric analog, since instead of working in 2-D it is in 105-D; but nevertheless it works in much the same way. Xsample also uses some fancy algorithms to negotiate the region and does it’s best to give us an idea of the solution’s shape and size[6].

The majority of the Xsample script was written by Karel Van den Meersche et al.[7] and can be found in the limSolve package on CRAN[8], so the version I present here is optimized for my usage. Once the Xsample script returns, the MCMC.r function simply saves the data and generates some human-readable spreadsheets before closing.

save(sol, Aa,Ae, ba, be,Gg,h,sdb,cond, file=paste("./data/", condhash, ".RData",sep='') )  reateSheet(newwb, name="Data") createSheet(newwb, name="Sol") createSheet(newwb, name="G") createSheet(newwb, name="h") createSheet(newwb, name="Aa") createSheet(newwb, name="ba") createSheet(newwb, name="Ae") createSheet(newwb, name="be")</pre> Thinking through and figuring out ways to make accessing the output as painless as possible can be daunting, but once done the payout is 10 fold.  Just as important as saving the output of the simulation is analyzing the output of the simulation. This task falls to Analyze.r which generates nearly a dozen graphs and charts of nearly publication-quality, and all this requires is a simple command: <pre class="width-set:true width:300 lang:r decode:true">Rscript AnalysisHead.r -a <name></pre> [gallery columns="5" ids="1097,1167,1168,1169,1170,1171,1172,1173,1174,1092"]  So where does this leave us? We can now run the model, save the output and the generate output and analyses, but we still need to know what questions to ask. A good question is the basis for any productive scientific endeavor, even Lonergan knew that<a href="#References">[9]</a>.  <hr />  <h2 id="References">Notes</h2> <ol>  	<li>The <em>compiler</em> package in R permits Just In-Time (JIT) compilation of code. This is immensely useful for looping when vectorization is not practical. For a complete discussion on the optimization of my code, please see <a href="http://tkelly.org/2015/05/optimizing-r-code-in-my-model/" target="_blank">here</a>.</li>  	<li>A <em>data-frame</em> is a type of data structure in R. See this <a href="http://www.r-tutor.com/r-introduction/data-frame" target="_blank">page</a> for description.</li>  	<li><a href="https://en.wikipedia.org/wiki/Pi#Computer_era_and_iterative_algorithms" target="_blank">Wikipedia</a> has an excellent description of the history of the calculation of\pi.</li>  	<li>This formula, the Chudnovsky Algorithm, still amazes me every time I see it (<a href="https://en.wikipedia.org/wiki/Chudnovsky_algorithm" target="_blank">wiki</a>).</li>  	<li>Looks like geometry class was right after all (<a href="https://en.wikipedia.org/wiki/Circle#Equations" target="_blank">wikipedia</a>).</li>  	<li>The algorithm is actually a random walk Monte Carlo simulation where the next solution is correlated with the previous one with a known probability. This more advanced form of MC is useful when you need a bias towards <em>more realistic</em> solutions is required.</li>  	<li>Van den Meersche, K., Soetaert K., van Oevelen, D. (2009). xsample(): An R Function for Sampling Linear Inverse Problems. Journal of Statistical Software, Code Snippets, vol 30, 1-15.</li>  	<li>CRAN is a central repository of R packages, and the limSolve package can be found <a href="http://cran.r-project.org/web/packages/limSolve/index.html" target="_blank">here</a>.</li>  	<li>Lonergan's philosophy is quite apposite (<a href="http://tkelly.org/2015/05/philosophy-monday-lonergan-part-1/">link</a>).</li> </ol>    <hr />  <h1>Appendix</h1> Here is the entire source code of the file for clarity. <pre class="height-set:true lang:r decode:true" title="AnalysisHead.r"># AnalysisHead.r - 2015 Thomas Bryce Kelly # # This script is designed to be the main entrance point into the inverse modeling system. This script can be used to call: #   * MCMC2.r:mcmc(cond) : loads the worksheet and calculates the xsample solution based on <cond>. #   * Analysis.r:analyze(hash, list) : loads data from ./data/<hash>.RData file and generates the plots specified in <list> #   * continueRun.r:continueRun(hash, cond) : loads data from ./data/<hash>.RData and runs xsample based on <cond> AND from last solution of loaded data. #  # <cond> : data.frame(cycle=NULL, iter_count=NULL, iter_out=NULL, iter_burn=NULL, jmp=NULL, hash=NULL) # <list> : c(0,0,0,0,0,0,0)  require(compiler) enableJIT(3) compilePKGS(enable=TRUE)  args <- commandArgs(trailingOnly = TRUE)		## Arguments of the form: <id> <options>   head = function(args) { setwd("~/Dropbox/FSU/Inverse Modeling") suppressMessages(library("digest")) suppressMessages(source("Analysis.r")) suppressMessages(source("MCMC.r")) suppressMessages(source("continueRun.r"))  ########### All of the functions follow:  help = function() { 	print("To use the program please follow the outlines below") 	print("    -i <cycle> <hash>		- Runs mcmc with the specified conditions") 	print("    -a <hash>				- Analyzes the data in <hash>") 	print("    -c <hashs>				- Continues the runs in <hashs> with new conditions") 	print("    -c2 <hashs>				- Continues the runs in <hashs> with old conditions") }  init = function(cycle=3, hash=digest(algo='crc32', runif(1))) { 	iter = 10000000 	cond = data.frame(cycle=as.numeric(cycle), iter_count=iter, iter_out=1000, iter_burn=iter/10, jmp=4, hash=hash) 	mcmc(cond) 	print(condhash)
}

## Continue Runs for elements of <l> ##
continue = function(l) {
	cond = data.frame(cycle=4, iter_count=2500000, iter_out=1000, iter_burn=1000000, jmp=4, hash="00000000")
	for (i in c(1:length(l))){
		continueRun(l[i],cond)
	}
}

## Continue Runs for elements of <l> based on original cond ##
continue2 = function(l) {
	for (i in c(1:length(l))){
		print(paste("Running continue run 1 for", i))
		continueRun2(l[i])
	}
}

## Run Analysis for elements of <l> ##

for (i in c(1:length(l))) {
#  analyze(l[i], c(1,1,1,1,1,1,1))
#}

## Analyze set for elements in <l>
anset = function(l) {
	analyzeSet(l,plots=c(1,1,1,1,1,1,1))
}

## Analyze run data
ana = function(l) {
	for ( i in 1:length(l)) {
		analyze(hash, c(1,1,1,1,1,1,1))
	}
}

####### The Command structure


print(args[1]=='-h')
if(length(args) < 1) {
	print("Please specify an option")
	help()
	
}

if (args[1] == '-h') { help() }
if (args[1] == '-i') {
	if (length(args) == 3) {
		init(args[2], args[3])
	} else {
		print("Insufficient argumetns")
		help()
	}
}
if (args[1] == '-a') {
	if (length(args) == 2) ana(c(args[2]))
	else {
		print("Insufficient argumetns")
		help()
	}
}
}

head(args)

 

 

#' MCMC2.r - 2015 Thomas Bryce Kelly
#' 
#' This script is the first stage of the inverse modeling workflow. Here the excel model is imported and the matricies and arrays are generated. Finally xsample() is called using <cond>. The important function calls are:
#'   * XLConnect:loadWorkbook(path) : Loads the workbook
#'   * XLConnect:readWorksheet(workbook, sheet, ...) : Reads data from workbook.
#'   * misc:log(string, cond) : This uses the method specified in misc.r to save to the ./misc.csv file the <string> followed by a list of values within <cond>. Useful for later reference.
#'   * Twitter.r:up(title, cond, name) : This function uploads the file <name>, logs the upload, and notifies via twitter that <title> was uploaded.
#'   * Twitter.r:notify(title, cond, name) : This function notifies via twitter that <name> has been uploaded (must be done separately) and logs the event.
#'   * save() : Standard system save(). By default model data is saved in ./data/<condhash>.RData. #'   * xsample.r:xsample( <see fn call below> ) : Used the data loaded from load(). By default uses the last solution found in solX to initalize x0. This can be changed.
#' 
#' <cond> : data.frame(cycle=NULL, iter_count=NULL, iter_out=NULL, iter_burn=NULL, jmp=NULL, hash=NULL)

mcmc = function(cond=data.frame(cycle=1, iter_count=100000, iter_out=100, iter_burn=10000, jmp=1, hash=NULL) {
  library(XLConnect)
  library(Matrix)
  library(MASS)
  library(ggplot2)
  library(Hmisc)
  library(limSolve)
  library(digest)
  library(diagram)
  
  source("xsample.r")
  source("Twitter.r")
  source("misc.r")
  
  ## Setup environment
  WB = loadWorkbook("./Spreadsheets/EndToEnd2Layer_TK105.xls")
  log(paste("Loaded Workbook",WB@filename),cond)
  start_cell = c(5,6)
  exact_eq = 23
  approx_eq = 14
  inequal = 79
  flows = 105
  
  if(is.null(condhash))   	hash = digest(algo="crc32", runif(1))      ## Read in the data   sheetA = readWorksheet(WB, sheet="AllFlows", start_cell[1], start_cell[2] ,start_cell[1] + flows, start_cell[2]+approx_eq+exact_eq+inequal+1, colTypes="numeric" )   sheetB = readWorksheet(WB, sheet="CycleData", 2, 4,68, 20, colTypes="numeric" )   lab = as.vector(readWorksheet(WB, sheet="AllFlows", start_cell[1], 1,start_cell[1]+flows, 1, colTypes="character" ))   sheetB = data.matrix(sheetB)      cycle_CI = as.numeric(sheetB[,2*condcycle])/2.0
  sheetB = sheetB[,2*cond