Scientists handle a lot of data. I mean a lot of data. The type and format of the data varies, but the predominant format used in oceanography is definitely the humble spreadsheet. Without firsthand knowledge of the practices common in other disciplines I cannot comment on the universality of spreadsheets, but I imagine the situation is similar.
Spreadsheets are a natural way to organize our data since the manipulation of the data is straightforward (e.g. fill column D with the sum of columns B and C), the files can be opened by collaborators, and the layout is flexible enough to organize most datasets. Consider an example data set consisting of time series of temperature and salinity:
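A hypothetical slice of such a spreadsheet might look like the following (all values invented), sketched here as an R data frame so we can reuse it in later examples:

```r
# A hypothetical slice of a WIDE-format temperature/salinity time series
# (all values invented)
wide <- data.frame(
  Date        = rep("2013-08-01", 4),
  Time        = c("00:00", "00:15", "00:30", "00:45"),
  Temperature = c(11.2, 11.3, 11.1, 11.2),  # degrees C
  Salinity    = c(31.0, 31.1, 31.0, 30.9)   # practical salinity units
)
```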
This is a very natural format for such data, but unfortunately such spreadsheets can quickly get away from us as datasets expand to encompass longer durations and more features (e.g. Chlorophyll, pH, etc.). Many of my own spreadsheets have become just such monstrosities, covering thousands of rows and hundreds of columns. It is no wonder that sifting through these piles for a single set of numbers or a single observation has become so time-consuming and frustrating.
Yet one of the worst parts of our current spreadsheet strategy shows up when we attempt to condense our datasets into trimmed-down versions. Consider a dataset like the one depicted above, except that instead of a half-dozen time stamps it extends to ten thousand. Perhaps I only care about the average value over each hour, so I do the reasonable thing: add two new columns called “Avg T” and “Avg S” and type in the appropriate formulas. The result is something like this:
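Continuing the hypothetical R sketch from above: with samples every 15 minutes, the hourly averages land on only one row in four, and the new columns sit blank everywhere else:

```r
# The same sheet after adding the two average columns; the averages occupy
# one row per hour and the rest of each column is empty (NA)
wide$`Avg T` <- c(NA, NA, NA, mean(wide$Temperature))
wide$`Avg S` <- c(NA, NA, NA, mean(wide$Salinity))
```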
This is the point where the spreadsheet starts to fall apart. We just added two mostly-empty columns. Not only is this a recipe for extending the number of spreadsheet columns ad infinitum, but it also highlights the single worst aspect of how we traditionally organize our data.
The two examples above are both instances of what is termed a WIDE data format, where the primary descriptors are written along the top row and the data themselves lie on subsequent rows. While this is intuitively how the data are actually recorded, it makes it very difficult to merge datasets or to generalize existing ones. What I propose should be standard in organizing data[1] is the LONG data format. The LONG format, as its name implies, uses fewer columns at the expense of additional rows. The top row consists of classes of information, with features and data written below. Here is an example of the same data as above in the LONG format:
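Sketching the conversion in R: the melt function from the reshape2 package (the successor to the reshape library mentioned in the footnotes) collapses the two measurement columns of our hypothetical sheet into Feature/Value rows:

```r
library(reshape2)

# Pivot the hypothetical WIDE sheet: the measurement columns collapse into
# Feature/Value pairs, one row per individual observation
long <- melt(wide[, c("Date", "Time", "Temperature", "Salinity")],
             id.vars       = c("Date", "Time"),
             variable.name = "Feature",
             value.name    = "Value")
head(long)
#         Date  Time     Feature Value
# 1 2013-08-01 00:00 Temperature  11.2
# 2 2013-08-01 00:15 Temperature  11.3
# ...
# 5 2013-08-01 00:00    Salinity  31.0
```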
Notice how instead of adding a new column for a new measurement feature (e.g. pH) we simply add new rows. The LONG format certainly results in very long spreadsheets, but it makes searching, slicing up, and summarizing the data much more natural. With a data structure like the one above, the powers of parametric searching become available, so that, for example, one could search an extensive spreadsheet for all the Chlorophyll concentration data from depths greater than 5 meters in the month of August of any year. Such searching is well established and used extensively in millions of databases across the globe, and those databases use the LONG data format as well.
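To make that concrete in the same R sketch: assuming a hypothetical LONG table db that also carries Depth metadata and Chlorophyll rows (neither is in the toy table above), that query is a one-liner, and summaries are just as direct:

```r
# Hypothetical query against a larger LONG table 'db' (assumed to carry
# Date, Depth, Feature, and Value columns): all Chlorophyll measurements
# from deeper than 5 m in the month of August of any year
aug_chl <- subset(db, Feature == "Chlorophyll" &
                      Depth > 5 &
                      format(as.Date(Date), "%m") == "08")

# Summarizing is equally direct: hourly means per feature, with no
# mostly-empty columns bolted onto the sheet
long$Hour <- substr(long$Time, 1, 2)
hourly    <- aggregate(Value ~ Date + Hour + Feature, data = long, FUN = mean)
```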
The LONG format certainly isn’t perfect either, and depending on what the data are actually going to be used for (analysis, for instance), the WIDE format may well be preferable. I would argue, however, that for data organization and long-term storage (e.g. when building a personal dataset for, say, a dissertation), the LONG format should be preferred, for the search functionality alone if nothing else.
After an extensive search I have been unable to find any piece of software that can take existing spreadsheets and, through some wizard or algorithmic magic, merge them into a [LONG form] database[2]. Imagine being able to take the plethora of spreadsheets on your hard drive and run them through some software where you select which columns are valuable metadata, which are features, and which are values, and end up with a completely searchable, slice-able, and manageable database. Such a database could be shared amongst a lab group, a department, or colleagues and the community at large with very little effort. There is also no reason why the more common WIDE-format spreadsheets could not be exported directly from the database with the results of your search (e.g. a spreadsheet of all the various iron profiles from all the cruises in the Atlantic north of 30 degrees). Such a tool would be almost too good to be true.
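The export half, at least, is mechanically straightforward today. A sketch with reshape2, again assuming the hypothetical db table along with made-up Station, Latitude, and Iron entries: dcast pivots a filtered LONG table back into an ordinary WIDE spreadsheet.

```r
# Sketch: filter the hypothetical LONG database down to the iron data from
# the Atlantic north of 30 degrees, pivot back to WIDE, and save a normal
# spreadsheet for collaborators
atlantic_fe <- subset(db, Feature == "Iron" & Latitude > 30)
wide_out    <- dcast(atlantic_fe, Station + Depth ~ Feature,
                     value.var = "Value")
write.csv(wide_out, "atlantic_iron_profiles.csv", row.names = FALSE)
```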
Let me know how you manage and organize your data, especially if you have built a personal, scientific database. I would love to hear ideas, possible alternatives, or solutions that you’ve come up with to manage your torrent of data.
- [1] Note that I am arguing for the LONG format in the organization of data, not necessarily in the analysis of data.
- [2] Most programming languages have functions that can perform such a transformation, such as those in the reshape library in R, but they are a long way from being a practical, user-friendly pipeline for the average scientist. In fact, wrapping existing functions in intelligent wizards is both the most important and the most artistic aspect of building such a tool.