Computer Science R Programming Science

Retrieving data from the bottom of the ocean: OCR for ROV

When advising students about their career goals, paths forward, and expectations; I often recommend that they consider learning a program language. While the language itself depends on the goals and background of the person, being able to work directly with data is powerful. Today’s project is a good example of just such a case.

A colleague approached me to see if I could help pull data off some video files since the original data had been lost. For context, these video files were collected as part of a deep-sea survey of marine organisms with many hours of footage collected by a remotely operated vehicle (ROV). Attached to the ROV are a number of sensors besides the camera–such as pressure, temperature, and salinity; and it was some of those sensor feeds that are lost to time. Luckily for us, the data feeds were directly overlaid into the feed feed. So all we need to do is pull some numbers from the video, how hard could that be?

Here is a short clip from one of the deployments.

I figured this should only really entail three steps:

  1. Crop out the text boxes from each video frame
  2. Use optical character recgonition (OCR) to read the text
  3. Parse, organize, and otherwise filter the text data into a useful file such as a spreadsheet

For this project I’m using my daily-driver programming language, R, but you could use almost any language for this.

To pull out the video frames and perform the cropping, I decided to use ffmpeg, an open source and ubiquitous tool for all things video. In this first snippet I set the path to my source video, create a temporary directory to store the output frames and crops, and then call ffmpeg from the command line using the R system() command:

R
in.file = '00001.MTS'
tmp.dir = paste0(tempdir(), '\\ocr\\')

if (!dir.exists(tmp.dir)) {
  dir.create(tmp.dir)
}

## Altimeter
system(paste0('ffmpeg -i "', in.file, '" -filter:v "crop=117:43:1769:39" "', tmp.dir, 'altimeter%05d.png"'))
system(paste0('ffmpeg -i "', in.file, '" -filter:v "crop=117:43:1769:111" "', tmp.dir, 'depth%05d.png"'))
system(paste0('ffmpeg -i "', in.file, '" -filter:v "crop=81:34:1770:197" "', tmp.dir, 'temperature%05d.png"'))
# Pressure
system(paste0('ffmpeg -i "', in.file, '" -filter:v "crop=177:37:1710:268" "', tmp.dir, 'pressure%05d.png"'))
system(paste0('ffmpeg -i "', in.file, '" -filter:v "crop=147:38:1737:344" "', tmp.dir, 'salinity%05d.png"'))
system(paste0('ffmpeg -i "', in.file, '" -filter:v "crop=150:32:1717:988" "', tmp.dir, 'time%05d.png"'))
system(paste0('ffmpeg -i "', in.file, '" -filter:v "crop=100:28:1773:1019" "', tmp.dir, 'date%05d.png"'))

As you can see, my approach was to crop out several small portions of the full frame, one for each data field. I did this because OCR is often more than a bit fickle, and I wanted the text output to be as clean and reliable as possible. By providing only small subsections of the frame I avoid all the issues associated with noise elsewhere in the frame and changes in text size across the frame.

Here are the regions I used to generate the crop filter for ffmpeg above.

The ffmpeg crop filter is very easy to use and is simply: crop=width:height:left:top

The output then looked like this

Now we can apply OCR to each crop. This turns out to be very easy in R due to the tesseract package. I first find all the altimeter files, setup a dataframe to save the information in. Then simply loop through each frame worth of crops, performing OCR, and saving the results as we go.

R
frame = list.files(tmp.dir, pattern = 'altimeter', full.names = F, recursive = F)

result = data.frame(file = frame, frame = NA, altimeter = NA, depth = NA, temperature = NA, pressure = NA, salinity = NA, time = NA, date = NA)
eng = tesseract("eng")

for (i in 1:nrow(result)) {
  result$altimeter[i] = tesseract::ocr(paste0(tmp.dir, result$file[i]), engine = eng)
  result$depth[i] = tesseract::ocr(paste0(tmp.dir, '/', gsub('altimeter', 'depth', result$file[i])), engine = eng)
  result$temperature[i] = tesseract::ocr(paste0(tmp.dir, '/', gsub('altimeter', 'temperature', result$file[i])), engine = eng)
  result$pressure[i] = tesseract::ocr(paste0(tmp.dir, '/', gsub('altimeter', 'pressure', result$file[i])), engine = eng)
  result$salinity[i] = tesseract::ocr(paste0(tmp.dir, '/', gsub('altimeter', 'salinity', result$file[i])), engine = eng)
  result$time[i] = tesseract::ocr(paste0(tmp.dir, '/', gsub('altimeter', 'time', result$file[i])), engine = eng)
  result$date[i] = tesseract::ocr(paste0(tmp.dir, '/', gsub('altimeter', 'date', result$file[i])), engine = eng)
}

This worked flawlessly and only took about 8 minutes for over 9,000 frames worth of crops.

Last but not least, I cleaned up the raw text and binned the data into something useful (in this case we don’t need data for every frame since virtually none of the data feeds change more than about once per second). So instead of ~9,000 data points (30 fps x 300 s), we end up with just ~300 datapoints.

R
result$altimeter = gsub('m\n', '', result$altimeter)
result$depth = gsub('m\n', '', result$depth)
result$temperature = gsub('\n', '', result$temperature)
result$pressure = gsub('dbar\n', '', result$pressure)
result$salinity = gsub('psu\n', '', result$salinity)
result$time = gsub('\n', '', result$time)
result$date = gsub('\n', '', result$date)
result$datetime = as.POSIXct(strptime(paste(result$time, result$date), format = '%I:%M:%S %p %d %b %y'), tz = 'UTC')

result$altimeter = as.numeric(result$altimeter)
result$depth = as.numeric(result$depth)
result$temperature = as.numeric(result$temperature)
result$pressure = as.numeric(result$pressure)
result$salinity = as.numeric(result$salinity)
result$time = NULL
result$date = NULL
result$file = NULL
result$frame = NULL
result$file = NULL

result = result[order(result$datetime),]

## Genearte final output file (by second)
times = unique(result$datetime)
output = data.frame(datetime = times,
                    altimeter = approx(result$datetime, result$altimeter, xout = times, ties = median)$y,
                    depth = approx(result$datetime, result$depth, xout = times, ties = median)$y,
                    temperature = approx(result$datetime, result$temperature, xout = times, ties = median)$y,
                    pressure = approx(result$datetime, result$pressure, xout = times, ties = median)$y,
                    salinity = approx(result$datetime, result$salinity, xout = times, ties = median)$y)

openxlsx::write.xlsx(output, file = 'example.xlsx')

It’s worth stating in case it’s not immediately obvious, none of this process is particularly optimized, and the code might not be well organized overall. But it does work will and is easy to maintain and troubleshoot. Overall this was a great project to learn about the tesseract package.

Ultimately this falls into one of those categories where

  • A. the task is not all that difficult: each frame could be analyzed by hand
  • B. cannot be easily adapted to other tools such as excel or goggle maps
  • C. does not scale quickly enough for a brute force attach to be much fun

Tasks that fulfill all three criteria are great examples of why you need to learn a programing language. I think there’s a common misconception that programing languages are best suited to solve problems that are very difficult, whereas most problems we face are pretty simple and thus “I don’t need to learn a programing language”. The reality is that most problems are simple, and in most cases programing languages can make easy problems easier!

Back To Top