Paper of Note: Re-designing Distance Functions and Distance-Based Applications for High Dimensional Data

While it may be argued, and successfully so, that this article pertains to an esoteric subset of computer science that few people will ever find practically useful, you may actually find it quite intriguing. I found it eye-opening, and I’m certainly no computer scientist. What the article does, and what I feel is critically important for the progress of science and for its interaction with society, is unveil a new perspective on something we deal with on a day-to-day basis. What could be more mundane than distance?

It’s hard to imagine anyone who would not instantly recognize the following equation (1), and few would take issue with its natural extension below it (2).

(1)  $c = \sqrt{a^2 + b^2}$

(2)  $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$

The first equation is, of course, the Pythagorean theorem for the length of the hypotenuse of a right triangle. The second is the distance formula, something many of us (don’t) remember from geometry class. In virtually all instances when we talk about distance, we are thinking about equation 2 (give or take some modifications) or its result. This equation forms the basis of how we calculate distances between points of latitude and longitude, how we measure changes in stock prices and even how we talk about the weather[1].
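For the curious, here is a minimal sketch of the distance formula in Python. The function name `euclidean_distance` is my own choice for illustration, not something from the paper; the point is simply that the same sum-of-squared-differences recipe works for any number of coordinates, not just two.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Square each coordinate difference, add them up, take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


print(euclidean_distance((0, 0), (3, 4)))        # 5.0 (the classic 3-4-5 triangle)
print(euclidean_distance((1, 2, 3), (4, 6, 3)))  # 5.0 (same idea with a third coordinate)
```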

So where does this paper come in?

As it turns out, once you start dealing with more than 2 or 3 dimensions (think x, y, z), the concept of distance doesn’t work quite the same way. Many technological advances rely on being able to organize, summarize and group data into more manageable sets, a process called clustering or indexing[2]. Just imagine what it would be like if you couldn’t search Google for a term and instead had to visit page after page to find what you were looking for.

It would take you quite a long time to find any relevant page and it almost certainly wouldn’t be the one you needed. Instead, Google provides an invaluable service by grouping keywords and search terms in an almost magical way, don’t you agree? This sort of feat is essentially the same as finding the closest neighbor to a particular point (i.e. the closest websites to your search term are more relevant than the sites that are a great distance away). As it turns out, though, with high dimensional data (like search terms) the difference between the distance to the closest neighbor and the distance to the furthest neighbor becomes quite small. In essence, imagine searching for the weather and getting these results: a page about a German village, the stock price of Monsanto and a blog on how to resurface a patio. Those websites are not quite what you were looking for, but in many ways they are not ‘far’ from the pages you wanted.
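You can see this shrinking gap for yourself with a small experiment. The sketch below is my own illustration, not code from the paper: it scatters random points uniformly in a unit hypercube (an assumption made purely for convenience), measures the ordinary straight-line distance from one random query point to all the others, and reports how much further the furthest neighbor is than the nearest one, relative to the nearest. The helper name `relative_contrast` and the sample size of 200 points are arbitrary choices.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest for the distances from one random
    query point to n_points uniformly random points in the unit cube [0, 1]^dim."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))  # Euclidean distance (Python 3.8+)
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


# Compare a low-dimensional space with increasingly high-dimensional ones.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

With settings like these, the printed ratio should fall steadily as the number of dimensions grows: in 2 dimensions the furthest point is many times further away than the nearest, while in 1000 dimensions the two are nearly the same distance from the query.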

This is the curse of high dimensionality and it requires a new way of thinking. Mathematically, distance becomes less meaningful as the dimensionality becomes greater, and that is the point of this paper. The paper does a good job of explaining the impact and importance of having a good way to measure distance, but the best part, I found, was the way it challenged my own way of thinking. It took something as simple as distance and showed how limited my intuition and understanding actually were.

Science and society only progress when our preconceived ideas and everyday conventions get challenged. It forces us all to adapt and evolve to new ideas and new frontiers.

Citation

Aggarwal, C.C., Hinneburg, A. & Keim, D.A., 2001. Re-designing distance functions and distance-based applications for high dimensional data. Database Theory – ICDT 2001, 434(1), pp.13–18.


Notes

  1. Any time you fit a trend to data, distance is a significant factor in how that line is drawn.
  2. Yeah, I know this is a lie. These terms are fundamentally distinct.