Interesting Read on Hadoop Math

  • January 20, 2010
  • Scott

Ok ok, what could possibly be interesting about Hadoop-based systems and Mathematics?

Well, it sounds fancier to say Hadoop-based system… but really this basic math applies to any batch-oriented system. Those of us who have been writing batch processing solutions now and then for the last 20 years should at least be intuitively aware of the math presented in this article, if not consciously thinking about it at design time.

The key equation:

Runtime = Overhead / (1 - {Time to process one hour of data})

Or, stated differently:

Runtime = Overhead + {Time to process one hour of data} * {Hours of data}

Here, hours of data and runtime are equal: each run has to process the data that arrived during the previous run. These equations help explain why a perfectly healthy batch processing system can suddenly fall tragically behind. As the time to process an hour’s worth of new information creeps toward an hour, the runtime balloons; once it reaches an hour or more, you have a problem, and the problem will just keep getting worse until you:

  • Improve the runtime of the algorithm
  • Apply more resources to your server / cluster
  • Filter the incoming data better (if possible) to improve your signal to noise ratio and thereby eliminate unnecessary data processing
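To make the two equations concrete, here is a minimal Python sketch. It is not code from the article; the function names, parameter names, and sample numbers are all assumptions made up for illustration. The first function solves the equation for the steady-state runtime, and the second replays a sequence of runs where each run must process the data that arrived during the previous one.

```python
def steady_state_runtime(overhead_hours, hours_per_data_hour):
    """Solve Runtime = Overhead + p * Runtime for Runtime.

    p (hours_per_data_hour) is the time needed to process one hour of data.
    Returns None when p >= 1: no steady state exists and the backlog grows.
    """
    if hours_per_data_hour >= 1:
        return None
    return overhead_hours / (1 - hours_per_data_hour)


def simulate_runs(overhead_hours, hours_per_data_hour, first_batch_hours, runs):
    """Replay successive batch runs and return the runtime of each one."""
    runtimes = []
    data_hours = first_batch_hours
    for _ in range(runs):
        runtime = overhead_hours + hours_per_data_hour * data_hours
        runtimes.append(runtime)
        # The next run has to cover everything that arrived while this one ran.
        data_hours = runtime
    return runtimes


if __name__ == "__main__":
    # Hypothetical numbers: 30 minutes of overhead per run.
    print(steady_state_runtime(0.5, 0.5))   # healthy: settles at a fixed runtime
    print(simulate_runs(0.5, 1.5, 1.0, 5))  # unhealthy: every run is longer than the last
```

With the hypothetical numbers above, a system that needs half an hour per hour of data settles at a one-hour runtime, while one that needs an hour and a half per hour of data never settles: each run takes longer than the one before it.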

In the BPM and CEP worlds, that third bullet is often the key to improving performance. It doesn’t require more hardware; it just requires you to move your filter “upstream”, from your BPM infrastructure to your EAI or ESB infrastructure, or from your EAI/ESB infrastructure to the source of the noise. Some would say this is squeezing the balloon, moving the bottleneck elsewhere. But filtering better upstream may actually make those systems more efficient as well: if generating the payload and calling out to a web service or ESB requires cycles, then skipping that work thanks to an efficient filter may well reduce processing cycles.
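As a rough sketch of why filtering helps, the snippet below plugs a filter into the runtime equation. It assumes, purely for illustration, that processing cost scales linearly with the volume of data that reaches the batch system and that the filter itself is free; the function names and numbers are hypothetical.

```python
def filtered_rate(hours_per_data_hour, noise_fraction):
    """Time to process one hour of data after dropping the noise upstream.

    Assumes cost is proportional to the volume that survives the filter.
    """
    if not 0 <= noise_fraction <= 1:
        raise ValueError("noise_fraction must be between 0 and 1")
    return hours_per_data_hour * (1 - noise_fraction)


def runtime_with_filter(overhead_hours, hours_per_data_hour, noise_fraction):
    """Steady-state runtime with an upstream filter, or None if still falling behind."""
    rate = filtered_rate(hours_per_data_hour, noise_fraction)
    if rate >= 1:
        return None
    return overhead_hours / (1 - rate)


if __name__ == "__main__":
    # Hypothetical system: 1.5 hours of work per hour of raw data, 0.5 hours of overhead.
    print(runtime_with_filter(0.5, 1.5, 0.0))  # no filter: cannot keep up
    print(runtime_with_filter(0.5, 1.5, 0.5))  # half the data is noise and gets dropped
```

Under these assumptions, a system that cannot keep up with the raw feed reaches a stable runtime once half of the incoming data is filtered out, with no extra hardware.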

At any rate, it’s a good read.  Andrew Paier, I figured you especially would get a kick out of this article, given our experience back in 2003…
