What I think lurks behind the Big Data meme
Some buzzwords can be annoying. Regardless of what your opinion is about the Big Data meme, you'd be hard-pressed to ignore the massive mindshare the term has gained. I think our industry is in danger of getting lost in the rhetoric and chatter, instead of remaining laser-focused on the actual problems that need solving. I'd like to share my personal opinion on what I think deserves our attention, why I think it's important, and how I go about navigating Big Data conversations.
When it comes to Big Data definitions, I really enjoy Gartner's. It says Big Data "is the term adopted by the market to describe extreme information management and processing issues which exceed the capability of traditional information technology along one or multiple dimensions to support the use of the information assets." This statement is ambiguous enough that anybody with a problem they can't solve with what they have can feel they have a Big Data problem. And in my experience, this is mostly what ends up happening. Pain points about availability, resiliency, distribution, latency, and incremental aggregation are somehow bucketed under Big Data. And then you see people trying to figure out how they'll leverage a batch-oriented technique (which scales and distributes very easily) to deal with computations that are better served by a continuous event-processor, a well-tuned database, a non-relational indexed store, or a combination of all of them.
You see, I'm skeptical of the Big Data silver bullet. In fact, when a conversation starts gravitating towards a single-system solution, I feel we're failing to understand the problem(s) a customer may actually be trying to solve.
The way I see it, the term is not helping us have a discussion about the actual business and technology problems enterprises of all sizes are now worrying about. The size, cost, and volume of today's problems are actually making the bulk of enterprise customers face a new design challenge, in which they truly must develop distributed, heterogeneous solutions. To me, this is what truly lurks behind these conversations: businesses of not-so-massive size must now build a new class of systems for which they may be ill-equipped. How many companies can comfortably claim a decade or more of distributed systems know-how?
Try this: Next time you're in a Big Data conversation, substitute "Distributed System" for "Big Data". Chances are that by doing so you'll realize these problems are, fortunately, not new. You'll suddenly recognize that our field has decades of research into this space, ready to be invoked when appropriate. All of a sudden, you'll start weaving through the actual "why is this hard" questions, instead of the "what this could mean" discussions.
With this in mind, remember that economics, of course, plays an immensely important role here. It is unlikely that most companies will build their own datacenters. It is unlikely they will develop their own frameworks and implementations of the actor model to encapsulate the primitives needed to build this new class of systems. It is also unlikely they will find a turnkey solution for their specific problem. It is, however, very likely they will find platforms and ecosystems that provide most of the pieces required to run this new class of businesses. And that last part is where I think the discussions should be focused: what are those pieces? What are the tradeoffs? How do they integrate with existing investments? How do they scale out gracefully? And more importantly, how can a business use those pieces and start on a happy path from day one?
I know distributed systems programming is tough. I know it doesn't sound sexy. But I think that's the hard problem: the current set of business requirements is best served by this class of applications, which most of us are currently ill-equipped to write. To me, that's the real problem we should be tackling, instead of getting lost in the buzz.