10 gotchas of Big Data

We’re being inundated with messages about the future of Big Data. We’re almost halfway through 2012 and we can rest assured that Big Data is the buzz phrase for the year. Hype can be good, as it helps to focus the marketplace on great new ways of doing business even before the results are in. But hype can be bad because it can lead to cynicism when it comes too early or doesn’t play out as expected. Big Data has the potential to play out either way. We need to invest thought in how to make this paradigm work well without excessive risk.

I was asked recently what cautions I would offer to those feeling over-hyped and under-informed. The truth is that Big Data, even done well, can mean a challenging implementation. It would be easy to apply technology and process in ways that aren’t optimal. With that in mind, these are my 10 gotchas of Big Data:

  • Sizing – Many problems that appear to be Big Data challenges are really Medium Data problems that can be solved more easily by other technologies. An example would be loyalty data for a retailer with a single-digit number of stores. Its transactional data may seem large, but it doesn’t reach truly Big Data scale.
  • Lifecycle – Great data management means understanding when, why and how data arrives and ensuring that it stays fresh. Many systems move data through multiple transformations and clean up using business rules. If these moving pieces aren’t meshing accurately or in the right time frames, people won’t trust the system to give quality answers. Systems that aren’t trusted aren’t used.
  • Latency – Movement of data can take more time than processing if network bandwidth and I/O aren’t adequately factored into the solution. This problem is amplified in the era of cloud computing. If data movement takes 24-48 hours, your hourly report won’t hold much value.
  • Neglect – Too often, data formats and source systems change but a Big Data system that pulls that data isn’t kept up to date. Is your process robust enough to pick up change automatically? If not, bad results are inevitable. Look no further than many ERP systems to see this in play.
  • Security – We’re in a frenzy to get data into Big Data environments while the tools to restrict access to subsets of data are still primitive. Can people generate reports on data they shouldn’t see? There is no product on the horizon to solve this, and the only current solution is to bake it into your data design. If you think this isn’t a big problem, just wait until the first well-publicized Hadoop data security lapse happens.
  • Skills – Does it take a quasi-engineer, a ‘mad conductor’, to operate a Big Data environment? High levels of dependence on specialized skills become an operational risk. Are you willing to bet everything on the strengths (and weaknesses) of very few people?
  • Tolerance – The system needs to tolerate some level of errors and can’t stop simply because a few records aren’t good. Decisions about what to ignore, what’s good enough, and what needs outlier processing are key. Sometimes a blank page on a report is preferable to stopping the output of a 100-page report.
  • Complexity – With all of the moving parts, it is very easy to build a system that can’t be operated. Perfect answers could require a level of complexity that builds in failure points and production downtime. Users will quickly tire of repeated outages.
  • Big Process – Do you have a channel that can get a promotion or coupon into the system quickly enough? Organizations need to be able to ‘pull the lever’ quickly enough for Big Data recommendations to matter.
  • Reliability – If data will cause people to do things differently and money to be spent or withheld, it has to be very carefully considered. A few poor choices will create a level of cynicism that may not allow the organization to wait around to get things right.
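The latency caution above can be made concrete with a back-of-the-envelope transfer-time calculation. This is a minimal sketch; the data size and link speed are illustrative assumptions, and real transfers carry protocol overhead on top of this best case:

```python
def transfer_hours(data_terabytes: float, link_gbps: float) -> float:
    """Estimate hours to move a dataset over a network link.

    Ignores protocol overhead and contention, so real transfers
    will usually be slower than this best-case figure.
    """
    bits = data_terabytes * 1e12 * 8       # decimal terabytes -> bits
    seconds = bits / (link_gbps * 1e9)     # link speed in bits per second
    return seconds / 3600

# Moving 10 TB over a dedicated 1 Gbps link takes nearly a full day
# at best -- far too slow to feed an hourly report.
print(round(transfer_hours(10, 1.0), 1))  # ~22.2 hours
```

If the arithmetic says the pipe alone eats most of your reporting window, the architecture, not the processing engine, is the problem.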
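The tolerance caution above (keep producing output despite a few bad records) can be sketched as a loop that skips unparseable rows and only aborts past an error-rate threshold. A minimal sketch, assuming a simple "id,amount" record format; the threshold and field names are hypothetical:

```python
def process_records(lines, max_error_rate=0.05):
    """Parse comma-separated "id,amount" records, tolerating bad rows.

    Bad records are skipped and counted; the run only fails if the
    overall error rate exceeds the threshold.
    """
    good, bad = [], 0
    for line in lines:
        try:
            rec_id, amount = line.split(",")
            good.append((rec_id.strip(), float(amount)))
        except ValueError:
            bad += 1  # skip the bad record rather than halting the job
    total = len(good) + bad
    if total and bad / total > max_error_rate:
        raise RuntimeError(f"{bad}/{total} records bad; aborting")
    return good, bad

rows = ["a1,10.5", "a2,oops", "a3,3.0", "garbage",
        *[f"r{i},1.0" for i in range(40)]]
good, bad = process_records(rows)
print(len(good), bad)  # 42 good records, 2 skipped
```

The key design decision is where the threshold sits: low enough that a broken feed gets noticed, high enough that a handful of malformed rows never blanks the whole 100-page report.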

The only way to avoid the gotchas is to take a thoughtful approach to Big Data. Rather than letting the hype drive faulty decision making, take the time to assess where you are and what you want to accomplish, and then go about doing it wisely. We’ve found success by recommending a short period of time, typically two weeks, at the beginning to work through the potential gotchas and come up with the best plan possible to maximize the value of the Big Data project.