Big Data: Generation Next

The following was first published in Analytics Magazine. Dr. Vijay Mehrotra is an associate professor, Department of Finance and Quantitative Analytics, School of Business and Professional Studies, University of San Francisco.

We have all been hearing about both “the analytics revolution” and “the rise of Big Data” forever, or so it seems. I credit the book “Competing on Analytics” by Thomas H. Davenport and Jeanne G. Harris [Harvard Business School Press, 2007] with making “analytics” part of the mainstream business lexicon. Similarly, the McKinsey Global Institute (MGI) report entitled “Big Data: The next frontier for innovation, competition and productivity,” released in May 2011, has had the same effect for the term “Big Data.”

This MGI report formally defined Big Data as “datasets whose size is beyond the ability of typical database software tools to capture, manage and analyze,” while also identifying several vertical industries and classes of applications that can be improved by intelligent use of data for better decision-making, innovation and competitive advantage. In fact, many of the broad themes presented in this report echo the ideas presented by Davenport and Harris in “Competing on Analytics.” As such, over the past year, it has become natural to think of “analytics” and “Big Data” as being virtually synonymous with one another.

I caught up with Davenport by phone a couple of weeks ago. He was in the midst of a study on the human side of Big Data sponsored by SAS Institute and EMC/Green Plum, and he was kind enough to share some of his findings with me. Over the past few months, he had interviewed a large number of data scientists who were working in Big Data roles in an effort to understand who they are, where they are working and what they are working on. I found some of his observations insightful and others more surprising.

The data scientists whom Davenport had spoken with had academic backgrounds in many different disciplines including physics, mathematics, computer science, statistics and operations research, as well as less obvious ones such as meteorology, ecology and several social science fields. Almost all had Ph.D.s, and in many cases their research had been a catalyst for the development of their deep data skills (Davenport cited one recent Ph.D. cohort of seven applied ecology students, of whom six had launched careers in Big Data, rather than academia, after finishing graduate school).

More surprising, however, was Davenport’s observation that “very few large companies are going to bother with ‘first generation’ data scientists.” While pointing to General Electric as a notable exception, he noted that the vast majority of the data scientists he had found worked at platform companies such as Facebook, Twitter, Google, Yahoo and LinkedIn, or at startups such as Splunk [1] that see exciting entrepreneurial opportunities [2] in creating tools to enable more efficient access, visualization and mining of large streams of data from multiple sources.

“Data management seems to dominate the world of Big Data right now,” Davenport explained. “There’s a huge focus on visualization and reporting among the data scientists I talked to. The statisticians are a little bit frustrated … One of the quips I heard was, ‘Big Data = Little Math.’ ”

His conclusion: today, data-driven managerial decision-making still relies almost exclusively on small-to-medium sized datasets stored in traditional data structures.

I heard some of these same themes at the recent INFORMS Analytics Conference, most notably in a panel discussion on “Innovation and Big Data.” The panelists included Diego Klabjan (Northwestern University), Thomas Olavson (Google), Blake Johnson (Stanford University), Daniel Graham (Teradata) and Michael Zeller (Zementis, Inc).

Very early in the discussion, the panelists all agreed that there’s a huge amount of confusion about what is actually happening in this space today, and that this confusion is being amped up by the massive amount of hype about Big Data (a recent Google search on “Big Data” returns a cool 1,350,000,000 entries, and a quick query on Google Insights for Search reveals that the number of people searching on this term has grown exponentially in the past year [3]). However, as Northwestern’s Klabjan bluntly stated, “OK, with Hadoop we know how to store Big Data. But doing analytics on top of Big Data? We have a long way to go.”

The discussion often touched on the “volume, velocity and variety” [4] of today’s data and the accompanying high level of complexity that leads to a variety of challenges in extracting value from it. Teradata’s Graham acknowledged these risks explicitly when he encouraged executives in the audience to (in the words of Tom Peters) “fail forward fast,” while Google’s Olavson urged the audience to not get so caught up in the complexity of the data challenges and the power of the data management solutions that the key business problems slip out of sight.

The panelists often came back around to the human side of Big Data. Zementis’ Zeller envisioned a future in which the work done by the data scientist of today is broken up into a variety of emerging roles such as data technician and data analyst, while Stanford’s Johnson suggested that the democratization of data would create a need for a quality assurance function for not only the expanding mounds of data but also for the analytic models built on top of it. And Olavson’s final comment was that with or without Big Data, analytics is ultimately about enabling smart people to use data and tools to create business value.

Which brings me back to my earlier conversation with Davenport. At several points in our discussion, he drew a clear distinction between the data scientists of today and the “second generation” of tomorrow. Based on his research, Davenport anticipates that “as more and better data management tools come to market, less software development will be needed to work with Big Data.” In this world, a combination of large, unstructured data management skills and analytic modeling capabilities will be a powerful combination.
It will, I suspect, be here before we know it.

Vijay Mehrotra, senior INFORMS member and chair of the ORMS Today and Analytics Committee for INFORMS, is an associate professor, Department of Finance and Quantitative Analytics, School of Business and Professional Studies, University of San Francisco. He is also an experienced analytics consultant and entrepreneur and an angel investor in several successful analytics companies.


  1. To read about Splunk’s recent successful IPO, see
  2. See for example
  3. See
  4. The three Vs are a popular foundation for Big Data – for more background on this, see

Big data’s big requirements

There is no shortage of information on how to use parts of the most common big data solutions, like Hadoop. But what about the other pieces of the puzzle necessary to get real business value from this technology? For starters, there is a need to make decisions around:

  • Mobile strategy and its support
  • Web delivery
  • User interactivity/experience
  • Data support and operations
  • Security
  • Storage (for both big data and traditional SQL)
  • Scalability
  • Revenue generation models
  • Filtering knowledge and noise
  • Integration into existing applications and processes

For these areas, far less information is available, yet the need is just as great.

Fragmented solutions

In reality, there isn’t a single application development platform that covers a full solution. There are instead many choices, each having tradeoffs in usability and scalability.


Also, some solutions have already come and gone in the short time big data has been in vogue. The question arises, “How does one know what will be around and still supported in two years’ time?” Predicting the future popularity and support of the many available tools is a significant challenge.


Open Source is an excellent way to ramp up quickly and cheaply, but the solutions aren’t necessarily mature enough to meet market requirements. As things stand today, it would be easy to get a few months into developing a solution before a particular tool’s shortcomings become apparent. A great example: basic features like multi-language support are missing from some of the common solutions, and some lack authentication capabilities.

The UI

Lastly, user interfaces are often an afterthought. Less investment has been made in UI technologies in the haste to bring back-end capabilities to market. Avoiding these problems involves having enough knowledge of the space to make sound choices.

Big Data means broad solutions to complex problems. There are enormous opportunities ahead for those who consider the ecosystem beyond the big names, like Hadoop. 

Big Data for the Rest of Us

The following was first posted on Harvard Business Review.

9:31 AM Tuesday May 29, 2012
by Chris Taylor

The hype around Big Data is growing to deafening proportions, fueled by the prospect that tools now exist that can let small businesses reap the benefits that companies like Google, Amazon, and Facebook so obviously enjoy from mining vast quantities of all sorts of data.

But is that so? The answer is well, yes, kind of — though probably not as simply and easily as many vendors might like you to think. Small businesses are now testing the waters, and their early experience is already shedding light on what challenges the rest of us need to consider before taking the plunge.

One such company is VinoEno, a San Francisco-based wine-recommendation start-up founded by Kevin Bersofsky. The wine industry, Bersofsky says, has been ruled by small data for the longest time. People have had very little to go on to work out whether a bottle of wine is good or not — really just the opinion of a handful of wine reviewers, whose descriptions may or may not be tacked up on the racks in the local liquor store. “You can’t get much smaller — or more flawed — than one person giving a score, the Wine Spectator model, that drives an entire industry.” Such limited data, Bersofsky knew, made people very uncomfortable spending even $20 on a bottle.

Bersofsky wants to give consumers a better way to decide what wine to buy — a recommendation engine that can match people’s various tastes to the myriad attributes of various wines. He’s envisioned that he could market such a personalized wine-recommendation engine to wine retailers, to be placed in a kiosk in a store or accessed from an iPad, a personal mobile device, or as a plug-in from the retailer’s Web site.

So the VinoEno staffers have set out to build a system to collect and combine wine sensory attribute data and consumer preferences data, to determine a consumer-specific recommendation. Ultimately, they could see, the project would potentially require the collection of massive amounts of data, much iterative work to develop effective recommendation algorithms, some way to validate consumer preferences, and a lot of experimentation to develop a rewarding end-user experience. A big challenge, and they soon found out that they needed help.

  • The first challenge was to find talented people who know Big Data and analytics. Google might be able to attract armies of top-flight data analytics people for large-scale number crunching, but chances are that a small business is not going to be able to build its own analytics solution, at least not without help from a data analytics vendor. After unsuccessfully trying to hire in a very talent-constrained market, VinoEno quickly turned to an outside provider, Fabless Labs. In selecting Fabless, VinoEno was looking for an experienced partner that was not too set in its ways — one that was willing to experiment with solutions more suitable for small operations and not be unduly influenced by approaches that had worked for its deeper-pocketed clients.
  • The second challenge was to decide what tools to use. Here, of course, VinoEno’s people were clueless, depending on Fabless Labs to know which big-data collection and analysis tools would be powerful enough to handle the volume, velocity, and variety of data they’d be working with but also simple enough for them, as non-techy business users, to employ and maintain on their own. “We needed to be able to create and use data that didn’t exist, which was both exciting and scary,” says Bersofsky. VinoEno also depended on its vendor to teach its small, non-technical staff how to handle all that data — about how to implement a cloud strategy, how to move data efficiently, how many data points would make a mathematical model work, data cleanliness requirements, and how to test market the concept.
  • The third challenge was to decide what types of information matter. What kind of information is worth the cost of collecting it? Should VinoEno be trying to match customers to various attributes of wine? Should it be trying to keep track of which groups of people buy what kinds of wine? Should it include Wine Spectator information? Even though it was the competition, could it be dismissed? In such uncharted territory, the questions simply multiplied. “To be truthful,” Bersofsky says, “we’re still working out what will ultimately solve our problem, and only trial and error will tell us. With online content, you can watch 30 movies per month and build data points quickly. With wine, the work has never been done before.”
  • The fourth challenge was to remain open-minded. As VinoEno’s staffers developed their application, they had to learn to avoid thinking they already knew the answers. As an early example, they had always believed that people’s flavor preferences would correlate with a wine’s many attributes. But the analytical results suggested that only a far smaller set of attributes matters (along with the absence of negative attributes like astringency or burning). It was hard to accept that so many traditional attributes like oakiness or fruitiness really don’t figure into people’s buying decisions at all. But if they really were going to learn something new, they simply had to let go of old ideas. After all, Bersofsky points out, “How many times has someone found a radical conclusion on the way to looking for something else? When 3M invented the Post-It Note, they were looking for something entirely different. If we stay glued to our conviction, we’ll miss other sign posts along the road.”
  • The fifth challenge was to spot the finish line. VinoEno’s founders needed to define what overall success for the recommendation app really meant. This turned out to be more of an art than a science, a combination of trial and error and gut feel for what a good recommendation would eventually be for an individual consumer. “We had no way to validate the result and no one to confirm the recommendation,” Bersofsky explains. “We decided the secret to making the answer valid was to promote the result as the best possible answer available based on the trial data, especially to the sensory scientists on the team who struggled to buy in.”

Wine consumers will be pleased to know that the first generation of the engine is now complete. With Fabless Labs’ help, VinoEno got it up in just three months and for far less than the estimated $250,000 it would have cost to hire a dedicated team. VinoEno is currently test-marketing VinSpin, the first iteration of the engine, and the proof will be in the pudding — in this case, the consumers’ perception of the value of a recommendation.

That’s one company’s story. I’d be interested to hear from others. Have you had a different experience with big data? What would be your recommendations to those just starting down this path?

Big Data beyond the hype cycle

Big Data is more than hot. It is one of the most talked-about phenomena of the past year and will continue to be a hot topic going forward. Just like social media, there is enormous pressure on organizations to get into the Big Data game, and quickly. Beyond the excitement and anxiety, there are reasons you should slow down and think about what you want to do.


Environments are complex, requiring organizations to seek technology that is plug-and-play and can stand up easily in diverse infrastructures. The current technology market could be described as many tools that are equally complex to understand, deploy and use. Standing them up has nuances that anyone considering a Big Data solution should understand first. The nuances fall into three categories: resources, timelines and tools.


Resources

Most new data analytics technologies were created for developers and require Java skills or SQL experience. The traditional data scientist who understands data modeling, on the other hand, doesn’t come from a coding background and can’t access the data they’d like to analyze. The data integration skills lie on one side of the technical fence and the data knowledge on the other. I guess you could say data scientists are from Mars, developers are from Venus.


Timelines

Data science is a challenging field. Data scientists are used to writing algorithms for others to develop, test and implement. The traditional cycle for doing that was six months or more in most industries. This waterfall approach is methodical but takes too long to stand up. The world can change in six months. Time to market is both a barrier to getting started and a competitive differentiator if you can shorten it.


Tools

Pieces of the solution exist. First and foremost, there is Hadoop, the premier product for distributed computing, which involves shuffling jobs between servers to run large-scale analytics. Hadoop solves the problem of storage and parallel processing in an elegant way. While Hadoop is the rallying point for Big Data, by itself it isn’t a solution. It sometimes seems like one because when data gets large, there is nothing that can replace Hadoop. There’s a real expectation gap, however, between the engine that is Hadoop and the drive train required to do useful things with Big Data.
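Hadoop’s programming model is MapReduce, and the pattern itself is easy to see in plain Python. This toy word count mimics the map, shuffle and reduce phases that Hadoop distributes across servers; it is a single-process sketch of the pattern, not Hadoop code:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map phase: each "mapper" turns a line into (word, 1) pairs.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle phase: group all pairs by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: collapse one key's values to a single result.
    return key, sum(values)

lines = ["big data big tools", "big data little math"]
mapped = chain.from_iterable(mapper(l) for l in lines)
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts["big"])  # → 3
```

The hard part in practice is everything around this pattern: running the phases across a cluster, moving data between them and surviving node failures, which is exactly the drive train the paragraph above refers to.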

So what have companies done to address these issues? That’s another story.

10 gotchas of Big Data

We’re being inundated with messages about the future of Big Data. We’re almost halfway through 2012 and we can rest assured that Big Data is the buzz phrase for the year. Hype can be good, as it helps to focus the marketplace on great new ways of doing business even before the results are in. But hype can be bad because it can lead to cynicism when it comes too early or doesn’t play out as expected. Big Data has the potential to play out either way. We need to invest thought in how to make this paradigm work well without excessive risk.

I was asked recently what cautions I would offer to those feeling over-hyped and under-informed. The truth is that even Big Data done well can be a challenging implementation, and it would be easy to apply technology and process in ways that aren’t optimal. With that in mind, these are my 10 cautions of Big Data:

  • Sizing – Many problems that appear to be Big Data challenges are really Medium Data problems that can be solved more easily by other technologies. An example would be loyalty data for a retailer with a single-digit number of stores. Transactional data may seem large, but doesn’t scale to be truly Big Data.
  • Lifecycle – Great data management means understanding when, why and how data arrives and ensuring that it stays fresh. Many systems move data through multiple transformations and clean up using business rules. If these moving pieces aren’t meshing accurately or in the right time frames, people won’t trust the system to give quality answers. Systems that aren’t trusted aren’t used.
  • Latency – Movement of data can take more time than processing if network bandwidth and I/O aren’t adequately factored into the solution. This amplifies in the era of Cloud computing. If data movement takes 24-48 hours, your hourly report won’t hold much value.
  • Neglect – Too often, data formats and source systems change but a Big Data system that pulls that data isn’t kept up to date. Is your process robust enough to pick up change automatically? If not, bad results are inevitable. Look no further than many ERP systems to see this in play.
  • Security – We’re in a frenzy to get data into Big Data environments while the tools to restrict access to subsets of data are still primitive. Can people generate reports on data they shouldn’t see? There is no product on the horizon to solve this, and the only current solution is to bake it into your data design. If you think this isn’t a big problem, just wait until the first well-publicized Hadoop data security lapse happens.
  • Skills – Does it take a quasi-engineer or ‘mad conductor’ to operate a Big Data environment? High levels of dependence on specialized skills become an operational risk. Are you willing to bet everything on the strengths (and weaknesses) of very few people?
  • Tolerance – The system needs to tolerate some level of errors and can’t stop simply because a few records aren’t good. Decisions about what to ignore, what’s good enough, and what needs outlier processing are key. Sometimes a blank page on a report is preferable to stopping the output of a 100-page report.
  • Complexity – With all of the moving parts, it is very easy to build a system that can’t be operated. Perfect answers could require a level of complexity that builds in failure points and production downtime. Users will quickly tire of repeated outages.
  • Big Process – Do you have a channel that can get a promotion or coupon into the system quickly enough? Organizations need to be able to ‘pull the lever’ fast enough for Big Data recommendations to matter.
  • Reliability – If data will cause people to do things differently and money to be spent or withheld, it has to be very carefully considered. A few poor choices will create a level of cynicism that may not allow the organization to wait around to get things right.
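The latency caution above is easy to quantify with back-of-envelope arithmetic: transfer time is data size divided by effective bandwidth. A quick sketch (the 70 percent link-efficiency figure is an assumption for illustration):

```python
def transfer_hours(size_tb, link_gbps, efficiency=0.7):
    """Rough time to move size_tb terabytes over a link_gbps link."""
    bits = size_tb * 8 * 1e12                     # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600

# Moving 10 TB over a 1 Gbps link at ~70% efficiency:
print(round(transfer_hours(10, 1), 1))  # → 31.7 (hours)
```

A day-plus transfer window is exactly why an hourly report fed from remote data can quietly become worthless.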

The only way to avoid the gotchas is to take a thoughtful approach to Big Data. Rather than letting the hype drive faulty decision-making, take the time to assess where you are and what you want to accomplish, and then go about doing it wisely. We’ve found success by recommending a short period, typically two weeks, at the beginning to go through the potential gotchas and come up with the best possible plan to maximize the Big Data project.

Big Data is the greenest data of all

Green is the new black…or so you’d think from the incredible amount of focus on efficient energy production and consumption. With so much emphasis on building a green planet and increased government and utility spending on energy efficiency, one would expect a reduction in total energy spend in commercial and residential markets. It hasn’t happened.

Despite the recent push to implement energy efficiency programs, total energy consumption and average energy prices in both markets continue to rise. Consumers continue to face higher energy bills because average energy prices are rising faster than customer demand is falling. These indicators might make you think we’re going in the wrong direction, but it is actually too early to say all of this effort isn’t working.

Give it time

The reason for optimism? Smart meters and a variety of sensors across the energy supply chain are creating the ability to collect and analyze massive amounts of energy production, transmission and consumption data. The arrival of Hadoop and other Big Data tools makes it possible for analysis to keep up with rapidly increasing data volumes. All of that means nothing if it isn’t actionable. Let’s take a look at just a couple of ways that Big Data can be Green Data.

  • Forecasting demand – We’re slowly moving toward homes and businesses having smart meters that report actual consumption back to utilities and allow decisions on how to supply energy more efficiently. Right now, those meters provide data in 15-minute increments, but down the road we can increase the frequency of reporting as the ability to collect and crunch data increases. When we know more, we make better energy supply decisions.
  • Conservation ‘signals’ – To make markets behave efficiently, there needs to be a way for energy use to change with availability. The most common way this is done is through price. Once energy providers can accurately forecast demand and in-the-moment use, pricing signals will cause consumers, individual or commercial, to lower usage. This proactive approach is mostly missing from today’s energy markets. When we know more, we make more efficient usage decisions.
  • Measuring efficiency – The US Department of Energy is currently developing the SEED database, meant to be a way to allow buildings to benchmark against other facilities. While that may not seem so hard, there are big factors that need to be part of any efficiency algorithm, like weather, the number of people, what types of machines are being operated and when. Once buildings have an ‘energy value’, decisions can be made on how much real estate is worth, where to retrofit and how to design new facilities. When we know more, we can make better design decisions.
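The demand-forecasting idea above can be illustrated with the simplest possible model: average each 15-minute slot across prior days. A sketch with made-up meter readings (real utility forecasting would also account for weather, weekday effects and much more):

```python
# Hypothetical smart-meter readings: kWh per 15-minute slot, one list per day.
history = [
    [1.2, 1.1, 0.9, 1.4],  # day 1
    [1.0, 1.3, 1.1, 1.2],  # day 2
    [1.1, 1.2, 1.0, 1.3],  # day 3
]

def forecast_next_day(history):
    """Naive seasonal forecast: average each 15-minute slot across days."""
    days = len(history)
    return [round(sum(day[i] for day in history) / days, 2)
            for i in range(len(history[0]))]

print(forecast_next_day(history))  # → [1.1, 1.2, 1.0, 1.3]
```

The Big Data part arrives when millions of meters report every 15 minutes: the model stays simple, but the volume and velocity of the input do not.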

Change needs to happen

This is all great and, in theory, will make our planet better for everyone. There are still a few things, however, that stand in the way. There need to be better standards for how information is collected, stored and shared so that energy supply chains can be better analyzed and operated optimally. We also need to make sure energy providers aren’t reaping excessive benefit from more efficient consumption without passing that benefit to energy consumers, which would ‘mute’ the signals that drive positive change. Balancing the right amount of regulation with allowing market forces to operate is an age-old challenge.

Assuming we’ll overcome these challenges, where things go is wide open. There are gamification possibilities (who’s the most green in the neighborhood/city/region?) and countless other strategies on the horizon. There’s no doubt Big Data will be driving us toward a greener planet.

When the answer is Small Data

Big Data is advertised as the secret to unlocking actionable intelligence. Collecting and sifting through vast amounts of data finds the patterns that change everything. But is elusive ‘data in combination’ the answer that we should expect from analytics? Not necessarily.

More and more often, crunching large amounts of data gets to the opposite result: The answer to many questions is found in far less data than expected. Looking at what’s being answered by large-scale analytics today, the patterns that are emerging often show surprising results like:

  • A clothing retailer discovers that fit matters more than color, or vice versa
  • A wine recommendation engine proves that color matters more than most other attributes, but only when a customer is an occasional wine drinker
  • Only the three most recent transactions show a customer’s preferences and not their composite shopping history
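Findings like these are what a simple per-feature check can surface: compute each attribute’s correlation with the outcome and see whether one dominates. A toy sketch on synthetic data (the numbers are invented for illustration):

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# Synthetic retail example: does fit or color predict a repeat purchase?
fit   = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3]   # how well the garment fit (0-1)
color = [0.5, 0.1, 0.9, 0.4, 0.6, 0.8]   # how much the buyer liked the color
buy   = [1,   1,   0,   0,   1,   0]     # repeat purchase (1) or not (0)

for name, col in [("fit", fit), ("color", color)]:
    print(name, round(pearson(col, buy), 2))
```

In this invented data, fit correlates strongly with repurchase while color barely does: a Small Data answer, but one you only trust after running the check on a big enough sample.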

Does that mean that Big Data itself is an overreaching goal for organizations? No. To understand that only a few factors matter, large data sets need to be created and analyzed. A Small Data answer still requires validation through data that often has volume, velocity and variety. Knowing for sure that Small Data is the answer is just as tricky, and maybe more so.

If we’re not careful, assuming complexity can blind us to the fact that simplicity is the real answer.

The key to today’s Big Data capabilities is to have an open mind and be ready for the answer that you don’t expect.

The fun of not knowing the answer

The best part of the current startup landscape is that we have no way of knowing what will and won’t work. In fact, the situation is the same for established organizations. Between social, mobile, cloud and an Internet that now reaches billions of people, there is enormous change on the horizon.

We know from recent history that seemingly crazy ideas will break through and what seems like a safe bet will go nowhere. That’s the beauty and terror of the rapid changes we’re seeing.

Given this uncertainty, how does a small startup go from ‘nowhere’ to ‘now here’? (A Love Guru reference, for non-movie-buffs.) How does an established company shift to meet a changing world?

Stay nimble

The first idea can often be just the precursor to the breakthrough. Look no further than Flickr, which set out to build photo sharing as part of a game. What they stumbled upon with photo sharing dwarfed the original plan in both creativity and financial value. What matters most about this story is that the founders were willing to see the market for their ‘accidental product’ and change course.

Nimble companies change direction when the cues dictate.

Fail fast, fail cheaply

The ability to get to a great idea can require several attempts at products or services that may not work out. There are countless stories of inventors who found success on their 10,000th attempt, but that’s not the point. Get ideas out quickly and as painlessly as possible so that the good ones surface sooner. The longer an idea takes to develop, the more costly and risky it becomes. We cherish the things that have taken our biggest investment, our ‘babies’, which can easily blind us to whether that investment was a good idea or not.

While on the topic: reward those who fail fast, and don’t punish willingness to try out an idea; otherwise you’ll drive away your innovators.

Focus on the important things

What matters most is that the idea has market value and that you have the people to realize the vision. To that end, build a smart, creative team and avoid turnover. The longer you work to solve a problem together, the better you’ll get at it. The team will become experts at moving an idea from inception to market and will get faster and better each time.

Unless you’re one of the few who has unlimited funding (and therefore, time) and a first, perfectly conceived idea, your moves will need to follow these patterns to be successful.

Sure, there’s lots more advice about how to create or change your business. I would argue that this is the core of the problem…this is the hard stuff.