As organizations strive to analyze more data than ever and to do it faster than ever, the results they’re getting might actually be worse than those in the pre-big-data and real-time world — at least temporarily. During a panel discussion of “master data wranglers,” a major topic of conversation was the trade-off between analyzing lots of data fast and taking adequate time to drive real, meaningful results.
According to Abe Taha of Karmasphere, this problem exists in large part because storing and analyzing data has become so much cheaper, meaning organizations can actually take advantage of what they’re producing. He points to Hadoop, especially, as having democratized big data analysis. Or, as Metamarkets Co-Founder Michael Driscoll puts it, organizations are suffering from the “attack of the exponentials,” which means that storage and bandwidth have gotten exponentially cheaper, making it feasible to tackle all this new analysis. Ideally, he said, the benefits of analyzing the data outweigh the cost of analyzing it. With all that data now on hand and analyzable, said Glenn McDonald of ITA Software, “the expectation is that you’ll use it.”
But herein lies the problem, says Theo Schlossnagle of OmniTI. We’re listening to more things, he explained, but we’re not listening any smarter. In fact, he thinks the signal-to-noise ratio is very low, often resulting in worse decisions. However, he added, this might just be a matter of growing pains as organizations learn how to do big data optimally. Until that point, he said, the question is whether timeliness outweighs correctness.
Driscoll calls this situation “analysis paralysis,” citing the example of the CIA suffering through a decade of weakened analytics efforts before finally figuring it out. Very likely, he said, it could get worse before it gets better.
McDonald sees a value in the push to real-time analysis, though, even immediately: it’s much easier to figure out the right questions to ask. If it takes 14 hours to rerun an analysis because some factor was weighted incorrectly or something else went wrong, it’s very difficult to learn from your mistakes. If you can “get in” the data, explore and figure out what’s going on, it’s a lot easier to refine your algorithms, he said. Users must be able to interact with the data.
One thing seemingly everyone agreed on, however, is that we will figure it out, thanks in large part to the same trend that enabled big data: cheap infrastructure. Through options like cloud computing (Driscoll’s company stores about half a petabyte in S3), organizations can afford to take chances they might not otherwise take if it meant spending large amounts on server and storage infrastructure.
I think these can be generalized to all businesses and problems that require big data:
Federal IT leaders are increasingly sharing lessons learned across agencies. But approaches vary from agency to agency.
For a long time each business worked in its own silo.
Yesterday, tools and algorithms represented the competitive advantage. Today the competitive advantage is in data. Sharing algorithms, experience, and ideas is safe.
Federal thought leaders across all agencies are confronted with more data from more sources, and a need for more powerful analytic capabilities.
If you are not confronted with this problem it is just because you didn’t realize it. If you think single sources of data are good enough, your business might be at risk.
Large-scale distributed analysis over large data sets is often expected to return results almost instantly.
Name a single manager or a business or a problem solver that wouldn’t like to get immediate answers.
Most agencies face challenges that involve combining multiple data sets — some structured, some complex — in order to answer mission questions.
increasingly seeking automated tools, more advanced models and means of leveraging commodity hardware and open source software to conduct distributed analysis over distributed data stores
Ditto
considering ways of enhancing the ability of citizens to contribute to government understanding by use of crowd-sourcing type models
Werner Vogels mentioned in his Strata talk using Amazon Mechanical Turk for adding human-based processing for data control, data validation and correction, and data enrichment.
Bob Gourley: editor of CTOvision.com and a former Defense Intelligence Agency (DIA) CTO, @bobgourley
Chris Boorman is the chief marketing officer and senior vice president of education & enablement at Informatica. He is responsible for Informatica’s global voice to the market, which includes corporate, partner and field marketing.
The thinking about social media in corporate marketing departments is rapidly evolving. Initially, social media was seen as yet another broadcast opportunity for pushing messages out into the world, and for many companies that view persists. A social media consultant recently said that even today, when he approaches potential clients for the first time, they typically refer him to their PR agency, because “they handle Facebook for us.”
There’s nothing wrong with using social media as a tool for disseminating marketing messages or trying to establish deeper relationships with current or potential customers. However, there is another use of social media which may prove to be more powerful over the long term: listening to the voice of the customer by data mining social networks.
Currently, CRM systems create customer profiles to help with marketing decisions using a combination of demographics and prior behavior, primarily historical buying patterns. These systems essentially enable companies to see their customers in the rearview mirror.
The customer data available via online communities like Facebook is both richer and more forward looking. A financial organization with access to such data would not only know that a customer had a checking account, savings account, two CDs and a mortgage, but also that the same customer was interested in golf or gourmet cooking — information that could be useful in planning future marketing initiatives. Every minute of every day, Facebook, Twitter and other online communities generate enormous amounts of this data. If it could be tapped, it could function like a real-time CRM system, continually revealing new trends and opportunities. Here’s how.
Tapping Social Media Data
The good news is that with today’s technology, this data can be tapped. But the process is not without its challenges. The data stream is a prime example of “Big Data.” Dealing with data sets measured in petabytes is a challenge in itself, and there is a serious problem with the signal-to-noise ratio. At my company, we estimate that at best, only 20% of the social media data stream contains relevant information. But before this problem even arises, companies face the issue of identifying their customers among the millions of participants in any given online community.
The Problem of Customer Identity
Most companies approach the problem of finding customers on social sites through the slow, arduous and expensive process of participating themselves. On Facebook, for example, businesses can gain access to the profiles of anyone who clicks the “Like” button on the company’s business site (depending on each customer’s privacy settings). With the right pitch, offer or game, companies can gradually gain an enhanced understanding of a subset of their social customer base.
With new matching technology that’s now available, the process is faster and more comprehensive. For example, matching technology uses artificial intelligence to figure out whether a given “John Smith” in a company’s customer database is the same individual as a particular John Smith on Facebook. The algorithms that accomplish this are extremely sophisticated, and they work. In fact, matching technology has been successfully used by law enforcement agencies to locate criminals.
If a company has one or two key pieces of information about its customers — e-mail address is often the most important — that company can accurately identify them on a social site and extract a substantial amount of data, including both profile data and transactional data that can reveal relationships important for marketing purposes. (Again, the amount of data available for any given customer depends on that customer’s personal privacy settings.)
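As a rough illustration of the matching idea described above, the following Python sketch pairs a CRM record with a social profile, first by normalized e-mail address and then, failing that, by name similarity. It is a deliberately simple heuristic, not the artificial-intelligence matching technology the article refers to. The record fields, the `match_customer` function and the 0.85 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize_email(email):
    """Lowercase and trim an e-mail address so trivial variants compare equal."""
    return email.strip().lower()


def name_similarity(a, b):
    """Return a 0..1 similarity score between two names (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_customer(customer, profiles, name_threshold=0.85):
    """Return the best-matching social profile for a CRM customer, or None.

    An exact e-mail match wins outright. Otherwise fall back to the most
    similar name, provided it clears the (arbitrary, assumed) threshold.
    """
    email = normalize_email(customer["email"])
    for profile in profiles:
        if profile.get("email") and normalize_email(profile["email"]) == email:
            return profile

    best, best_score = None, 0.0
    for profile in profiles:
        score = name_similarity(customer["name"], profile["name"])
        if score > best_score:
            best, best_score = profile, score
    return best if best_score >= name_threshold else None


# Hypothetical data: one CRM customer and three social profiles.
customer = {"name": "John Smith", "email": "John.Smith@example.com "}
profiles = [
    {"id": 1, "name": "Jon Smith", "email": None},
    {"id": 2, "name": "John Smith", "email": "john.smith@example.com"},
    {"id": 3, "name": "Jane Doe", "email": "jane@example.com"},
]
matched = match_customer(customer, profiles)
```

Checking the e-mail address first reflects the article's point that it is often the most important identifying field. The name comparison is only a fallback and would produce false matches on common names in any real data set.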
Putting Data to Work
The second problem with social media is transforming data that is potentially useful into data that is actually useful. Social media data is generated by an entirely different technology stack than the transactional data that typically feeds CRM systems. Accordingly, it is stored in entirely different formats. That data can be transformed into a useful format with Master Data Management (MDM) technology.
MDM is the process of managing business-critical data, also known as master data (about customers, products, employees, suppliers, etc.) on an ongoing basis, creating and maintaining it as the system of record for the enterprise. MDM is implemented in order to ensure that the master data is validated as correct, consistent, and complete.
MDM has been used for more than a decade by companies that want to integrate disparate databases for a 360-degree view of their customers (or product portfolios, for that matter). It is equally effective in integrating social media data into existing CRM systems, and filtering that data for relevance.
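To make the integration step a little more concrete, here is a minimal Python sketch of merging a CRM row and a social profile into a single master record, followed by a basic completeness check. It does not represent any particular MDM product. The field names, `to_master_record`, `missing_fields` and `REQUIRED_FIELDS` are hypothetical and exist only for illustration.

```python
# Fields the example treats as mandatory for a complete master record
# (an assumption for illustration).
REQUIRED_FIELDS = ("customer_id", "name", "email")


def to_master_record(crm_row, social_profile):
    """Merge a CRM row and a social profile into one consistent record."""
    return {
        "customer_id": crm_row["customer_id"],
        "name": crm_row["name"].strip().title(),
        "email": crm_row["email"].strip().lower(),
        # De-duplicate and sort so the record is stable across runs.
        "products": sorted(set(crm_row.get("products", []))),
        "interests": sorted(
            {i.strip().lower() for i in social_profile.get("interests", [])}
        ),
    }


def missing_fields(record):
    """Return the required fields that are empty or absent."""
    return [f for f in REQUIRED_FIELDS if not record.get(f)]


# Hypothetical inputs in two different shapes.
crm_row = {
    "customer_id": 42,
    "name": "john smith ",
    "email": "John.Smith@Example.com",
    "products": ["checking", "mortgage", "checking"],
}
social_profile = {"interests": ["Golf", "gourmet cooking ", "golf"]}

master = to_master_record(crm_row, social_profile)
incomplete = missing_fields(master)
```

The normalization and de-duplication steps stand in for the "consistent" and "correct" goals mentioned above, and `missing_fields` stands in for "complete". A production system would apply far richer rules.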
What this all means is that companies can achieve important process improvements with bottom-line significance. For example, they can:
Obtain behavioral data that will allow them to more appropriately target segments for better marketing results.
Obtain data on personal preferences and interests to move closer to a true one-to-one relationship with their customers.
The disciplined use of demographic and historical customer data has enabled large numbers of companies to substantially increase the effectiveness of their marketing campaigns. Social media data will enable marketers to take targeting to the next level. It’s Big Data, but today’s technology can handle it.
Park Kieun (CUBRID Cluster Architect) gives an introduction to 4 large scale database technologies:
Massively Parallel Processing (MPP) or parallel DBMS – A system that parallelizes the query execution of a DBMS, and splits queries and allocates them to multiple DBMS nodes in order to process massive amounts of data concurrently.
Column-oriented database – A system that stores the values of the same field together as a column, as opposed to the conventional row method that stores them as individual records.
Examples: Vertica, Sybase IQ, MonetDB
Stream processing (ESP or CEP) – A system that processes a continuous stream of data (or events), or a concept in which the content of a database is continuously changing over time.
Examples: Truviso
Key-value storage (with MapReduce programming model) – A storage system that focuses on enhancing the performance when reading a single record by adopting the key-value data model, which is simpler than the relational data model.
Examples: many of the NoSQL databases covered here.
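The last item in the list above, key-value storage with the MapReduce programming model, can be sketched in a few lines of Python. This is a single-process toy meant only to show the shape of the idea: an in-memory dictionary stands in for the store, and the map, shuffle and reduce phases run sequentially rather than across nodes. All names and the sample data are assumptions for the example.

```python
from collections import defaultdict

# A dictionary stands in for the key-value store: one key, one record.
store = {
    "user:1": {"name": "Ann", "country": "US"},
    "user:2": {"name": "Bo", "country": "KR"},
    "user:3": {"name": "Cy", "country": "US"},
}


def get(key):
    """Read a single record by key, the operation this model optimizes."""
    return store.get(key)


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted pairs."""
    pairs = []
    for key, value in records.items():
        pairs.extend(mapper(key, value))
    return pairs


def shuffle(pairs):
    """Group the emitted values by their intermediate key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each group of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def country_mapper(key, value):
    """Emit (country, 1) for each user record."""
    yield (value["country"], 1)


def count_reducer(key, values):
    """Sum the ones emitted for a country."""
    return sum(values)


counts = reduce_phase(shuffle(map_phase(store, country_mapper)), count_reducer)
```

Single-record reads go through `get`, while questions that span many records, such as users per country, are expressed as a mapper and a reducer instead of a relational query.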