Strategies for Exploiting Large-scale Data
In a guest post on the Cloudera blog, Bob Gourley[1] enumerates the characteristics of working with Big Data from the federal agencies' perspective.
I think these can be generalized to all businesses and problems that require big data:
Federal IT leaders are increasingly sharing lessons learned across agencies. But approaches vary from agency to agency.
For a long time each business worked in its own silo.
Yesterday, tools and algorithms represented the competitive advantage. Today the competitive advantage is in data. Sharing algorithms, experience, and ideas is safe.
federal thought leaders across all agencies are confronted with more data from more sources, and a need for more powerful analytic capabilities
If you are not confronted with this problem, it is probably just because you haven't realized it yet. If you think single sources of data are good enough, your business might be at risk.
Large-scale distributed analysis over large data sets is often expected to return results almost instantly.
Name a single manager, business, or problem solver that wouldn't like to get immediate answers.
Most agencies face challenges that involve combining multiple data sets — some structured, some complex — in order to answer mission questions.
increasingly seeking automated tools, more advanced models and means of leveraging commodity hardware and open source software to conduct distributed analysis over distributed data stores
Ditto
considering ways of enhancing the ability of citizens to contribute to government understanding by use of crowd-sourcing type models
In his Strata talk, Werner Vogels mentioned using Amazon Mechanical Turk to add human-based processing for data control, data validation and correction, and data enrichment.
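As a rough illustration of that idea, here is a minimal Python sketch of how a Mechanical Turk validation task could be constructed: it renders an HTMLQuestion payload asking workers to check one data record. The template fields and the sample record are assumptions for illustration, not anything from the post or the talk.

```python
# Hypothetical sketch: building an Amazon Mechanical Turk HTMLQuestion
# payload that asks workers to validate one scraped data record.
# The form fields and sample value below are illustrative assumptions.
import html

HTML_QUESTION_TEMPLATE = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
<!DOCTYPE html>
<html><body>
  <form name="mturk_form" method="post" action="https://www.mturk.com/mturk/externalSubmit">
    <p>Is the company name below spelled correctly?</p>
    <p><b>{value}</b></p>
    <input type="radio" name="valid" value="yes"/> Yes
    <input type="radio" name="valid" value="no"/> No
    <input type="text" name="correction" placeholder="Correction, if any"/>
    <input type="submit"/>
  </form>
</body></html>
]]></HTMLContent>
  <FrameHeight>300</FrameHeight>
</HTMLQuestion>"""

def build_validation_question(value: str) -> str:
    """Render the HIT question XML for one record to validate."""
    return HTML_QUESTION_TEMPLATE.format(value=html.escape(value))

question_xml = build_validation_question("Clouderra Inc.")

# With boto3, this payload would be submitted along the lines of:
# boto3.client("mturk").create_hit(
#     Title="Validate a company name", Description="Check spelling",
#     Reward="0.05", MaxAssignments=3,
#     AssignmentDurationInSeconds=300, LifetimeInSeconds=86400,
#     Question=question_xml)
```

Asking several workers per record (`MaxAssignments` above) and taking a majority vote is the usual way such human-based pipelines guard against individual worker errors.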
-
[1] Bob Gourley: editor of CTOvision.com and a former Defense Intelligence Agency (DIA) CTO, @bobgourley
Original title and link: Strategies for Exploiting Large-scale Data (NoSQL databases © myNoSQL)