Big Data Basics – Start Here

Bringing Big Data to the People (Part 1 of 6)

“Big Data” is being talked about more and more, but just what is Big Data? Big Data is a new capability and a new perspective, and it’s only just beginning to be implemented. Big Data is not just an IT thing; it’s not the Internet of Things or a passing catchphrase. Like the web and the Information Age before it, it changes the way we do everything and challenges long-standing paradigms.

Let’s Begin

The beginning is the 3Vs: Volume, Velocity, and Variety, although recently a fourth V has appeared as well – veracity. The diagram below is a quick overview of the 3Vs, and the next posts will dive into each of them, the fourth V, and eventually, the Dark Side of Big Data.

The Three Vs of Big Data

The 3Vs of Big Data (Source: Wipro)

What’s the Big Data Idea Blog

The “What’s the Big Data Idea” blog is dedicated to exploring the possibilities of Big Data. Some posts are about the “what is” and some are about the “what if”. Today, most Big Data applications are used by big corporations in ways you have probably seen but didn’t appreciate. When you learn how Big Data works, you can see how it already plays a part in your life. You also learn how you can use Big Data in your own business, whatever its size, and in your personal life.

The About pages cover the basics of Big Data, so you can use them for reference. The blog itself expounds on these concepts and brings you reports on the technology, tools, and players as Big Data develops.

Next Up – Phenom Volume

Bringing Big Data to the People (Part 2 of 6)

Big Data was originally defined by the 3Vs – volume, velocity, and variety – although a fourth V, veracity, has entered the mix of late. Big Data is more than just information collection on a pedestrian level. You probably get that there’s more data around than just 10 years ago, but perhaps you don’t realize how much more. That is VOLUME!

Phenom – Volume

Big Data is emergence in a chaotic world. Big Data is phenomenal because data has always been available to us, but for the first time in recorded history we have begun to “see” data as a whole rather than as a sample. In the Information Age we have become accustomed to data everywhere, but the proliferation of data collection is both exponential and relatively recent.

Increase in Volume of Data

Collection of data has been going on for centuries; however, in the past decade the volume has increased exponentially.

Underlying this accumulation is also the progression from analog to digital storage, which has consequences and capabilities of its own. The evolution from a digitized process (digital conversion of analog data) to a digital one (complete digital capture) is only just underway. Formerly, most information was collected by humans and converted into a data format (spreadsheet, database, etc.). Now Big Data reflects the flip to collection via machines and networks that capture human interaction in physical spaces (machine learning, telemetry, etc.) as well as digital ones (transactions, email, social networks, etc.).

Switch from Analog to Digital Data

Your personal trail of information is your digital exhaust, and like an old Chevy burning oil, by and large your information is visible to the world as you go about your journey.

Next up: Velocity – Just how fast are we going?

Bringing Big Data to the People (Part 3 of 6)

Processing – Velocity

Velocity is the accelerated rate at which data accumulates. Big Data isn’t just more data but more data, faster. For example, when you visit a website, the data trail isn’t purely transactional – how much of what you bought – it follows what you clicked on, how long you spent on a page, how long you spent on the site, and more. Now any off-the-shelf web analytics program offers this depth.

This streaming data feeds real-time (or near-real-time) feedback loops used to solve problems or provide products and services. That is critical when knowing what the environment was a year ago, a month ago, a day ago, or even a minute ago may be irrelevant to right now.
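To make that concrete, here’s a minimal sketch in Python of the kind of event stream a site might emit and a feedback loop reacting to it as it arrives. The field names (page, dwell_seconds, clicked) and the numbers are invented for illustration – they’re not any particular analytics vendor’s schema.

```python
from collections import deque

# Hypothetical clickstream events - the field names are illustrative only.
events = [
    {"user": "u1", "page": "/shoes", "dwell_seconds": 42, "clicked": "size-guide"},
    {"user": "u2", "page": "/shoes", "dwell_seconds": 3,  "clicked": None},
    {"user": "u1", "page": "/cart",  "dwell_seconds": 18, "clicked": "checkout"},
]

# Rolling window: only the most recent behavior matters, not last year's.
recent_dwell = deque(maxlen=100)

for event in events:  # in production this would be a live, never-ending stream
    recent_dwell.append(event["dwell_seconds"])
    avg_dwell = sum(recent_dwell) / len(recent_dwell)
    if avg_dwell < 5:  # visitors are bouncing right now
        print("Engagement is dropping - adjust the page in (near) real time.")
```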

When scientists first decoded the human genome in 2003, it took them a decade of intensive work to sequence the three billion base pairs. Now, a decade later, a single facility can sequence that much DNA in a day.

Behemoths such as Walmart and Google have been using Big Data for some time to bring you the goods and services you want (or didn’t even realize you wanted) because they had the resources to pursue its value. Within the past few years, though, that processing power has become a garage-startup capability. The cloud is cheap to rent.

Google processes more than 24 petabytes of data per day, a volume that is thousands of times the quantity of all printed material in the US Library of Congress.
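As a rough sanity check on that comparison: the Library of Congress’s printed collection is commonly estimated at around 10 terabytes. That estimate is an assumption here, not a figure from this post, but it shows the arithmetic behind “thousands of times”.

```python
# Rough arithmetic behind the "thousands of times" comparison.
# The ~10 TB size of the Library of Congress's printed collection is an
# assumed, commonly cited estimate - not a figure from this post.
google_per_day_tb = 24 * 1000        # 24 petabytes expressed in terabytes
library_of_congress_tb = 10          # assumed estimate
print(google_per_day_tb / library_of_congress_tb)  # -> 2400.0, i.e. thousands of times
```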

As the data population has grown, so has the processing capability to consume it. As far back as 2005, a cell phone (even one without a camera) had more processing power than NASA’s mission control during the Apollo flights that put men on the moon.

… in 1986 around 40% of the world’s general-purpose computing power took the form of pocket calculators, which represented more processing power than all personal computers at the time.

Next up – Variety, the spice of life  

Bringing Big Data to the People (Part 4 of 6)

Beyond Natural Selection – Variety

Data used to have to be carefully selected for processing, in both quantity and quality. Data was strictly formatted. At first, its gatekeepers were men in lab coats and pocket protectors (who eventually morphed into the IT guys).

The original Computer lab

As data became more prolific, it became more personal through the spreadsheets and databases that were possible on home computers via Lotus and Microsoft. Anyone with a PC and cheap software could learn the basic capabilities with a little effort. With a lot of effort, any PC could actually accomplish quite a bit with these tools (most users utilize less than 10% of any MS product’s abilities). Anyone who’s worked with a pivot table, or even just gotten the “!” trying to use a spreadsheet, understands the need to have the right format to manipulate the data.
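As a hedged illustration of that formatting point, here’s a small Python/pandas sketch. The table and its column names are made up; the point is that a pivot table only works once every row follows the same tidy layout, and a stray non-numeric cell is the spreadsheet-era “!” problem in miniature.

```python
import pandas as pd

# Tidy, consistently formatted rows - the structure a pivot table needs.
sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "product": ["Batteries", "Pop Tarts", "Batteries", "Pop Tarts"],
    "units":   [120, 300, 80, 150],
})

# Works because every row follows the same schema.
print(pd.pivot_table(sales, index="region", columns="product",
                     values="units", aggfunc="sum"))

# A mixed-type column is the "!" problem in miniature: the bad cell has to be
# dealt with before the data can be manipulated.
messy = pd.DataFrame({"units": [120, "three hundred", 80]})
print(pd.to_numeric(messy["units"], errors="coerce"))  # bad cell becomes NaN
```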

! the data error

Big Data is a lot more than a big MS tool. BD consumes all data – heterogeneously – words, images, audio, telemetry, transactions, scanned analog, legacy databases, and social media. The data must still be scrubbed, but BD ingests everything – an information jabberwocky of sorts.

more VARIETY in Big Data – even how it’s defined

This scrubbing process changes source data into application data, which can then be manipulated. The increase in variety, and the scrubbing it requires, has given rise to the fourth V – veracity – the uncertainty of data.
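Here’s a minimal sketch of that scrubbing step, with invented source records (a tweet, a scanned form, a transaction). Heterogeneous inputs get mapped onto one application-level shape, and fields that can’t be trusted are flagged rather than silently kept – which is exactly where the fourth V, veracity, shows up.

```python
from datetime import datetime

# Invented heterogeneous source records: text, scanned analog, transactional.
sources = [
    {"type": "tweet",        "text": "Stocking up before the storm!", "ts": "2023-08-29T14:02:00"},
    {"type": "scanned_form", "text": "qty: 12??",                     "ts": None},  # OCR noise, no timestamp
    {"type": "transaction",  "sku": "POPTART-12", "qty": 3,           "ts": "2023-08-29T15:30:00"},
]

def scrub(record):
    """Map a raw source record onto one application schema, flagging uncertainty."""
    ts = record.get("ts")
    return {
        "kind":      record["type"],
        "timestamp": datetime.fromisoformat(ts) if ts else None,
        "payload":   record.get("text") or record.get("sku"),
        "uncertain": ts is None or "?" in str(record.get("text", "")),  # the veracity flag
    }

application_data = [scrub(r) for r in sources]
for row in application_data:
    print(row)
```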

Bringing Big Data to the People (Part 5 of 6)

What Not Why – Not Your Mother’s Scientific Method

What Not Why is a mental shift that accompanies the 3Vs of Big Data. Big Data consumes great volumes of a variety of data and produces “what” the data is. Big Data tells you what is happening with the data, but not why. The “answer” Big Data gives is not “why” but “what.”

Walmart, Hurricanes & Pop Tarts

For example, Walmart has been a leader in data accumulation since before true Big Data emerged. Product placement is critical for profit margins. When Walmart began mining that data, one correlation they found was that prior to a hurricane, people stocked up not only on batteries – but also on Pop Tarts.

Unlike this Big Data example, in the traditional Scientific Method a hypothesis would be created, such as: when a hurricane is coming, people buy “________”. A specific, representative data sample would be calculated. The test would be run with a product and then repeated until a positive result (accept the hypothesis) indicated what was bought prior to a hurricane.
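For contrast, here’s a rough sketch of that classical route in Python. The sales numbers are invented; the point is that the hypothesis decides, up front, which single product even gets measured – any product you didn’t think to ask about stays invisible.

```python
from scipy import stats

# Invented daily unit sales for ONE pre-chosen product.
normal_weeks    = [41, 38, 45, 40, 43, 39, 42]
hurricane_weeks = [66, 71, 59, 74, 68, 70, 65]

# H0: pre-hurricane sales are no different from normal sales.
result = stats.ttest_ind(hurricane_weeks, normal_weeks)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}")

if result.pvalue < 0.05:
    print("Reject H0: this product sells more before a hurricane.")
else:
    print("No evidence for this product - pick another and repeat (trial and error).")
```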

This iterative process is Trial and Error. Whereas a data analyst finds answers to questions, data scientists manipulate the data to see what it tells them. The Scientific Method and hypothesis testing of data sets have required math – probability and statistics.

Big Data does not need a sample set of the correct data to prove or disprove an idea. As in the Walmart example, studying the entire data set provides a result without a preconceived notion of what the “answer” should be. Big Data scientists look for what the data tells them, not whether or not their hypothesis holds up.
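And here’s a hedged sketch of the Big Data route, again with invented numbers: no product is chosen in advance. Every product gets scored, and you simply read off what rises to the top – Pop Tarts included.

```python
# Invented average daily units sold per product: normal weeks vs. pre-hurricane days.
normal    = {"batteries": 40,  "pop_tarts": 55,  "milk": 200, "beer": 90,  "flashlights": 12}
pre_storm = {"batteries": 170, "pop_tarts": 180, "milk": 230, "beer": 110, "flashlights": 60}

# "Lift": how many times normal demand a product sees before a hurricane.
lift = {item: pre_storm[item] / normal[item] for item in normal}

# No hypothesis chosen up front - rank everything and see what the data says.
for item, ratio in sorted(lift.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{item:12s} {ratio:4.1f}x normal")
```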

Does Walmart know “why” people buy Pop Tarts before a hurricane?  Maybe or maybe not, but they do make sure to stock them near the front.

The Scientific Method and hypothesis testing of data sets have required math – so can we forget about probability and statistics now?

Bringing Big Data to the People (Part 6 of 6)

So can we forget probability and statistics?

No, but using Big Data does involve a shift in thinking on several points.

No Sample Set

The Scientific Method taught in school involves selecting what to test and framing a question and a hypothesis to test. Using probability and statistics, a sample, large or small, has been used to represent the data as a whole. Big Data doesn’t need that. It processes everything – not sample sets.

Accuracy

In the Scientific Method, an accurate sample set is needed to achieve more dependable results (hence confidence intervals). Probability and statistics are based upon using samples because, until recently, that was all our data-processing capability could handle.
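A small sketch of the difference, using a simulated data set: the sample-based route gives an estimate plus a confidence interval, while processing everything simply gives the answer. The numbers below are generated purely for illustration.

```python
import random
import statistics

random.seed(0)
# Simulated "entire" data set: one number per customer basket.
population = [random.gauss(50, 15) for _ in range(1_000_000)]

# Sample-based route: estimate the mean with a rough 95% confidence interval.
sample = random.sample(population, 1_000)
mean   = statistics.mean(sample)
sem    = statistics.stdev(sample) / len(sample) ** 0.5
print(f"sample estimate: {mean:.2f} +/- {1.96 * sem:.2f}")

# "N = all" route: with cheap processing, just compute over everything.
print(f"full population: {statistics.mean(population):.2f}")
```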

Evolution of data processing

That processing capability has become faster and cheaper. Along the same lines, today start-ups, small businesses, and others can utilize BD, not just Google and Walmart.

Bias 

The conventional hypothesis method also introduces bias via the experimenter’s questions. When restricted to choosing the correct question to ask, the experimenter loses all the possible solution sets available in the data. How many times have you been frustrated by not asking the right question when looking for an answer?

Big Data and visualization don’t just provide a cool new way to look at data; they present data in a way that would not have been seen with traditional Scientific Method techniques. This video is a brief explanation of how Big Data can remove personal bias through visualization – and that’s not just pretty pictures.
