One of the greatest advantages of utilizing Big Data is the Data Lake – the repository for all things data, large or small.
First of all, the Data Lake can store all the legacy databases an organization can’t seem to shake. There’s usually some information storage that holds the company ransom. To that, you add new data forms and capabilities – even streaming data.
To combine and use this broad variety of source information, some amount of “cleaning” or rearranging of that data has to happen. ETL (Extract, Transform, Load) prepares data for basic storage, but let’s talk about something more.
Your Mother Doesn’t Work Here
In the Big Data world this cleaning is called data munging (or wrangling). Data wrangling does conjure rather accurately the task of roping a large or small but unwieldy live animal and forcing it into restraints. Munging seems to be the more prevalent term, perhaps because it’s a cool made-up word that sounds like the munching only the Cookie Monster could evoke, although one reference suggested it was an acronym.
Munging is more strategic than ETL because it looks ahead to the desired end state. Think of it like roommates moving in: each person retains his or her personality, but certain house rules are necessary for everyone to live together.
The world according to Wikipedia:
Data munging or data wrangling is loosely the process of manually converting or mapping data from one “raw” form into another format that allows for more convenient consumption of the data with the help of semi-automated tools. This may include further munging, data visualization, data aggregation, training a statistical model, as well as many other potential uses. Data munging as a process typically follows a set of general steps which begin with extracting the data in a raw form from the data source, “munging” the raw data using algorithms (e.g. sorting) or parsing the data into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use.
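To make those general steps concrete, here is a minimal sketch of extract, munge, and deposit using only the Python standard library. Everything in it is an assumption for illustration: the sample CSV, the column names, and the function names `extract`, `munge`, and `load` are made up, and an in-memory buffer stands in for a real data sink.

```python
import csv
import io
import json

# Hypothetical raw export; the column names and values are made up for illustration.
RAW_CSV = """name,signup_date,spend
  Alice ,2021-03-01,10.50
bob,2021-01-15,7
Carol,2020-12-30,12.25
"""


def extract(raw_text):
    """Read the raw source into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def munge(rows):
    """Parse each row into a predefined structure, then sort by signup date."""
    cleaned = [
        {
            "name": row["name"].strip().title(),  # tidy stray spaces and casing
            "signup_date": row["signup_date"],    # ISO dates sort correctly as text
            "spend": float(row["spend"]),         # string -> number
        }
        for row in rows
    ]
    return sorted(cleaned, key=lambda r: r["signup_date"])


def load(rows, sink):
    """Deposit the result into a sink (here, any writable file-like object)."""
    json.dump(rows, sink, indent=2)


sink = io.StringIO()  # stand-in for a file, database, or data lake location
munged = munge(extract(RAW_CSV))
load(munged, sink)
```

In a real project each step would be far messier, but the shape stays the same: pull the raw data, reshape it, and put it somewhere useful.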
Apples & Oranges Can Make a Good Fruit Salad
Data munging can be a simple(r) task of matching a new dataset to an existing one. It can also be matching two or more data sets together. Or it can be a Herculean effort of conforming multiple sources in multiple formats that were never created to fit together – until the Data Scientist makes it so.
Here’s a quick list of the tasks associated with munging, in simple terms, for those of us who aren’t so cool or data intelligent.
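The simple(r) case, matching a new dataset to an existing one, might look something like the sketch below. The two datasets, their field names, and the helpers `normalize_key` and `match` are hypothetical; the point is only that keys from different sources rarely line up until someone normalizes them.

```python
# Hypothetical datasets; the keys and fields are invented for illustration.
existing = [
    {"customer_id": "C001", "name": "Alice"},
    {"customer_id": "C002", "name": "Bob"},
]
incoming = [
    {"cust": "c001 ", "region": "West"},  # same customer, sloppier key
    {"cust": "C003", "region": "East"},   # no counterpart in the existing data
]


def normalize_key(value):
    """Make keys comparable across sources: trim whitespace, uppercase."""
    return value.strip().upper()


def match(existing_rows, incoming_rows):
    """Attach incoming fields to existing rows; report unmatched incoming keys."""
    by_key = {normalize_key(r["cust"]): r for r in incoming_rows}
    merged = []
    for row in existing_rows:
        extra = by_key.pop(normalize_key(row["customer_id"]), None)
        merged.append({**row, "region": extra["region"] if extra else None})
    # Whatever is left in by_key never found a home in the existing dataset.
    return merged, sorted(by_key)


merged, unmatched = match(existing, incoming)
```

Here `merged` keeps every existing customer (with `region` set to `None` where nothing matched) and `unmatched` lists the incoming keys that need a human decision.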
- Enrich the data
- Apply a Macro
- Find Pattern (Regular Expression)
- Transform Data Types
- Handle Missing Data
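A few of these tasks can be sketched together in a handful of lines: finding a pattern with a regular expression, transforming the captured text into real data types, and enriching each record with a derived field. The log-style lines, the month/day/year date format, the field names, and the 1000 threshold are all assumptions made up for this example.

```python
import re
from datetime import datetime

# Hypothetical free-text records; the format is an assumption for illustration.
raw_lines = [
    "order 1001 on 03/14/2021 total $1,250.00",
    "order 1002 on 11/02/2021 total $89.99",
    "note: customer called back",
]

ORDER_PATTERN = re.compile(
    r"order (?P<order_id>\d+) on (?P<date>\d{2}/\d{2}/\d{4}) "
    r"total \$(?P<total>[\d,]+\.\d{2})"
)


def parse_line(line):
    """Find the pattern, then transform the captured strings into real types."""
    found = ORDER_PATTERN.search(line)
    if found is None:
        return None  # the line does not look like an order
    total = float(found["total"].replace(",", ""))
    return {
        "order_id": int(found["order_id"]),
        # Assumes month/day/year; a different source may need a different format.
        "date": datetime.strptime(found["date"], "%m/%d/%Y").date(),
        "total": total,
        # Enrichment: a derived field that was not in the raw data.
        "large_order": total >= 1000,
    }


orders = [parsed for parsed in map(parse_line, raw_lines) if parsed is not None]
```

The line that doesn't match the pattern is simply skipped here; in practice you would probably want to keep it somewhere and look at why it didn't fit.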
The last task, missing data handling, is the most intriguing. Like the odd sock sucked into the dryer vortex or the peek-a-boo your computer plays with specific emails that used to be there(!) and can’t be found, most data sets have missing data items. That aspect alone could fill another post, but it presents an easy-to-understand concern about Big Data and data lakes.
It seems too good to be true – placing all these disparate silos in one place and getting to use them for a collective picture. Finding out there are holes in your data might be unnerving; however, this has always been the case. In the old-school days of Small Data, samples were used to approximate the content and capability of a data set, which brought a confidence factor with it. A sample is only a statistical proportion of the data, so you never see the missing pieces. Big Data absorbs all the data; it knows the gaps.
So the answer is yes, it is possible to effectively engage the different data offspring of your organization, BUT cleaning data takes resources. Data Scientists report it takes 50-70% of a project’s time and often requires multiple returns to the “sources” themselves – the people who used the data. All data – even Big Data – always comes back to the creators.
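As a small taste of what handling missing data can involve, the sketch below first counts the gaps per field and then applies one possible policy: fill missing ages with the mean of the known ages and flag the rows that were filled. The records, the field names, and the choice of mean-filling are assumptions for illustration, not a recommendation; the right policy depends on the data and the question.

```python
# Hypothetical records; None and empty strings both stand for "missing" here.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "", "age": None},
    {"id": 3, "email": None, "age": 51},
]


def is_missing(value):
    """Treat None and the empty string as missing."""
    return value is None or value == ""


def missing_report(rows):
    """Count missing values per field so the gaps are visible before any fix."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts[field] = counts.get(field, 0) + (1 if is_missing(value) else 0)
    return counts


def fill_age_with_mean(rows):
    """One possible policy: fill missing ages with the mean of known ages, and flag it."""
    known = [r["age"] for r in rows if not is_missing(r["age"])]
    mean_age = sum(known) / len(known)
    return [
        {
            **r,
            "age": mean_age if is_missing(r["age"]) else r["age"],
            "age_imputed": is_missing(r["age"]),  # keep a record of what was filled
        }
        for r in rows
    ]


report = missing_report(records)
filled = fill_age_with_mean(records)
```

Keeping the `age_imputed` flag means nobody downstream mistakes a filled-in value for a real one.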
It All Works Out in the End
The post-script is that munging is not a binary decision. Like the sock drawer or your kids’ rooms, it need not be completely organized at every moment; instead, it can be clean “enough” to do what you want (or what you can live with). “Enough” is a measured decision between leadership and the data scientist doing the munging.