What is Data Munging?
One of the advantages of Big Data is the Data Lake. The Data Lake stores all the legacy databases in addition to new data storage and even streaming data. In order to combine and use this broad variety of source information, some amount of “cleaning” or rearranging of that data has to happen. Think of it like roommates moving in: each person retains his or her personality, but certain house rules are necessary for everyone to live together.
Your Mother Doesn’t Work Here
In the Big Data world this cleaning is called data munging (or wrangling). Data wrangling conjures, rather accurately, the task of roping an unwieldy live animal, large or small, and forcing it into restraints. Munging seems to be the more prevalent term, perhaps because it’s a cool made-up word that sounds like the munching only the Cookie Monster could evoke, although one reference suggested it was an acronym.
The world according to Wikipedia:
Data munging or data wrangling is loosely the process of manually converting or mapping data from one “raw” form into another format that allows for more convenient consumption of the data with the help of semi-automated tools. This may include further munging, data visualization, data aggregation, training a statistical model, as well as many other potential uses. Data munging as a process typically follows a set of general steps which begin with extracting the data in a raw form from the data source, “munging” the raw data using algorithms (e.g. sorting) or parsing the data into predefined data structures, and finally depositing the resulting content into a data sink for storage and future use.
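The extract, munge, and deposit steps in that definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample data, the function names (extract, munge, load), and the use of an in-memory buffer as the "data sink" are all made up for this post, not part of any particular tool.

```python
import csv
import io
import json

# Hypothetical raw data; a real pipeline would read a file, database, or stream.
RAW = "name,signup\nbob,2021-03-04\nalice,2020-11-30\n"


def extract(raw_text):
    """Parse raw CSV text into a predefined structure: a list of dicts."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def munge(rows):
    """Apply a simple algorithm (sorting) to the parsed rows."""
    # ISO-formatted dates sort correctly as plain strings.
    return sorted(rows, key=lambda row: row["signup"])


def load(rows, sink):
    """Deposit the result into a sink; here, any writable text stream."""
    json.dump(rows, sink)


sink = io.StringIO()  # stands in for a file or a database table
load(munge(extract(RAW)), sink)
result = json.loads(sink.getvalue())
```

Swapping the in-memory buffer for an open file would leave the three steps unchanged, which is most of the appeal of keeping them separate.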
Data munging can be the simple(r) task of matching a new dataset to an existing one. It can also mean matching two or more data sets to each other. Or it can be a Herculean effort of conforming multiple sources in multiple formats that were never created to fit together – until the Data Scientist makes it so.
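For the simple(r) end of that range, matching one data set to another often comes down to finding a shared key. Here is a minimal sketch, assuming two invented data sets (customers and orders) that share a hypothetical customer_id field; the match_orders helper is made up for illustration.

```python
# Hypothetical data sets that share a "customer_id" key.
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Raj"},
]
orders = [
    {"order_id": 10, "customer_id": 2, "total": 25.0},
    {"order_id": 11, "customer_id": 3, "total": 9.5},
]


def match_orders(orders, customers):
    """Attach the customer name to each order; set aside orders with no match."""
    by_id = {c["customer_id"]: c for c in customers}
    matched, unmatched = [], []
    for order in orders:
        customer = by_id.get(order["customer_id"])
        if customer is None:
            unmatched.append(order)  # keep it for review rather than dropping it
        else:
            matched.append({**order, "name": customer["name"]})
    return matched, unmatched


matched, unmatched = match_orders(orders, customers)
```

Keeping the unmatched rows in their own list, instead of silently discarding them, makes it easier to see how well the two sources really fit together.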
Here’s a quick list of the tasks associated with munging, in simple terms, for those of us who aren’t so cool or data intelligent.
- Enrich the data
- Apply a Macro
- Find Pattern (Regular Expression)
- Transform Data Types
- Missing Data Handling
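To make a few of those tasks concrete, here is a small sketch covering enrichment, pattern finding, type transformation, and missing data handling. Everything in it is an assumption for illustration: the records, the ZIP_TO_CITY lookup, the deliberately loose email pattern, and the choice to fill a missing age with the average of the known ones.

```python
import re

# Hypothetical raw records: every field arrives as text and some are blank.
records = [
    {"id": "1", "email": "ann@example.com", "age": "34", "zip": "02139"},
    {"id": "2", "email": "not-an-email", "age": "", "zip": "94105"},
    {"id": "3", "email": "raj@example.org", "age": "41", "zip": "02139"},
]

# Hypothetical lookup table used to enrich each record with a city.
ZIP_TO_CITY = {"02139": "Cambridge", "94105": "San Francisco"}

# A loose pattern for illustration only, not a full email validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def munge_record(record, default_age):
    cleaned = dict(record)
    # Transform data types: text to integers.
    cleaned["id"] = int(record["id"])
    # Missing data handling: fall back to a default when age is blank.
    cleaned["age"] = int(record["age"]) if record["age"] else default_age
    # Find pattern (regular expression): flag values that look like emails.
    cleaned["email_ok"] = bool(EMAIL_PATTERN.match(record["email"]))
    # Enrich the data: add a city looked up from the zip code.
    cleaned["city"] = ZIP_TO_CITY.get(record["zip"], "unknown")
    return cleaned


known_ages = [int(r["age"]) for r in records if r["age"]]
default_age = sum(known_ages) // len(known_ages)  # integer average of known ages
cleaned = [munge_record(r, default_age) for r in records]
```

Filling a gap with an average is only one option. Dropping the record or leaving the value explicitly empty can be the better call, depending on what the data is for.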
The last task, missing data handling, seems the most intriguing. Like the odd sock sucked into the dryer vortex, or the peekaboo the computer plays with specific emails that used to be there and can’t be found, most data sets have missing data items. That aspect alone could fill another post, but it presents an easy-to-understand concern about Big Data and data lakes. It seems too good to be true – placing all these disparate silos in one place and getting to use them for a collective picture.
The answer is yes, it is possible, and the BUT is that cleaning data takes resources. The postscript is that it’s not a binary decision: clean it “enough” to do what you want.
Munging didn’t make the original Glossary of Big Data Terms on What’s The Big Data Idea, so it’s being added with this post.
Here is another definition wandering the internet, although its permanent-distortion sense didn’t seem to be the one in use in the sources that turned up during internet research.
Mung or munge is computer jargon for a series of changes to a piece of data, which are often well defined and individually reversible, but which transform the original item into an unrecognizable form. The changes may be destructive, e.g. by corrupting a computer file, or simply concealing, e.g. changes to an email address to disguise it from spambots.
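The email example in that definition is easy to sketch. The " [at] " and " [dot] " substitutions below are just one common convention, chosen here as an assumption, and the function names are hypothetical. The round trip only works if the original address doesn't already contain those tokens.

```python
def munge_address(address):
    """Disguise an email address so simple spambots don't recognize it."""
    return address.replace("@", " [at] ").replace(".", " [dot] ")


def unmunge_address(munged):
    """Reverse the disguise, assuming the tokens appear only where we put them."""
    return munged.replace(" [at] ", "@").replace(" [dot] ", ".")


original = "jane.doe@example.com"
munged = munge_address(original)
restored = unmunge_address(munged)
```

Each replacement is well defined and individually reversible, yet the munged string no longer looks like an address to a naive scraper, which is the "concealing" kind of munging the definition describes.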