What is a Data Lake and What Does it Do?

Your organization already has a lot of information:

  • existing database systems (and their older versions)

  • website information and its clickstream

  • product or service production and/or performance measurements

  • social media that’s become an intricate portrait of interaction because one site doesn’t do you justice … or ensure your existence.

All these information venues are vital to your organization, but they don’t necessarily talk to each other. Each has a different format and a different purpose, and each was created at a different moment on the graph of Moore’s Law. So how can all that data be used (or at least not wasted)?

In Big Data, the answer is the Data Lake.

The Data Lake holds a lot of awkward fish

Since Data Lake is a common term in Big Data speak, exactly what is it? What can a data lake do? What does the data lake NOT do?

“A data lake is a large object-based storage repository that holds data in its native format until it is needed.” – Margaret Rouse at WhatIs.com

So with Big Data, the Data Lake holds the data in its original format until it is time to use it.

How does the Data Lake work?

“While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.” – Margaret Rouse at WhatIs.com
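To make that description a little more concrete, here is a toy sketch in Python. It is not how Hadoop or any real data lake product works under the hood; the `TinyDataLake` class, its `put` and `query` methods, and the tag names (`source`, `fmt`) are all made up for illustration. It only shows the three ideas from the quote: flat storage, a unique identifier per item, and metadata tags you can query on.

```python
import uuid


class TinyDataLake:
    """Toy in-memory stand-in for a data lake (hypothetical, for illustration)."""

    def __init__(self):
        # Flat storage: one dictionary, no folders or hierarchy.
        # Maps object_id -> (raw payload, metadata tags).
        self._objects = {}

    def put(self, raw, **tags):
        """Store data in its native format and tag it with metadata."""
        object_id = str(uuid.uuid4())  # unique identifier for this item
        self._objects[object_id] = (raw, dict(tags))
        return object_id

    def query(self, **wanted):
        """Return (object_id, raw) pairs whose tags match every wanted value."""
        return [
            (object_id, raw)
            for object_id, (raw, tags) in self._objects.items()
            if all(tags.get(key) == value for key, value in wanted.items())
        ]


lake = TinyDataLake()

# Very different kinds of data go in as-is; nothing is converted on the way in.
lake.put(b"id,total\n1,9.99\n", source="orders_db", fmt="csv")
lake.put('{"page": "/home", "user": 42}', source="clickstream", fmt="json")
lake.put("Loving the new release!", source="social", fmt="text")

# When a business question comes up, pull only the relevant subset.
clicks = lake.query(source="clickstream")
```

The point of the sketch is that nothing is reshaped on the way in. The work of interpreting each format is deferred until someone queries for a subset and analyzes it.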

The term data lake was first associated with Hadoop-oriented object storage; however, it has since become commonplace in business analytics and data mining.

So ... a data lake is where all, or a significant portion, of an organization’s data can be stored by everyone and used by anyone – kinda like Woodstock meets the UN. All the information silos of the past are dissolved into a common mass with a single point of access. This provides a holistic picture of an organization’s operations, as well as traceable accountability. The Data Lake is an opportunity to keep all of the organization’s information in one archive where every piece can be worked with.

Previously this wasn’t possible because the disparate structured databases couldn’t “talk to each other.” Ironically, even once they are placed into the data lake, they still don’t have to talk to each other. (UN, right?)

UN dudes around table

That scenario is the ideal; reality is a bit more challenging.

Next post: data lake best practices. Inviting UN delegates from around the world into an empty room without any program, format, or translators isn’t going to be very productive, even with the best heavy hors-d’oeuvre selection and a generous open bar. Likewise, the Data Lake is going to need some rules and best practices.