What a Data Lake Isn't

I'm not a huge fan of the term "Data Lake," but I am less of a fan of it being hijacked for a sales agenda. So I would like to clear up a few points about what a Data Lake is and what it most definitely isn't.

What it is

A Data Lake is a repository which can contain structured, semi-structured, and unstructured data. The purpose of the repository is to provide a way to bypass an ETL process and quickly move data into a queryable environment. This one location is ideal for mashing up disparate data sources and leveraging big data techniques such as schema-on-read.
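
To make schema-on-read a little more concrete, here is a minimal sketch in Python. It is not any particular product's API: the file names, the field names, and the `read_rows` and `query` helpers are all made up for illustration. The raw data lands untouched, and types are only applied when someone asks a question.

```python
import csv
import io
import json

# Hypothetical raw files, landed in the lake exactly as they arrived.
# Note the mixed formats, the extra "page" field, and the inconsistent types.
RAW_FILES = {
    "clicks.jsonl": '{"user": "ann", "ms": 120}\n'
                    '{"user": "bob", "ms": "95", "page": "/home"}\n',
    "legacy_clicks.csv": "user,ms\ncat,300\n",
}


def read_rows(name, text):
    """Parse one raw file into dicts; nothing was validated at load time."""
    if name.endswith(".jsonl"):
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    return list(csv.DictReader(io.StringIO(text)))


def query(schema):
    """Apply a schema (field name -> cast function) only now, at read time."""
    for name, text in RAW_FILES.items():
        for row in read_rows(name, text):
            yield {field: cast(row[field]) for field, cast in schema.items()}


# The "schema" is chosen by the person asking the question, not by a loader.
rows = list(query({"user": str, "ms": int}))
total_ms = sum(r["ms"] for r in rows)
```

The trade-off is the one this post keeps coming back to: loading was free, but every reader now carries the burden of knowing which fields exist and how to interpret them.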

I think you'll find this definition a bit more satisfying than what you'll find on Wikipedia:
https://en.wikipedia.org/wiki/Data_lake

A Data lake is a large storage repository that "holds data until it is needed". The term was coined by James Dixon, Pentaho chief technology officer. As of 2015, data lakes could be described as "one of the more controversial ways to manage big data"

What it means

Firstly, a data lake is a concept (more on Microsoft's product in a moment) which blows a kiss in the wind with a promise that integration work is not necessary. Integration work, after all, is most of the work in an analytic project. If that work can be skipped, it's a huge win for the people who want their reports built today. A data lake needs analytical tools that can do light ETL work and have a low barrier to entry. If this is HDFS and Hive, great - if it is machine learning algorithms analyzing images or text, not great. If the data is clean and additive, wonderful. If it is full of duplicates, needs fuzzy join logic, revises history, or is full of codes that need to be translated to a human-readable form, not so great. A Data Lake hopes to spell the end of integration work and remove a huge barrier to getting to the fun stuff, the analytics. This may work well in some cases, but I'm not sure that it spells the end for Data Warehouses. I think we will find some happy medium between delayed integration work and fast analytics.

Earlier this year, Microsoft released a product called Data Lake. This is a version of the very product that they use to run Bing, Xbox, and Skype. The success story here is that Bing was improved as a search engine because the data and analytic tools were opened up to all of Microsoft to participate in experimentation. So, I guess if it works for Microsoft...

What it isn't

A data lake is not a data warehouse. It is not cleaned up and organized, and it may well be full of potentially irrelevant data. A data lake needs a catalog, whereas you can probably stumble your way around a well-designed data warehouse. A data lake cannot live on an RDBMS. It just can't. They are opposites. The very nature of a relational database requires that data be stored in relations. This means that data must be organized into strongly typed columns. The fact that you have to do any work with schemas or transformations to load a relational database blows the no-ETL premise out of the water.
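
Here is a small sketch of that contrast, using Python's built-in `sqlite3` module as a stand-in for a relational database. The table, the record, and the in-memory `lake` list are hypothetical, and SQLite is actually more lenient about column types than most RDBMSs, but the point about declaring structure up front still shows through.

```python
import json
import sqlite3

# Hypothetical incoming record with a field nobody planned for ("page").
record = {"user_name": "bob", "ms": 95, "page": "/home"}

# Relational side: the table's shape must be declared before anything loads.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clicks (user_name TEXT, ms INTEGER)")

columns = ", ".join(record)
placeholders = ", ".join("?" for _ in record)
try:
    db.execute(
        f"INSERT INTO clicks ({columns}) VALUES ({placeholders})",
        tuple(record.values()),
    )
    loaded_into_table = True
except sqlite3.OperationalError:
    # "page" is not in the declared schema, so the load is rejected until
    # someone alters the table or transforms the record. That is ETL work.
    loaded_into_table = False

# Lake side: append the raw record as-is and worry about structure at read time.
lake = []
lake.append(json.dumps(record))
```

The relational load fails until somebody does schema or transformation work, while the lake-style append accepts the record without asking any questions.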

Though distributed column stores like Vertica are a big data technology, they are also relational databases and, therefore, are not data lakes. So big data does not necessarily equal data lake. There are many big data technologies that can serve as a data lake, so long as the technology is forgiving with schemas or simply doesn't care about them at all. Just keep this core principle in mind - a data lake provides a way to bypass an ETL process and quickly move data into a queryable environment. If it is not doing that, it is not a data lake.
