Data lakes, as I said in my previous blog, are the latest buzzword in analytics. While I warned against jumping into the world of data lakes without thinking carefully about what you want to achieve, I do believe that these data repositories are the long-term future. They have a clear advantage over data warehousing because they offer a non-relational way of looking at your data: you join the data together when you want to ask a question, rather than structuring your data in advance according to the questions you think you'll need to answer.
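To make that "structure on read" idea concrete, here is a minimal sketch, assuming PySpark and two hypothetical raw extracts already sitting in the lake; the paths, file formats and column names are illustrative, not taken from any particular system:

```python
# A minimal schema-on-read sketch, assuming PySpark and two hypothetical raw
# files already landed in the lake. The join is defined at query time,
# not when the data was stored.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Read the raw files exactly as they were deposited - no upfront modelling.
orders = spark.read.json("s3://lake/raw/orders/")                     # semi-structured JSON
customers = spark.read.option("header", True).csv("s3://lake/raw/customers/")

# Structure is imposed only now, because this particular question needs it.
orders_per_region = (
    orders.join(customers, on="customer_id", how="inner")
          .groupBy("region")
          .count()
)
orders_per_region.show()
```

Nothing about the stored files dictated this join; the structure exists only for as long as the question is being asked.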
And that's not all - data lakes win out over data warehousing in just about every area:
| Data lake pros | Data warehouse cons |
| --- | --- |
| Can take any form of structured or unstructured digital data, and the processing comes to the data rather than the other way around. | Supports only structured data and requires expensive reconciliation infrastructure. |
| Very easily scalable to cope with huge volumes of data, and built around the concept of accessibility and sharing. | Notoriously difficult and expensive to scale up, and encourages a siloed approach to data - it doesn't cope well with sharing. |
There are clear benefits for organisations that choose to migrate off data warehousing and onto a centralised data lake. So the next question is: how do you move from one to the other? There are three possible routes:
- Strategy 1: Migrate your data from the data warehouse to the lake
- Strategy 2: Use the warehouse and lake for separate functions
- Strategy 3: Use a visualisation layer to combine both solutions
There are, at first glance, attractive advantages to Strategy 2: no IT consolidation is needed and you can use your existing infrastructure, so it will be quicker and cheaper to implement. But it results in massive duplication of data, and cross-data analysis becomes extraordinarily difficult. Similarly, Strategy 3 looks cheap at first sight, but I'd say it's the highest-risk option because it's extremely hard to join the data between the two systems.
So, of these three approaches, I would firmly advise Strategy 1 - but as a careful, gradual migration rather than a big bang. Start small and move carefully:
- Choose a simple use case where the technology can deliver a clear, measurable outcome
- Run a pilot exercise on this test case before you commit fully to investing in the infrastructure (a sketch of one such pilot follows this list)
- Use the cloud, if you can
- Don't underestimate the data challenge. Your data needs to be of good quality and it will need to be normalised - i.e., brought to a consistent level of granularity and cleanliness so you can be confident you're comparing apples with apples.
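As an illustration of what such a pilot might look like, here is a minimal sketch, assuming pandas with a SQLAlchemy connection to the existing warehouse and an S3 bucket as the cloud landing zone; the connection string, table and paths are hypothetical placeholders:

```python
# A minimal pilot-export sketch: copy one warehouse table into the lake as
# Parquet. Assumes pandas, SQLAlchemy, pyarrow and s3fs are installed; the
# DSN, table and bucket below are placeholders, not real systems.
import pandas as pd
from sqlalchemy import create_engine

# Connect to the existing warehouse (hypothetical Postgres-style DSN).
engine = create_engine("postgresql://analyst:secret@warehouse.internal/sales")

# Pull one small, well-understood table for the pilot.
df = pd.read_sql("SELECT * FROM monthly_revenue", engine)

# Land it in the lake's raw zone as columnar Parquet.
df.to_parquet("s3://analytics-lake/raw/monthly_revenue.parquet", index=False)

print(f"Exported {len(df)} rows to the lake")
```

Starting with a single, well-understood table keeps the pilot cheap to run and easy to validate before any wider commitment.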
The last point is particularly important. Working with data is fundamentally a process: the quality of what you get out is directly related to the quality of the data you put in. A data lake isn't a dumping ground for everything digital you own; if you pour polluted water in, you won't want to drink what comes out.
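To show what "good quality and normalised" might mean in practice, here is a minimal sketch of a pre-load quality gate, assuming pandas and a hypothetical daily_sales.csv extract; the column names and the daily grain are illustrative assumptions, not prescriptions:

```python
# A minimal pre-load quality gate, assuming pandas and a hypothetical
# daily_sales.csv extract. The checks mirror the points above: completeness,
# duplicates, and a consistent (daily) granularity before loading to the lake.
import pandas as pd

df = pd.read_csv("daily_sales.csv", parse_dates=["sale_date"])

issues = []

# Completeness: key business columns should not be null.
for col in ["sale_date", "store_id", "revenue"]:
    nulls = df[col].isna().sum()
    if nulls:
        issues.append(f"{col}: {nulls} missing values")

# Duplicates: one row per store per day is the expected grain.
dupes = df.duplicated(subset=["sale_date", "store_id"]).sum()
if dupes:
    issues.append(f"{dupes} duplicate store/day rows")

# Granularity: every date should be a plain calendar day (no intra-day timestamps).
if not (df["sale_date"] == df["sale_date"].dt.normalize()).all():
    issues.append("sale_date contains intra-day timestamps")

if issues:
    raise ValueError("Extract failed quality checks: " + "; ".join(issues))
print("Extract passed - safe to load into the lake")
```

Only extracts that pass checks like these should flow into the lake; everything else goes back for cleaning rather than polluting the water.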