In my previous blogs in the Data Lake series, I have outlined what a data lake is and how to get started. In this blog I will outline one of the biggest challenges in getting the maximum value out of a Data Lake: being able to share it across an organisation. Many organisations have invested in Data Lakes but have not unlocked the true value, as they have struggled to get multiple business lines to successfully run on the same infrastructure.
To understand why this is difficult, let's first understand the different types of data processing loads that typically need to run together on these platforms.
They largely break down into two categories: analytics and operational use-cases.
Analytics use-cases
Analytics use-cases generally involve executing complex statistical and mathematical algorithms, aggregations or calculations on vast amounts of disparate, unstructured, semi-structured and structured data to find patterns and identify behavioural insights about a person, entity or system.
The information is typically ingested into centralised data infrastructure, where it is blended, correlated and analysed offline by a handful of data scientists/analysts using statistical methods and algorithms, e.g. machine/deep learning, data mining, natural language processing/understanding.
The service criticality of these types of use-cases is not normally high because if any of them fail, they can be re-run without affecting the core intra-day processes. There are of course exceptions, e.g. AML, fraud detection and trader surveillance, which are highly analytical but can trigger mission-critical intra-day processes.
They are also typically user driven (i.e. a super user fires off the processing jobs) and tend to respond in minutes or sometimes hours. Analytics processing patterns are a mixture of set and graph processing models used to perform analysis across a broad set of business data.
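To make the set-processing style concrete, the sketch below shows what a typical analytics aggregation might look like in PySpark. The paths, dataset and column names are hypothetical; it illustrates the pattern rather than any particular implementation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("customer-behaviour-analysis").getOrCreate()

# Hypothetical blended dataset of customer events landed in the lake
events = spark.read.parquet("/datalake/curated/customer_events")

# Set-style processing: aggregate behaviour per customer across all channels
behaviour = (
    events.groupBy("customer_id")
    .agg(
        F.count("*").alias("event_count"),
        F.countDistinct("channel").alias("channels_used"),
        F.avg("basket_value").alias("avg_basket_value"),
    )
)

behaviour.write.mode("overwrite").parquet("/datalake/analytics/customer_behaviour")
```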
Operational use-cases
However, the majority of data processing workloads in organisations are operational use-cases. They typically run day on day, follow defined business processes and are a critical part of the operation of the organisation. If they stop unexpectedly then the business is severely impacted. The data also differs, in that these workloads generally run on structured data using traditional data models (e.g. OLTP and star/Kimball schemas).
They also perform RDBMS-oriented relational algebra processing on different sets of data delivered at different points in time. Data is typically ingested and transformed through a number of steps that require completion by multiple systems, people and workflows before the outcome is produced, e.g. general-ledger postings, risk production etc.
Operational use-cases orchestrate many disparate steps of simple processing logic, workflow and transformation tasks at the transaction level. Operational processing pipelines are triggered by data being received rather than by a user request.
They also typically run headless (i.e. on a server without needing to be triggered manually) and may run in long batch windows spanning hours or days. Operational processes also generally have a high level of business criticality, i.e. if they stopped mid-process then critical business deadlines could be missed.
Multi vs. Single Tenancy
Data Lake platforms offer massive scalability and present the tantalising prospect of enabling a large enterprise to centralise its data processing onto one infrastructure. The key components that enable this are large-scale distributed data storage (HDFS), a highly scalable distributed processing framework (e.g. Spark) and resource management and monitoring components (e.g. YARN, Mesos, Ambari) that execute, track and monitor many discrete processes on the cluster.
These capabilities give the potential to run many different processing loads owned by different groups within the same company on the same physical cluster at the same time. This is called multi-tenancy, i.e. one physical group of servers executing multiple processes for multiple business functions (tenants), each running discrete and separate processing on shared data.
Operating Multi-Tenancy Platforms
As data processing platforms, Data Lake ecosystems are still maturing with respect to being able to run many different workloads simultaneously in a deterministic way. The resource management functionality has improved in leaps and bounds over the last couple of years in managing shared processing and memory models. Components like YARN/Mesos have queuing and job prioritisation mechanisms that one can use to execute jobs based on criticality. They can also start and stop jobs if higher priority tasks are required to run alongside them.
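For example, one way to express criticality on a YARN-based cluster is to submit each tenant's jobs to a dedicated scheduler queue so that operational work is prioritised over analytics. A minimal PySpark sketch, assuming queues named "operational" and "analytics" have already been defined in the cluster's scheduler configuration:

```python
from pyspark.sql import SparkSession

# Submit this job to the high-priority queue reserved for operational workloads.
# The queue names are assumptions and must match the cluster's
# capacity/fair scheduler configuration.
spark = (
    SparkSession.builder
    .appName("eod-risk-production")
    .config("spark.yarn.queue", "operational")  # an analytics job would use "analytics"
    .getOrCreate()
)
```

The scheduler then arbitrates resources between the queues according to the shares and pre-emption rules configured by the platform team.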
However, the lack of full QoS (i.e. deterministic processing times) can lead to processing pipeline stalls and overruns that break business SLAs, which can mean massive costs incurred in running large support teams and can even result in fines (especially true in financial services). Organisations have addressed this in a number of ways, including:
- Having massive clusters so that jobs complete quickly (for example, Google has 10,000s of servers)
- Running the Data Lake stack on PaaS (Platform as a Service) type infrastructure to manage the containers/VMs rather than the processes. These include public clouds (AWS, GCP, Azure, etc.) or private/hybrid cloud container stacks such as OpenShift, Kubernetes, Docker, Mesos (DC/OS), etc.
- Having much simpler processing models with lower business criticality, e.g. tweets and shopping carts can be lost for one user without too much business impact
- Having sophisticated support teams that can monitor the clusters and remediate when things go wrong.
It is worth mentioning that the operational aspects of these components are improving all the time, with resiliency and mission-critical-grade functionality being added with each new release. That said, a lot of the components that extend the ecosystem are still being developed, some of which are not quite production ready, so they need to be treated carefully before being deployed into mission-critical processes.
Architecture
The processing and data architecture approach also changes to support multi-tenancy. It is very hard to create one data model and processing pipeline that supports many disparate use-cases. To get around this, there are a number of techniques that can be used:
- Using logic to join data rather than a data structure, e.g. performing an aggregation in code rather than expressing it through a foreign key in the data model.
- Executing logic on the data directly (rather than copying data to specific applications and copying the results back)
- Schema on read (interpreting and joining data at query time rather than on ingest); see the sketch after this list
- Copying data into different structures on ingest for different use-cases but always keeping the data automatically synchronised to avoid fragmentation.
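As an illustration of schema on read, the PySpark sketch below applies a use-case-specific schema when the raw files are queried rather than when they are ingested. The paths and field names are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-example").getOrCreate()

# Each tenant applies its own view of the raw data at query time;
# the files in the landing area are left untouched and untyped.
trade_schema = StructType([
    StructField("trade_id", StringType()),
    StructField("book", StringType()),
    StructField("notional", DoubleType()),
    StructField("trade_ts", TimestampType()),
])

trades = (
    spark.read
    .schema(trade_schema)           # schema applied on read, not on ingest
    .json("/datalake/raw/trades/")  # raw, schema-less landing area
)

exposures = trades.groupBy("book").sum("notional")
```

A different tenant can read the same raw files with its own schema and joins, without the ingest process having to anticipate every consumer.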
Also, with multi-tenancy one has to think carefully about the security architecture. Shared data comes with the complexities of data that not everyone can see. Entitlement models can get very complex and one has to think carefully about how these are implemented. Row- and column-level access applied at query time may be needed, along with complex encryption requirements.
Again, this has improved in the last couple of years but it still requires careful design in the multi-tenancy world.
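As a sketch of what query-time row- and column-level control can look like, entitlements can be expressed as a view that filters rows and masks sensitive columns. In practice this is usually enforced through a policy engine such as Apache Ranger; the Spark SQL example below is only an illustration, and the table, column and region names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("entitlements-example")
    .enableHiveSupport()
    .getOrCreate()
)

# Query-time row filtering and column masking expressed as a view.
spark.sql("""
    CREATE OR REPLACE VIEW trades_emea_masked AS
    SELECT
        trade_id,
        book,
        notional,
        sha2(counterparty_name, 256) AS counterparty_token  -- sensitive column masked
    FROM trades
    WHERE region = 'EMEA'                                    -- row-level filter
""")
```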
Governance
Another key aspect of multi-tenancy is the governance and SDLC around shared resources. Given that one of the key principles is being able to share data, the data models will be used by many different processes. This needs to be governed carefully to avoid the problem of a new release of a shared schema breaking data access logic in existing processes.
To help with this, many data storage formats support schema evolution (e.g. Avro, Parquet, etc.), where the schema can change over time without changing the historical data. One still has to be very careful to make sure that new schema structures are released without breaking existing logic.
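For example, Spark can merge Parquet schemas across files written at different times, so older files that pre-date a newly added optional column still load, with the missing column read as null. A minimal sketch with a hypothetical path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution-example").getOrCreate()

# Files written before and after a new optional column was added can be
# read together; Spark merges the schemas and fills the missing column
# with nulls for the older files.
positions = (
    spark.read
    .option("mergeSchema", "true")
    .parquet("/datalake/curated/positions/")  # contains both old and new files
)

positions.printSchema()
```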
Another gotcha is when developers make assumptions about the models that they are implementing against, e.g. hard coding structural links rather than using a metadata approach to find the relevant sub-element or attribute. Sometimes this is done for performance reasons, but caching techniques for the lookup logic can be employed to overcome the issue. The key here is to create an agile data modelling approach that can change easily but in a controlled manner.
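A hedged sketch of that metadata-driven approach: instead of hard coding where an attribute lives, the consumer asks a metadata registry for its current location and caches the answer so the lookup cost is only paid once. The registry and names here are entirely hypothetical:

```python
from functools import lru_cache

# Hypothetical in-memory stand-in for a metadata registry; in practice this
# would be a call to a catalogue service such as the Hive metastore or a
# bespoke metadata API.
_REGISTRY = {
    ("trade", "counterparty_id"): "parties.counterparty.id",
}

@lru_cache(maxsize=1024)
def resolve_attribute(entity: str, attribute: str) -> str:
    """Return the current path of an attribute rather than hard coding it."""
    return _REGISTRY[(entity, attribute)]

# Consumers resolve the path at runtime; if the model evolves, only the
# registry changes, not every piece of dependent code.
path = resolve_attribute("trade", "counterparty_id")
```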
Many data processing frameworks (Hive, Spark, etc.) support User Defined Functions (UDFs). These are deployed in the cluster as shared resources used to execute logic on the underlying data at query time. They are great for extending functionality, but if they are shared then a release and versioning strategy needs to be in place so that future releases do not break existing dependent logic.
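For example, in Spark SQL a shared function can be registered under an explicitly versioned name so that a new release can coexist with the old one while dependent logic migrates. The business rule itself is hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("shared-udf-example").getOrCreate()

def normalise_account_id_v2(raw):
    """Hypothetical shared business rule for cleaning account identifiers."""
    return raw.strip().upper() if raw else None

# Registering the UDF under a versioned name lets v1 and v2 run side by side
# until every consumer has migrated.
spark.udf.register("normalise_account_id_v2", normalise_account_id_v2, StringType())

spark.sql("SELECT normalise_account_id_v2(' abc123 ') AS account_id").show()
```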
This also goes for building other shared components. Organisations have traditionally built applications in isolation, so the code bases of application components are typically separate and don't depend on each other (although there are exceptions to this). Given the shared nature of this infrastructure, any components that other processes rely on need strong dev, test and release cycle processes with good regression packs.
One way to achieve this is to treat all shared components as one would open source. Shared components are developed as part of a shared code base, and multiple teams make changes that can be promoted after a review by a cross-functional group. Banks would need to make sure that this group functions correctly and that project pressures do not override this function and fragment the code base.
One of the other aspects organisations need to consider carefully is the operating model of how multiple disparate projects are funded, prioritised and implemented on shared infrastructure. Again, there have been many examples of multiple Data Lakes being built because departments wanting fast time to market could not fund, develop or deploy their applications on shared infrastructure.
Summary
Data Lakes offer a great opportunity for organisations to transition fragmented I.T. estates to a shared data processing infrastructure, but the operational aspects need to be thought through very carefully and can be summarised as follows:
- The immaturity of key technical components needs to be considered very carefully before their inclusion in mission-critical processes.
- The lack of many true multi-tenancy operational use-cases across the industry, coupled with the rate of change in the technology, introduces implementation risk.
- Running different loads on the same cluster with enough control and partitioning to meet different operational SLAs needs careful planning, monitoring and sophisticated design.
- Managing the resource/SDLC contention of shared artefacts on the platform, e.g. data models, UDFs, business logic, components and metadata, needs to be carefully controlled.
- Concurrent releases of specific business-aligned functionality need to be managed and delivered in a controlled and agile way across teams; this includes shared testing and DevOps.
- The architecture (technical, data, security) needs to be thought through carefully so that multiple teams, users and support staff can interact with the platform in an agile and compliant way.