According to Wikipedia, a data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. It’s usually a single store of all enterprise data, including raw copies of source system data and transformed data used for tasks such as reporting, visualization, and analytics.

Building a data lake is somewhat distinct from typical IT projects (modernization or digital efforts), as an improper approach can lead to significant setbacks during the delivery or integration phases. In this post, we will discuss the two most important pieces of the puzzle, often overlooked in today’s IT projects.

Requirements

With the advent of Agile and Scrum, there’s a common myth that it’s OK to start without a full understanding of the requirements and develop incrementally. This approach may work well for some agile projects, such as microservices development. When delivering a data lake, however, it’s important to hash out the critical requirements up front.

Data Requirements

It’s important to understand not only the data but also its sources, as this clarifies the ingestion patterns and the solutions already in place. The next step is to review the actual data: its structure and capacity needs. This part is essential, because the choice of platform depends heavily on the size and nature of the data. In the majority of use cases, it is equally important to understand the applicable governance and compliance rules.
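As a concrete illustration, here is a minimal sketch of how these findings might be captured during requirements gathering. The source names, volumes, and compliance tags are hypothetical; the point is that ingestion pattern, format, capacity, and compliance are recorded per source so they can drive platform decisions later.

```python
from dataclasses import dataclass, field

@dataclass
class SourceProfile:
    """Profile of one source system feeding the lake."""
    name: str                    # e.g. "orders_db"
    ingestion: str               # "batch" or "streaming"
    data_format: str             # "csv", "json", "avro", ...
    daily_volume_gb: float       # drives capacity planning
    compliance_tags: list = field(default_factory=list)  # e.g. ["PII"]

# Hypothetical inventory built while interviewing source-system owners
sources = [
    SourceProfile("orders_db", "batch", "csv", 40.0, ["PII"]),
    SourceProfile("clickstream", "streaming", "json", 250.0),
]

# Simple roll-ups that feed the sizing and governance discussions
total_gb = sum(s.daily_volume_gb for s in sources)
pii = [s.name for s in sources if "PII" in s.compliance_tags]
print(f"Daily ingest: {total_gb} GB; sources with PII: {pii}")
```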

Analytics & Reporting Requirements

One of the primary reasons organizations build a data lake is to give data scientists and analysts an opportunity to slice and dice the data and create meaning out of it. To serve them well, it’s important to understand the data structures those users expect and the tools they currently use. Data in the lake can be consumed through a wide range of technologies, but the investment and time required differ significantly depending on the tools chosen, so a careful review of those tools and technologies is pertinent.
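To make the point tangible, here is a small sketch of two consumption paths over the same lake data, assuming the lake stores Parquet files (the file name is hypothetical, and pandas/duckdb stand in for whatever engines your users actually run):

```python
import duckdb
import pandas as pd

# A data scientist might pull the file into a DataFrame for exploration...
df = pd.read_parquet("sales_2024.parquet")

# ...while an analyst queries the very same file with SQL.
revenue = duckdb.sql(
    "SELECT region, SUM(amount) AS revenue "
    "FROM 'sales_2024.parquet' "
    "GROUP BY region ORDER BY revenue DESC"
).df()
```

Both paths work, but they imply different skills and infrastructure, which is exactly why the tool review belongs in the requirements phase.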

Business Requirements

Any technology solution is meant for just that: solving a business problem. It’s prudent to understand who the users of the data are, how the data will be useful to them, and what their usage patterns look like. It’s equally important to gather requirements around reporting, metrics, and analytics use cases.

Architecture

The agile mindset (nothing wrong with that) usually drives us into delivery as soon as the requirements are understood. Here again, big data projects suffer to a large extent when the architecture phase, one of the most important, is skipped. The architecture phase is distinct from design, which is more about data modeling and application design, and it should be treated as its own step.

Data Architecture

In the early days of data lakes, Hadoop was considered the only solution, and there was a great deal of debate about the nature of the lake and its suitability for a multi-tenant environment. It’s clear now that any data lake we build needs to support multi-tenancy. Also, as discussed in the requirements section, data lakes serve multiple purposes and can’t be focused on just a few functions.
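One common way multi-tenancy surfaces in practice is through tenant- and zone-scoped storage prefixes, so access policies can be attached per prefix. A minimal sketch, assuming an S3-style object store (the bucket and zone names are illustrative):

```python
VALID_ZONES = {"raw", "curated", "consumption"}

def lake_path(tenant: str, zone: str, domain: str, dataset: str) -> str:
    """Build a tenant-isolated object-store prefix for a dataset."""
    if zone not in VALID_ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"s3://acme-data-lake/{zone}/{tenant}/{domain}/{dataset}/"

print(lake_path("marketing", "raw", "web", "clickstream"))
# s3://acme-data-lake/raw/marketing/web/clickstream/
```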

For these reasons, the data lake needs to be architected to handle a variety of data structures and functions and to accommodate change across multiple data domains. The architecture needs to focus particularly on the ingestion sources and patterns; this helps avoid duplicated functionality and keeps the ingestion sources in a common repository and format. The data lake should also use a common vocabulary to reduce confusion and divergence and to ensure a ‘single version of the truth’.
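A common vocabulary can be as simple as one agreed mapping applied at ingestion time. The field names below are hypothetical, but the idea is that no two teams ever publish the same attribute under different names:

```python
# Shared mapping from source field names to the canonical vocabulary
CANONICAL_FIELDS = {
    "cust_id": "customer_id",
    "customerId": "customer_id",
    "order_ts": "order_timestamp",
}

def normalize(record: dict) -> dict:
    """Rename source fields to the agreed vocabulary on ingestion."""
    return {CANONICAL_FIELDS.get(k, k): v for k, v in record.items()}

print(normalize({"cust_id": 42, "order_ts": "2024-05-01T10:00:00Z"}))
# {'customer_id': 42, 'order_timestamp': '2024-05-01T10:00:00Z'}
```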

Platform Architecture

The de facto use of Hadoop for building data lakes has subsided, so it’s important for organizations to put real emphasis on the choice of platform. Organizations, particularly the users, need to consider cost and ROI alongside the requirements. It’s imperative that organizations consider the cloud as a platform not only for applications but also for data lakes. A data lake has to support multiple user groups and provision for their requirements and needs; these groups include technical users (developers, data scientists) and non-technical users (product managers, data analysts), which makes the case for a cloud platform all the stronger. The cloud also brings elasticity, allowing the data lake to scale as users and data grow.
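As a minimal illustration of the object-storage-backed approach, the sketch below lands a raw extract in the cloud store and lists it back, assuming AWS S3 via boto3 (the bucket and file names are hypothetical). Storage simply grows with the data, with no cluster to resize:

```python
import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

# Land a raw extract in the object store that backs the lake
s3.upload_file(
    "orders_2024-05-01.csv",                # local extract
    "acme-data-lake",                       # bucket
    "raw/orders_db/2024/05/01/orders.csv",  # zone/source/date key
)

# Any authorized consumer can then discover the data by prefix
resp = s3.list_objects_v2(Bucket="acme-data-lake", Prefix="raw/orders_db/")
for obj in resp.get("Contents", []):
    print(obj["Key"])
```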

A thorough understanding of the requirements, followed by hashing out the end-to-end architecture, is an essential prerequisite for the successful implementation of a data lake.