According to Wikipedia, a data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. It is typically a single store of all enterprise data, including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics, and others.
Building a data lake is somewhat different from a typical IT project (a modernization or digital effort): an improper approach can lead to significant setbacks during the delivery or integration phases. In this post, we will discuss the two most important pieces of the puzzle, which are often overlooked in modern IT projects.
Requirements
With the advent of Agile & Scrum, there's a common myth that it's OK to start without a full understanding of the requirements and develop incrementally. While this approach may work well for some agile projects (e.g., microservices development), it's important to hash out the critical requirements before setting out to deliver a data lake.
Data Requirements: It is important not only to understand the data, but also its sources. This helps in understanding the ingestion patterns and the solutions already in place. The next step is to review the actual data: its structures and capacity needs. This last part is essential, as the choice of solution (platform) depends heavily on the size & nature of the data. In most use cases, it is equally important to understand the governance and compliance rules.
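As a rough illustration, the outcome of this exercise can be captured as a simple source inventory. The fields and example entries below are hypothetical; they sketch the kind of detail worth recording (format, volume, ingestion pattern, governance tags) rather than prescribing a schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataSource:
    """One entry in a data-source inventory gathered during requirements."""
    name: str                  # e.g. "orders_db" (hypothetical)
    kind: str                  # "rdbms", "files", "stream", ...
    fmt: str                   # "csv", "parquet", "json", ...
    daily_volume_gb: float     # drives platform and capacity sizing
    ingestion: str             # "batch" or "streaming"
    governance_tags: List[str] = field(default_factory=list)  # e.g. ["PII", "GDPR"]

sources = [
    DataSource("orders_db", "rdbms", "parquet", 40.0, "batch", ["PII"]),
    DataSource("clickstream", "stream", "json", 250.0, "streaming"),
]

# The total daily volume feeds directly into the capacity discussion above.
total_gb_per_day = sum(s.daily_volume_gb for s in sources)
print(f"Estimated ingest: {total_gb_per_day} GB/day")
```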
Analytics & Reporting Requirements: One of the primary reasons organizations build a data lake is to give data scientists and analysts an opportunity to slice, dice, and create meaning out of the data. To serve them meaningfully, it is important to understand the data structures and expectations of those users, as well as the tools they currently use. While data in the lake can be consumed by anyone using a variety of technologies, the investment and time required differ significantly depending on the tools & technologies, so a careful review of them is warranted.
Business Requirements: Any technology solution is meant for just that: solving a business problem. It's prudent to understand the users of this data; in particular, their characteristics, how the data will be useful to them, and their usage patterns. It's equally important to gather requirements around reporting, metrics, and analytics use-cases.
Architecture
The agile mindset (nothing wrong with that) usually drives us straight into the project (that is, the deliverables) as soon as the requirements are understood. Here again, big data projects suffer to a large extent when one of the most important phases, architecture, is skipped. The architecture phase is distinct and should be treated separately from design, which is more about data modelling and application design.
Data Architecture: In the early days of data lakes, when Hadoop was considered the only solution, there was a great deal of debate about the nature of the lake and its suitability for a multi-tenant environment. It's clear now that any data lake we build NEEDS to support multi-tenancy. Also, as discussed in the requirements section, a data lake serves multiple purposes and can't be focused on just a few functions.
For these reasons, the data lake needs to be architected to accommodate a variety of data structures, a broad set of functions, and changes across multiple data domains. The architecture needs to focus particularly on the ingestion sources and patterns; doing so helps avoid duplication of functionality and keeps ingested data in a common repository and format. The data lake should use a common vocabulary to reduce confusion and divergence and to ensure a 'single version of the truth'.
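To make the idea of a common repository, format, and vocabulary concrete, here is a minimal sketch of a shared path convention that every ingestion pipeline and consumer could follow. The zone names, tenants, and layout are illustrative assumptions, not a prescribed standard.

```python
# A minimal sketch of a zoned, multi-tenant path convention; the zone names
# and layout below are illustrative assumptions, not a prescribed standard.
ZONES = ("raw", "curated", "consumption")

def dataset_path(tenant: str, zone: str, domain: str, dataset: str) -> str:
    """Build a consistent storage path so every team lands and reads data the
    same way: one shared vocabulary, one version of the truth."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"/lake/{tenant}/{zone}/{domain}/{dataset}"

print(dataset_path("retail", "raw", "sales", "orders"))     # /lake/retail/raw/sales/orders
print(dataset_path("finance", "curated", "gl", "journal"))  # /lake/finance/curated/gl/journal
```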
Platform Architecture: The de facto use of Hadoop for building data lakes has subsided, so it's important for organizations to put real thought into the platform. Organizations need to weigh cost & ROI along with the requirements, particularly the users. It's imperative that organizations consider the cloud as a platform not only for applications, but also for data lakes. A data lake has to support multiple user groups and provision for their requirements & needs. These user groups include technical users (developers, data scientists) and non-technical users (product managers, data analysts), which makes the flexibility of a cloud platform all the more valuable. The cloud also provides elasticity, allowing the data lake to scale as users or data grow.
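As a rough illustration of the elasticity argument, the sketch below projects how storage needs compound as ingest grows; all figures are placeholder assumptions, but this is exactly the kind of growth a cloud platform absorbs without up-front hardware sizing.

```python
# A back-of-the-envelope growth projection (all figures are placeholder
# assumptions) illustrating why elastic cloud storage matters as data grows.
def projected_storage_tb(daily_ingest_gb: float, months: int,
                         monthly_growth: float = 0.05) -> float:
    """Cumulative storage after `months`, with ingest growing each month."""
    total_gb, ingest = 0.0, daily_ingest_gb
    for _ in range(months):
        total_gb += ingest * 30          # roughly 30 days per month
        ingest *= 1 + monthly_growth     # ingest itself keeps growing
    return total_gb / 1024               # GB -> TB

for m in (6, 12, 24):
    print(f"{m:>2} months -> ~{projected_storage_tb(300, m):.1f} TB")
```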
A thorough understanding of the requirements, followed by hashing out the end-to-end architecture, is an essential prerequisite for a successful data lake implementation. In subsequent posts, we will discuss the technical details of building data lakes. Until then, happy reading.