Lakehouse Data Management

Lakehouse data management technology innovation

Most sports teams and university advancement teams understand the efficiencies gained by combining business data from multiple sources in a central location. Over the last 30 years, several technical advancements have been designed to combine data streams more efficiently. First came the data warehouse, followed by the data lake. While both technologies have benefits, neither is exactly right for sports and universities. A hybrid technology, the data lakehouse, represents the best practice for sports teams and universities by combining the positive aspects of both. To understand the lakehouse approach to data management, it helps to first understand the evolution of data environments.

Data warehouse

Long considered critical to supporting business intelligence applications, the data warehouse came on the scene in the 1980s as a system that could handle large, structured data sets. Because data warehouses consist of rigidly architected, highly interconnected and interdependent tables, they do not always provide the ideal data management solution, especially in sports organizations and universities, which have complex and rapidly changing sources and uses of data. In these environments, even modest changes often require significant effort to implement. As a result, warehouse maintenance costs for sports teams and universities tend to be very high.

Fast forward to the present day: unstructured and semi-structured data are abundant. Data with high variety, velocity and volume is generated in forms relevant to today’s technology. Text messages, social media, images, audio, video, and chatbots are all examples of unstructured data. All of these are common in both sports and university fundraising environments, and they are exactly the types of data for which a data warehouse is not optimized.

Data lake

As data and data environments have evolved, the need for a reasonably priced system that provides flexibility and processing performance became obvious. Out of this need arose the data lake – a repository for raw data in a variety of formats. Among the advantages of using a data lake are:

Ability to store massive amounts of data in native formats
Users can access both low and high velocity information easily
Multiple data formats are easily ingested
Lower overall cost

For these reasons, data lakes are suitable for storing and providing a solid basis for predictive analytics. However, they have some drawbacks. These include:

Highly skilled data scientists are usually required to extract usable insights
Data stored in native formats must be reformatted manually and can’t be easily curated or arranged for specific business reporting purposes, as it can in a data warehouse
Difficult to establish data governance
Lack of enforcement of data quality
Difficulty in matching different record types

The problem with the solution

Because each solution provides clear benefits, but neither is ideally suited to sports teams and university fundraising teams, some organizations use both. In these cases, the data warehouse is typically the source for business intelligence use cases (basic reporting and visualizations), while the data lake serves more advanced data science use cases (modeling, predictive analytics). However, running multiple systems introduces both complexity and inefficiency as data is copied or moved between them.

graphic with data warehouse and data lake on the left, each pointing to business intelligence and data science on the right

Using the data warehouse in conjunction with the data lake, the platform serves data for both business intelligence and data science purposes. However, this solution creates significant engineering work to ensure consistency between the data lake and the data warehouse. The cost of managing this ecosystem is quite high.
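
To give a sense of the consistency work involved, here is a minimal sketch of a reconciliation check between the two copies of a data set. It is an illustration only: the function names, the `id` key, and the sample gift records are all hypothetical, and a real pipeline would run a check like this against the actual storage systems rather than in-memory lists.

```python
import hashlib
import json


def fingerprint(rows, key):
    """Map each record's key to a hash of its full content."""
    return {
        row[key]: hashlib.sha256(
            json.dumps(row, sort_keys=True).encode("utf-8")
        ).hexdigest()
        for row in rows
    }


def reconcile(lake_rows, warehouse_rows, key="id"):
    """Report records that are missing from, or differ between, the two copies."""
    lake = fingerprint(lake_rows, key)
    warehouse = fingerprint(warehouse_rows, key)
    shared = lake.keys() & warehouse.keys()
    return {
        "missing_in_warehouse": sorted(lake.keys() - warehouse.keys()),
        "missing_in_lake": sorted(warehouse.keys() - lake.keys()),
        "mismatched": sorted(k for k in shared if lake[k] != warehouse[k]),
    }


# Hypothetical sample data: the same gift records held in both systems.
lake_gifts = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": 250},
    {"id": 3, "amount": 75},
]
warehouse_gifts = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": 200},  # stale value after an update in the lake
]

report = reconcile(lake_gifts, warehouse_gifts)
```

Every data set kept in both systems needs a check of this kind, along with a process for repairing whatever it finds, which is where much of the engineering cost comes from.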

graphic with box around data warehouse and data lake on the left, with arrows pointing from the box to business intelligence and data science on the right

Lakehouse data management solution: a single platform

The lakehouse combines the benefits of both the data lake and the data warehouse architecture and serves data for BI and data science in a single system. The lakehouse design implements data structures and management features similar to those in a data warehouse, but puts them directly on top of low-cost, open-format cloud storage. This structure is ideal for industries that need consistent reporting while maintaining the ability to change data schemas quickly and efficiently.
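
The pairing of consistent reporting with fast schema changes can be sketched in a few lines. The example below is a simplified, in-memory stand-in and not any particular lakehouse product's API: the `LakehouseTable` class, its methods, and the ticketing columns are hypothetical names chosen for illustration. It shows the two behaviors side by side: writes that do not match the schema are rejected, and adding a column does not require rebuilding the existing data.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class LakehouseTable:
    """Hypothetical in-memory stand-in for a table with an enforced schema."""

    schema: dict[str, type]
    rows: list[dict[str, Any]] = field(default_factory=list)

    def append(self, row: dict[str, Any]) -> None:
        # Schema enforcement: columns must match exactly and values must
        # have the declared type (None is allowed as a missing value).
        if set(row) != set(self.schema):
            raise ValueError(
                f"columns {sorted(row)} do not match schema {sorted(self.schema)}"
            )
        for name, expected in self.schema.items():
            value = row[name]
            if value is not None and not isinstance(value, expected):
                raise TypeError(
                    f"{name!r} expected {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        self.rows.append(dict(row))

    def add_column(self, name: str, col_type: type) -> None:
        # Schema evolution: register the new column and backfill
        # existing rows with None instead of rewriting the table.
        if name in self.schema:
            raise ValueError(f"column {name!r} already exists")
        self.schema[name] = col_type
        for row in self.rows:
            row[name] = None


tickets = LakehouseTable(schema={"fan_id": int, "section": str})
tickets.append({"fan_id": 1, "section": "A"})

try:
    tickets.append({"fan_id": "two", "section": "B"})  # wrong type
    rejected = False
except TypeError:
    rejected = True

tickets.add_column("seat_upgrade", bool)
tickets.append({"fan_id": 2, "section": "B", "seat_upgrade": True})
```

Reports built on `fan_id` and `section` keep working after `seat_upgrade` is added, because earlier rows are still valid under the new schema.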

graphic describing the lakehouse with data formats on the left, with arrows pointing to the lakehouse in the middle, with arrows pointing to the various outputs on the right

Lakehouse data management features

Key features of the lakehouse include:

Simple and flexible data management capable of serving multiple business purposes: basic analytics and reporting as well as more advanced artificial intelligence and machine learning applications.
Supports data warehouse schema and architectures
Data governance capabilities, including auditing, retention, lineage
Eliminates need for two copies of data in data lake and warehouse
Open and standardized storage formats give machine learning tools and programming libraries direct access to data
Supports multiple data types, from unstructured to semi-structured and structured data
Facilitation of real-time data applications, such as reporting

Lakehouse data management benefits

Among the benefits of the lakehouse are:

Cost effective
Administrative burden alleviated with single platform
Data governance simplified with single point of control
Schema management simplified
Data flexibility allows for multiple users and use cases
Data adapts readily to multiple use cases
Adaptable to constantly changing technology and data landscape

Why lakehouse data management?

The lakehouse is a data management architecture that radically improves enterprise data infrastructure and accelerates innovation. It offers the structure needed for any member of the team to easily produce reliable business reports, and allows data scientists to access raw data feeds for use in more advanced analytical applications. It is adaptable to ever-changing data environments and ensures that new data sources can be quickly, easily and cleanly integrated with existing data sources. Because of its adaptability and a simpler implementation framework, costs are typically lower than traditional data management solutions in complex data environments like sports and university fundraising organizations.

So, why use lakehouse data management instead of a data lake?

A data lake is the ideal solution for inexpensive ingestion and storage of large quantities of raw data, and the lakehouse mirrors that architecture. In addition, the lakehouse allows:

The extraction of data insights by both business intelligence professionals and data scientists
Easy establishment of data governance
A frictionless reporting framework which is easily modified as your data changes and new use cases are identified
Enforcement of data quality
The ability to link and de-duplicate records relating to data provided by multiple sources
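
As an illustration of the last point, the sketch below links and de-duplicates records from two sources. It rests on a loud assumption: that a trimmed, lower-cased email address identifies the same person in every source. Real matching usually needs more than one field. The function names and the sample ticketing and donation records are hypothetical.

```python
def match_key(record):
    # Assumption: a normalized email identifies the same person across sources.
    return record["email"].strip().lower()


def link_records(*sources):
    """Merge records that share a match key into one profile per person."""
    profiles = {}
    for source in sources:
        for record in source:
            key = match_key(record)
            profile = profiles.setdefault(key, {"email": key})
            for name, value in record.items():
                # Keep the first non-empty value seen for each field.
                if value not in (None, "") and name not in profile:
                    profile[name] = value
    return profiles


# Hypothetical sample data from two separate systems.
ticketing = [
    {"email": "Pat@Example.com ", "name": "Pat Lee", "season_tickets": 2},
]
donations = [
    {"email": "pat@example.com", "name": "", "lifetime_giving": 500},
    {"email": "sam@example.com", "name": "Sam Roy", "lifetime_giving": 50},
]

profiles = link_records(ticketing, donations)
```

Here the two records for Pat collapse into a single profile that carries both the ticketing and the giving fields, while Sam, who appears in only one source, gets a profile of their own.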

Why use lakehouse data management instead of a data warehouse?

In organizations whose data structure or business needs rarely change, the warehouse is the ideal structure for generating business reports that combine more than one data source. In such a setting, cleaning (de-duplicating) the data should be a simple undertaking and, when built correctly, the warehouse should provide easily accessible reporting to non-technical staffers. In complex data environments, however, the lakehouse:

Allows for faster implementation and modification timelines
Offers greater reporting flexibility
Is less expensive to maintain, because the highly structured nature of the warehouse requires more staffers to ensure that minor changes don’t have a major impact
Serves B.I. (reporting) and A.I. (predictive analytics) needs alike

Why use lakehouse data management instead of simply using both a data warehouse and a data lake?

Sports teams and universities need the positive aspects of both the data lake and the data warehouse, and they also require a platform capable of handling data with high variety, velocity, and volume. However, most organizations will find it cost prohibitive to ingest and store two identical data sets. Moreover, maintaining consistency between the two systems puts an undue burden on the in-house data team, whose efforts should focus on generating value-creating insights for their stakeholders.

Published: 06-01-21