Data Lake Operations (DLOps) are concerned with the secure, efficient, and cost-effective delivery of data from any available data source to the relevant business users, enabling them to make informed decisions at the right time.

In this article, we delve into the history of data lakes and underscore the pivotal role of DLOps in shaping effective data lake operations.

The Evolution of the Data Lake

The data lakes we have today were made possible by cloud services like Databricks and AWS’s S3 storage. But first, let’s look at what came before this.

The Beginning

In 2010, early adopters of data lakes, including tech giants such as LinkedIn and Google, faced significant challenges. The data lakes of that era, often referred to as data swamps, struggled with poor governance and metadata management. That these companies pushed forward and found ways to work effectively without any real-world guidance is a testament to the pioneering spirit of the time.

Typically, these early adopters were technology-focused companies that had a history of running large-scale systems in production, and their operations were led by infrastructure-focused software engineers. To these teams, infrastructure wasn’t something spun up by an IT team that managed backups and P1 incidents - the software engineers themselves built and managed the data lake’s infrastructure and operations using automation.

The Cloud Revolution

With the advent of cloud services, particularly AWS and Azure, the data lake landscape underwent a revolutionary transformation. Object storage, with S3 as the de facto standard, became the foundation of the data lake. Once cheap, accessible storage was in place, managed Apache Spark arrived through Databricks and AWS EMR, enabling nearly every other organization to build and maintain a data lake by tying that storage to managed compute that scaled effortlessly, if not cheaply.

The cloud revolution was a genuine game changer - suddenly, organizations could follow a process to consolidate their data into a single format and report from a single source of truth.

However, this democratization brought new challenges. First, there was genuine concern about creating a “data swamp,” which led many organizations to overdo governance, often resulting in numerous redundant layers within the data lake.

The second, more fundamental issue was operational: who supports and maintains the data lake? Unlike early adopters like Google, which had infrastructure-focused software engineers who understood data, enterprise teams often consisted of either infrastructure engineers who lacked data expertise or data engineers who lacked infrastructure expertise. This skills gap, reinforced by traditional IT frameworks that separated development and operations teams, meant data lake operations frequently fell through the cracks.

Given that a large percentage of teams supporting data lakes in today’s organizations are not specialists in both data and operations, it makes sense that the current state of play isn’t the end goal. This is what leads us to DLOps.

What should we consider when thinking of DLOps?

The challenges we’ve outlined - from data swamps and an overabundance of governance to unclear operational ownership and insufficient training budgets - all stem from the same root cause: a lack of structured operational thinking. To address these systematically, we need to focus on three core pillars that underpin effective data lake operations:

  • Security
  • Efficiency
  • Cost

Security

Security has to come first:

  • Who can access the data?
  • Who has accessed the data?
  • What data do you have?
  • What sensitive data do you have?
  • Can sensitive data be extracted somewhere unsafe?
  • Has sensitive data been extracted somewhere unsafe?
  • How does the data platform read data without exposing source systems?
  • How do we manage secrets?
  • How do we remove access from people who no longer require it?
  • How do we categorize the data (see the sketch after this list)?
  • How do we know when data classifications change, and how do we deal with it?
  • How do we purge data that we no longer need?
  • How do we purge data when mistakes happen?
  • How do we monitor pipelines while ensuring that sensitive data does not leak through our operational processes?
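
Many of these questions are answered by process, but some can be enforced in code. As one illustration of acting on data classifications, here is a minimal sketch, assuming PySpark and a hand-maintained list of columns we have classified as sensitive; the paths, column names, and hashing approach are illustrative assumptions, not a prescription.

```python
# A minimal sketch: mask columns we have classified as sensitive before the data
# lands in the curated zone. Paths, column names, and the catalog are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mask-sensitive-columns").getOrCreate()

# Hypothetical classification catalog: columns tagged as sensitive for this dataset.
SENSITIVE_COLUMNS = {"email", "phone_number", "date_of_birth"}

def mask_sensitive(df):
    """Replace classified columns with a one-way hash so they remain joinable,
    while the raw values never reach the curated zone."""
    for column in df.columns:
        if column in SENSITIVE_COLUMNS:
            df = df.withColumn(column, F.sha2(F.col(column).cast("string"), 256))
    return df

raw = spark.read.parquet("s3://example-lake/raw/customers/")  # illustrative path
mask_sensitive(raw).write.mode("overwrite").parquet("s3://example-lake/curated/customers/")
```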

Efficiency

Then, we care about how efficient the data platform is. We need to ensure that:

  • We can deliver the data to the business when it needs it.
  • It is fast to develop new pipelines and maintain existing pipelines.
  • We can deploy changes quickly - we can’t afford a one-week release cycle.
  • We can identify and rectify any issues with the data processing as early as possible (see the sketch after this list).
  • We have a consistent approach to modeling the data that allows for easy use in reporting.
  • We know how code is developed - is it a metadata-driven framework? Python files?
  • We can scale up when we need to and scale back down when the load is lighter.
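
To make the early-detection point concrete, here is a minimal sketch of a fail-fast check that runs immediately after a load, assuming PySpark; the table path, partition column, and thresholds are illustrative assumptions.

```python
# A minimal sketch of a fail-fast data quality check run straight after a load.
# The path, column names, and thresholds are assumptions for illustration.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pipeline-health-check").getOrCreate()

orders = spark.read.parquet("s3://example-lake/curated/orders/")

# Fail fast if today's load looks wrong, instead of letting stale or partial
# data reach downstream reports.
row_count = orders.filter(F.col("load_date") == F.current_date()).count()
null_customer_ids = orders.filter(F.col("customer_id").isNull()).count()

if row_count == 0:
    raise RuntimeError("No rows loaded for today's partition - the upstream feed may have failed.")
if null_customer_ids > 0:
    raise RuntimeError(f"{null_customer_ids} rows are missing customer_id - rejecting the load.")
```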

Cost

Every organization has a different idea of how much the data platform should cost, and it is our job to understand and balance the business’s needs against the available budget. We need to deliver value for money while spending an amount appropriate to each organization. There isn’t a one-size-fits-all approach, but at a minimum, monitoring costs and reacting early must be key attributes of a data lake operations team.
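
As one example of monitoring costs and reacting early, here is a minimal sketch that pulls yesterday's spend from AWS Cost Explorer via boto3 and compares it to a budget; the budget figure and the print-based alert are illustrative assumptions, and in practice you would scope the query to the data platform's accounts or tags.

```python
# A minimal sketch of daily cost monitoring using AWS Cost Explorer via boto3.
# The budget and the alerting mechanism are illustrative assumptions.
import datetime
import boto3

DAILY_BUDGET_USD = 250.0  # hypothetical daily budget for the data platform

ce = boto3.client("ce")
today = datetime.date.today()
yesterday = today - datetime.timedelta(days=1)

response = ce.get_cost_and_usage(
    TimePeriod={"Start": yesterday.isoformat(), "End": today.isoformat()},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
)

spend = float(response["ResultsByTime"][0]["Total"]["UnblendedCost"]["Amount"])

if spend > DAILY_BUDGET_USD:
    # In practice this would raise a ticket or post to a chat channel.
    print(f"ALERT: yesterday's spend ${spend:.2f} exceeded the ${DAILY_BUDGET_USD:.2f} budget.")
else:
    print(f"Yesterday's spend ${spend:.2f} is within budget.")
```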

How do we measure our data lake operations?

The guiding principle of any data lake team should be to define the process for getting data into, through, and out of the data lake with the speed, reliability, and cost the business requires, and then to automate and optimize that process.
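
What does measuring that look like in practice? As a minimal sketch, assuming we already record each pipeline run in a small metadata table of our own (the table path, columns, and the 07:00 deadline are illustrative assumptions), we can compute how often data actually lands when the business needs it:

```python
# A minimal sketch of one delivery metric: what share of runs landed before the
# agreed deadline? The table path, schema, and 07:00 deadline are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("on-time-delivery").getOrCreate()

# Hypothetical run log we maintain ourselves: one row per pipeline run.
runs = spark.read.parquet("s3://example-lake/ops/pipeline_runs/")

on_time = runs.withColumn(
    "on_time", (F.hour("completed_at") < 7).cast("int")  # landed before 07:00
)

summary = on_time.groupBy("pipeline_name").agg(
    F.avg("on_time").alias("on_time_rate"),
    F.count("*").alias("runs"),
)

summary.orderBy("on_time_rate").show(truncate=False)
```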

Why does it matter?

The term ‘data lake’ implies that the lake itself is akin to a large body of water, with waves rippling over the top, and underneath, anything could be happening, from a few gently swaying plants to utter chaos.

The deceptive thing about data lakes is that they appear to “just work” on the surface. Your dashboards update, reports are generated, and stakeholders get their data. But underneath that calm surface, you might have:

  • A critical customer pipeline that’s been failing silently for three days, with downstream reports showing stale data that leadership is using for a major product decision
  • Duplicate data processing jobs running because someone forgot to decommission the old pipeline, quietly doubling your compute costs
  • Sensitive customer PII flowing into datasets that the marketing team can access, creating compliance violations you won’t discover until an audit or a business-ending security incident
  • A data scientist’s experimental model accidentally promoted to production, now influencing pricing decisions with logic nobody understands

Unlike a crashed website or failed API where problems are immediately visible, data lake issues can go undetected for weeks or months while silently undermining business decisions. By the time you realize your customer churn model was trained on corrupted data, you’ve already lost customers acting on bad insights.

This is why operational discipline isn’t optional - it is the difference between a data lake that quietly serves your business and one that quietly sabotages it.

There are few certainties in this world, but at one point or another, I guarantee that you will have data feeds with missing or incorrect data.

Evaluating Operational Effectiveness

To improve our operational effectiveness, we need to start by understanding exactly what we have today. To understand how effective our DLOps are, we need to evaluate our performance across the three core pillars:

Security Effectiveness:

  • How do we maintain security?
  • How do we audit who is accessing the data?
  • How reliable is the data?

Efficiency Effectiveness:

  • How does data flow through the data lake?
  • How do we deploy code?
  • How do we test code and data?
  • How do we monitor for failures and anomalies (see the sketch after this list)?
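
As a simple illustration of the monitoring question, here is a minimal sketch that compares today's row count for each dataset against its recent average and flags large deviations, assuming PySpark and a hypothetical daily metrics table that we populate from each load:

```python
# A minimal sketch of anomaly detection on load volumes: compare today's row count
# to the trailing 14-day average. Table path, columns, and thresholds are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("volume-anomaly-check").getOrCreate()

# Hypothetical metrics table: one row per dataset per day with a row_count column.
metrics = spark.read.parquet("s3://example-lake/ops/load_metrics/")

history = metrics.filter(
    (F.col("load_date") >= F.date_sub(F.current_date(), 14))
    & (F.col("load_date") < F.current_date())
)
baseline = history.groupBy("dataset").agg(F.avg("row_count").alias("avg_rows"))

today = metrics.filter(F.col("load_date") == F.current_date()).select("dataset", "row_count")

# Flag any dataset whose volume today deviates more than 50% from its recent average.
flagged = today.join(baseline, "dataset").filter(
    F.abs(F.col("row_count") - F.col("avg_rows")) > 0.5 * F.col("avg_rows")
)

flagged.show(truncate=False)
```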

Cost Effectiveness:

  • How much is it costing? Are costs going up or down?
  • How can we optimize in terms of cost and performance?

None of these questions has a right or wrong answer; rather, each sits on a scale of how effective the current process is, which helps us understand what we can improve, why and how, and in which order.

A data lake can never be a single snapshot that continues working forever. Data constantly changes, requirements change, cloud platforms change, people change, and the business changes - the only way to deal effectively with all these changes is by applying Data Lake Operations (DLOps) to our data lakes and platforms. Welcome to DLOps.


To stay up to date with new articles and DLOps news, please subscribe.
