Episode #2

Evolving Data Architectures: Lakes, Warehouses, and Lakehouses

Two leading experts in data management and analytics (D&A) will put their heads together to explain the ever-evolving data architectures of lakes, warehouses, and lakehouses.


Guest

Philip Russom

Data Industry Analyst, Dataverse.ai

Guest

Scott Geissler

Director of Business Development, ActualTech Media

Episode Overview

Data Warehouse Architecture

Today’s warehouse, even when built anew on cloud, resembles warehouses of the recent past, but adjusted for strong trends and influences. These include: data science and advanced analytics; new approaches to data modeling; streams and other forms of real-time data; new technologies and practices for storage.

Data Lake Architecture

The earliest data lakes were built on Hadoop by data scientists. Today’s lakes are almost always on cloud. They still support data science, but an enterprise-scope lake will also support many other business functions, including data warehousing, sales and marketing, digital supply chain operations, data archiving, and so on. The data lake’s architecture has evolved into greater complexity, as it supports a lengthening list of use cases.

Data Lakehouse Architecture

A modern data and analytics architecture invariably includes both a data lake and a data warehouse, which interoperate tightly and constantly. When these two micro architectures work together in a larger macro architecture, there is some redundancy between the two, and data travels back and forth between them. To reduce the architectural redundancy and make data movement more efficient, the so-called “lakehouse” has arisen as a convergence of warehouse and lake architectures.


Transcript

Scott Geissler: 00:09

Hello. Welcome. Thank you for joining us today on the Dataverse.ai podcast, addressing evolving data architectures. We’ll be discussing evolving data architectures around data warehouses, data lakes, and data lakehouses, and the issues that practitioners face on a daily basis working with these types of architectures. I’m happy to be joined by my good friend and former colleague Philip Russom, who joined us on our first podcast as well. Philip has 25 years of experience as an IT industry analyst, researching user best practices, vendor products, and market trends in data management and analytics. This includes several data disciplines, such as data warehousing, data lakes, data integration, data quality, hybrid data architectures, cloud data management, data governance, analytics, and databases and other data platforms. Philip has worked with most of the world’s leading IT analyst organizations, such as Gartner, Forrester, the Giga Information Group, TDWI, and Hurwitz Group. In those positions and others, he produced over 650 research reports, magazine articles, speeches, webinars, vendor briefings, and advisory consulting engagements. Before becoming an industry analyst, Philip worked for database software vendors as a product manager, product marketer, and documentation writer. I’m pleased to turn it over now to our expert speaker, Philip Russom.

Philip Russom: 01:50

Hey Scott, it’s good to see you again, and even better to work with you again.

Scott Geissler: 01:55

Likewise. Good to see you too.

Philip Russom: 01:57

So yeah, thanks for that great introduction. What I’m gonna do today, like Scott was telling you folks, is talk about what data architecture is, but I’m gonna do that very briefly, because the real meat of today’s presentation is to look at some evolution that’s going on in three very prominent data architectures, namely the data warehouse, the data lake, and the new one, the data lakehouse. For those of you attending, I assume you have some experience with data. You might even be a data professional, like a data warehouse person or a data scientist, data analyst, that sort of thing. So I’m gonna assume we have people with knowledge of those things here in the audience. And that’s why I can go to some of the more pressing issues in warehouses, lakes, and lakehouses today, as opposed to spending a heck of a lot of time defining them.

Philip Russom: 02:50

But you know, for those of you who might be new to this stuff, stay with us, because I will have definitions and I think you will be able to follow along. And in fact, in that realm, let’s start with the definition of what a data architecture is. You know, something I have to tell people, and tell ’em repeatedly because they do sometimes forget, is that a data architecture is really about the data first and foremost. Sometimes we get so wrapped up with the tools and platforms, we forget about the data. But ultimately, the quality of your data architecture, how usable it is to the business, how well it represents the business, those kinds of questions involve questions like: did we find the right data? Did we collect it properly?

Philip Russom: 03:31

Did we repurpose data to fit this newer architecture, and that sort of thing. But you know, the data architecture is also about data about data. That’s the way we usually define metadata. And there’s more than one form of metadata. Technical metadata is typically a description of data from technology viewpoints: where’s the data stored, what data types are allowed for this data element, what are the allowable data values, and so forth. But there are other ways to describe data, and the data catalog is very much an upcoming addition to modern data architecture. So metadata’s been with us since the origin of computing systems, and it’s still very important for describing data. So you can find it in an architecture, but it’s being extended through similar practices like the data catalog and data lineage.
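As an illustration of the technical metadata described above (storage location, allowed data type, allowable values), here is a minimal Python sketch. It is not from the episode: the `TechnicalMetadata` class, its field names, and the sample catalog entries are all hypothetical, and a real metadata or catalog tool would track far more than this.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TechnicalMetadata:
    """Hypothetical technical-metadata record for one data element."""
    element: str        # name of the data element
    location: str       # where the data is stored
    data_type: type     # Python type that values must have
    allowed_values: frozenset = field(default_factory=frozenset)  # empty means any value

    def accepts(self, value) -> bool:
        """Return True if value matches the declared type and allowed values."""
        if not isinstance(value, self.data_type):
            return False
        return not self.allowed_values or value in self.allowed_values


# A tiny catalog: an inventory of data elements keyed by name.
catalog = {
    m.element: m
    for m in [
        TechnicalMetadata("order_status", "warehouse.sales.orders", str,
                          frozenset({"open", "shipped", "cancelled"})),
        TechnicalMetadata("units_produced", "lake/landing/shop_floor", int),
    ]
}

print(sorted(catalog))                              # inventory of known elements
print(catalog["order_status"].accepts("shipped"))   # value is in the allowed set
print(catalog["order_status"].accepts("lost"))      # value is not in the allowed set
```

Keying the records by element name is what turns a pile of descriptions into something closer to a catalog: one place to look up what exists and where it lives.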

Philip Russom: 04:25

Now this data architecture assumes that there are numerous data sets that have been brought together and integrated for some common goal. A lot of the examples I’ll give you today are in data analytics. And so when we build a data architecture for the warehouse, the lake, and the lakehouse, typically we are bringing together a very long list of data sets, and we are integrating them for analytic goals. See what I mean? But a lot of what I’m talking about today with data architecture applies to other goals. A lot of people build a similar data architecture for operational data. I have a lot of clients in manufacturing, and they tend to collect a lot of information about what’s going on on the shop floor so that they can look at the numbers and say, are we gonna meet our service level agreement today?

Philip Russom: 05:10

Are we gonna produce the production yield that we have promised in terms of numbers of units and that sort of thing? So it’s not just analytics. There are all kinds of reasons why you would build a data architecture and bring data into it. And for a lot of what I talk about today, you’re gonna automatically think about data at rest in storage managed by a database management system. And, you know, most of the data in a data architecture fits that description. But don’t forget some of the most exciting data, some of the data that gives you a very different business value, is data in motion. So a lot of a data architecture is gonna be built for data at rest in storage, but don’t forget data in motion, because you’re gonna need real-time technologies for capturing that data to get value out of it.

Philip Russom: 05:56

And that points to the fact that data does travel into and across the data architecture. So the architecture has to have the appropriate data integration and other integration technologies to get the data in and to move it around inside the architecture. And later we’ll see that data gets very much improved within the architecture. So a lot of the same data integration tools will address data quality problems and other improvements. Don’t forget that data architectures do depend heavily on data platforms and data management tools. That’s what the data architecture runs on. Those things capture and manage the data for you. But again, don’t get so wrapped up in the platforms and tools that you forget the first priority is data. And then finally, a true data architecture really should unify everything above here. So all these bullets should be unified somehow.

Scott Geissler: 06:50

So obviously, Philip, your description of data architecture makes a ton of sense. And I was with you right up until that last bullet. How do you unify data architectures? Explain that, please.

Philip Russom: 07:07

Yeah, yeah. There are multiple things, and you should apply multiple methods to unifying stuff. I think one point you get from this slide is that there are a lot of components that go into an architecture. So how do you get those to work together? That’s a similar question. One thing is, I talked about metadata and the data catalog. You can unify a data architecture through your descriptions of the data, right? And that’s one of the things people do with metadata and the data catalog. You can go to your metadata tool or catalog tool, and you can literally see an inventory of data that’s available in the architecture. So that gives you a unified view of data assets that are available. I also mentioned data integration. As data gets moved in, data from many sources is integrated and aggregated in the architecture.

Philip Russom: 08:01

That itself is a different form of unification. And a lot of the use cases for the analytics architecture will involve analytics tools that scan many data sets and sort of unify views of data elements that come from many different data sets into some sort of analytic representation. So if you look at a lot of tools for statistical analysis, data mining, and the newer stuff for machine learning and predictive analytics, they actually unify data as they work on it. So you get the idea: there are multiple ways to unify an architecture.

Philip Russom: 08:44

So, you know, with a data architecture, it helps to give people something visual to think about. And so let me describe an image of data and analytics through a reference architecture. As with most reference architectures, I’m not gonna try and put every detail in there. I’m just gonna have kind of the general scope of it and this sort of thing. So the way to think about a modern data architecture is that it typically has three big areas within it. And each big area is gonna have all kinds of details inside of it. So I always like to start my descriptions with the big area in the middle. And nowadays that’s called a data fabric. If you’re not familiar with the data fabric, the data fabric itself has its own local micro architecture, right?

Philip Russom: 09:32

So the data fabric is a kind of architecture within an architecture, and yes, architectures quite often include each other that way. So the data fabric is mostly about data integration and new forms of data integration, like data pipelining. And that helps us to get the data in and move it around. We also need a semantics layer where we describe data, like we talked about with metadata and the catalog. And then the data fabric also has a special area for data in motion. So real-time data has to have some functionality, and that’s all in the data fabric. So the data… I personally think of the data fabric as kind of a backbone for the larger data architecture, right? Because it’s what gets the data in, moves it around, improves the data, processes the data multiple ways for many different use cases.

Philip Russom: 10:21

But, you know, eventually the data goes to rest in storage, and storage is very important. And that’s where a lot of new product action has been in recent years. There are tons of new database management systems, many of them cloud based, many open source. The older vendors have come up with totally new versions and this sort of thing. So there’s a lot of great activity going on with the database management platforms that are available today. And also, storage can be manipulated by various designs. And that’s really where the data warehouse, the data lake, and the lakehouse live. And here in a minute, I’m gonna leave this and we’re gonna drill into the storage area of the architecture, because again, that’s where the lake, warehouse, and lakehouse reside. And then finally, this architecture also has a lot of end user tools for business intelligence in the form of reporting or dashboards or advanced analytics, maybe so advanced that it’s more like data science, where you’re doing machine learning and artificial intelligence.

Philip Russom: 11:25

It’s quite a list, isn’t it? So all these things go together into the architecture, but the architecture itself tends to have three very big and broad areas within it. So at this point I’m gonna drill into storage so that I can talk about some of the recent developments and upcoming developments around the data warehouse, lake, and lakehouse. And so I’m gonna start with the data warehouse on that one. And for those of you who might be new to the data warehouse, let me give you a quick definition. You know, I used to work at Gartner as an analyst, and so I’m just gonna borrow some Gartner definitions here. So the Gartner definition says that the data warehouse is a storage architecture designed to hold data extracted from transactional systems, operational data stores, and external sources.

Philip Russom: 12:12

The warehouse then combines that data in an aggregate, summary form suitable for enterprise-wide data analysis and reporting for predefined business needs. You know, I think this is a pretty good definition. I have always thought of the warehouse as being very much dependent on storage, because that’s where the data goes. And we could step back from that and go back to my first bullet defining architecture, which is, it’s mostly about the data. So the data warehouse is an architecture, but it’s still mostly about data, because that’s what data architectures do. Right? And so the data warehouse is designed by technical users. There are a few you can buy from vendors, but most warehouses are designed in-house by data specialists. And the warehouse does combine data into aggregates. In other words, you just take data from X number of sources and fold it into one source. It just makes data more convenient for some of the common use cases and the most common use case of the warehouse.

Philip Russom: 13:11

And in fact, its very first use case is business reporting. And despite all the new analytics that’s coming down the road, companies still have to have business reporting, because that’s the number one way that most people get their data: they get reports, and the reports contain data. So you just can’t get away from reporting. But nowadays data warehouses are under pressure to support a wider range of use cases, some of them involving massive amounts of very raw, unimproved data, and that would be typical of machine learning and artificial intelligence. So the warehouse is still true to its roots in reporting, but it also has to address many different use cases, including all the way out there to machine learning and artificial intelligence.
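The idea of taking data from a number of sources and folding it into one aggregate for reporting can be sketched in a few lines of Python. This is an illustration only and not something shown in the episode: the two source lists, the `(month, region, amount)` row shape, and the `aggregate_revenue` function are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical rows of (month, region, amount) extracted from two separate source systems.
web_orders = [("2024-01", "east", 120.0), ("2024-01", "west", 80.0)]
store_orders = [("2024-01", "east", 50.0), ("2024-02", "east", 70.0)]


def aggregate_revenue(*sources):
    """Fold rows from any number of sources into one summary keyed by (month, region)."""
    totals = defaultdict(float)
    for source in sources:
        for month, region, amount in source:
            totals[(month, region)] += amount
    return dict(totals)


summary = aggregate_revenue(web_orders, store_orders)
for key in sorted(summary):
    print(key, summary[key])
```

A report built on `summary` no longer needs to know that the numbers came from two systems, which is the convenience the aggregate is meant to provide.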

Philip Russom: 13:54

And you know, there are some architectural ramifications for today’s data warehouse that I’d like to talk to you about, because this is kind of the… you know, these are the things that are changing. Some of this is still in process. So you need to think about this for the future. And you know, I’ve already talked about how a data warehouse architecture inherently involves multiple data sets. And when you look into a data warehouse, when you look at its internal architecture, it’s typically a series of data sets per business domain or subject area. These are sometimes called data marts or operational data stores. And when you look inside of these, what you see is a business domain like financials, right? So you’ll see financial data that’s been arranged just right. So it’s ideal for financial reporting for some kind of financial analysis. Data warehouses typically have a lot of operational information, so your chief operations officer has plenty of data to figure out, are we an efficient organization or are we a coordinated digital organization?

Philip Russom: 14:56

Another common subject area would be lots and lots of customer data for sales and marketing business domains, right? So you get the idea that data warehouses are typically organized according to business departments or business units, quite often called business domains. Sometimes they’re more organized by the needs of special data structures, but your common data warehouse has lots of relational data. One of the things that’s unusual about the warehouse is it has dimensional data, time series, sometimes hierarchical data. So it does have some unusual data models, but usually it’s relational. So data warehouses going forward, even the new cloud ones, and I see a lot of those, are still organized by business domains. So that piece is still with us. That part of the warehouse has not changed.

Philip Russom: 15:48

Now let me point out that few data warehouses have a monolithic architecture. A monolithic architecture is where you’ve got one instance of one brand of some platform, and quite often it’s a database management system. So a lot of people will tell me, yeah, we’ve got a warehouse and all of it is on Oracle, IBM, Microsoft, you know, the usual. And I say, no, no, you need those databases to manage the data, but again, the data is your actual warehouse, right? So there is some confusion there also, but my main point is this monolithic architecture is where data warehousing started 30 years ago. Nowadays, monolithic architectures are only about 15% of the warehouses in the world. So most data warehouse architectures are distributed over multiple data platforms. And why would you do that? Why wouldn’t you keep it simple and keep all your data on one database management system brand?

Philip Russom: 16:44

Well, the warehouse, if you think about the warehouse and the data, it’s actually very diverse, isn’t it? You have all these different business departments and functions that have to have data very specially arranged for them. As companies move into more forms of analytics, every form of analytics has its own requirements for how the data must be arranged. So the way we prepare data for business reporting is really extremely different from the way we prepare data for machine learning or statistical analysis and so forth. So when you have really diverse requirements for data and for the business use case, quite often, that’ll lead you to buy multiple data platforms because each platform from a vendor—open source, et cetera, they will be optimized with certain use cases in mind. So if you have really diverse use cases, you may end up with multiple data platforms, and that has actually become the norm.

Philip Russom: 17:38

And today this kind of distributed architecture is even crazier, because pieces of your warehouse today can be on-premises while others are on cloud. So the distributed data warehouse architecture can even span hybrid architectures in cloud environments. And then finally, there’s this thing called the logical data warehouse, and it’s where you have lots of different data sets, but you put this extra layer on top of them. You know, you can use data semantics, like metadata, the catalog, and stuff I haven’t mentioned yet, like data virtualization. So you can use special tools of semantics to draw a sort of virtual view of data, even though the data in storage does not match up physically. Right? So one of the strong directions in data warehouse architecture is toward the logical data warehouse.
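To make the logical data warehouse idea more concrete, here is a rough Python sketch of a virtual view over two data sets whose physical layouts do not match. It is an illustration under stated assumptions, not a description of any real data virtualization product: the `LogicalView` class, the field mappings, and the two stand-in "platforms" are hypothetical.

```python
class LogicalView:
    """Hypothetical virtual view: maps each physical source's fields to shared names."""

    def __init__(self):
        self._sources = []  # list of (fetch_function, field_mapping) pairs

    def register(self, fetch, mapping):
        """fetch returns an iterable of dicts; mapping is {shared_name: source_field}."""
        self._sources.append((fetch, mapping))

    def rows(self):
        """Yield rows in the shared shape, reading each source at query time."""
        for fetch, mapping in self._sources:
            for record in fetch():
                yield {shared: record[src] for shared, src in mapping.items()}


# Two stand-in "platforms" with physically different layouts.
on_prem_customers = [{"cust_id": 1, "cust_name": "Acme"}]
cloud_customers = [{"id": 2, "name": "Globex"}]

view = LogicalView()
view.register(lambda: on_prem_customers, {"customer_id": "cust_id", "name": "cust_name"})
view.register(lambda: cloud_customers, {"customer_id": "id", "name": "name"})

unified = list(view.rows())
print(unified)
```

Nothing is copied into a new store here. The view only holds the mappings, so a row added to either source shows up the next time `rows()` is called.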

Scott Geissler: 18:32

Interesting. So Philip, you and I have been in this business for a while. And one of the things that I started hearing at least eight to 10 years ago was that the data warehouse was old. It was expensive, it required a lot of pre-processing and work. Why is it still around? How could it be that the data warehouse is still here given all of those issues that we had been hearing about as many as eight years ago?

Philip Russom: 19:00

Yeah. Well, let me start by saying that there is truth to those criticisms. To gather such diverse data, and then to figure out how to arrange that data, remodel it, and get it ready for a wide range of use cases, that’s time consuming. And that’s one of people’s complaints, you know, “I need a new body of reports. I need a new set of dashboards. I need a new predictive analytic model. And you’re telling me it’s gonna take three months, six months.” And you know, we just don’t have the patience for that anymore because of the pace of business. Sure. So data warehousing can be slow. It can be expensive, and it’s not so much the data platforms and tools you’re buying. Yeah, those cost money, believe it or not. It’s the very high payroll expenses.

Philip Russom: 19:47

So the kind of data engineers you need, and if you’re doing a lot of analytics, the kind of data scientists and data analysts, boy, those are really expensive people. Interesting. So we do hear complaints about the warehouse. So those are challenges, but you know, there are tens of thousands of warehouses in the world, and they exist because they provide business value. So if you want really high quality reporting that you can trust, where you know that data was gathered very carefully and it was put together in ways that we can audit and therefore answer our trust questions, and also if you want data that’s free of data quality issues, which always pull down the use cases, right, then the data warehouse is the way to go. So organizations, and we talked earlier about how you can’t let go of business reporting.

Philip Russom: 20:36

It’s still a requirement, and there’s really no other platform that can do what I call the multi-source report. So a lot of times, especially at the higher levels of the organizational chart, your chief officer has reports where data has to go into their overview of the enterprise. And because it’s an enterprise-scope view of things through that report, data has to come from many sources and be carefully modeled so that the data can be melded and fit into one report. So there’s so much high value for that sort of thing. And really the warehouse is the only platform that does that with the kind of trust level that you need.

Scott Geissler: 21:14

Excellent. And, you know, I think more recently, Philip, you hear a lot about data warehouse extinction at the hands of the data lake or the data lakehouse. So what’s your take on that? What do you see there?

Philip Russom: 21:29

Yeah, that’s a great one. Remember Hadoop? Yeah, Hadoop showed up about 15 years ago and a lot of people said, oh, Hadoop can replace my data warehouse. And then we realized, for a variety of reasons, no, it can’t. Because there are certain things we need for that: the many data sets, that trust level, the documentation through metadata that’s essential to data audit. And Hadoop’s just not up to that stuff. However, Hadoop is up to things like data science, because a lot of data science, machine learning, artificial intelligence, those are really tolerant of data in poor condition, right? They don’t really need the highest data quality. They don’t really need the best metadata. They just don’t need that. And for a lot of analytics, the analytics tool sort of makes compensations as it scans lots and lots of data sets in massive volume.

Philip Russom: 22:22

So we figured that out. Yeah. Okay. So Hadoop’s good for that. What did people do on Hadoop? Well, they did that kind of data science, and eventually that became called the data lake, right? So we learned the data lake from Hadoop. Hadoop’s long gone, because it had a lot of problems, right? But the data lake is still with us, and the data lake is still primarily built for the data scientist. Now, the data lake turns out to be a very flexible thing. It’s kind of like the warehouse, and it can support many different, highly diverse use cases. So a lot of organizations build their data lake first for data science, and they even cover other forms of analytics. Some reporting is now supplied out of the lake and still out of the warehouse, this sort of thing.

Philip Russom: 23:09

So the lake’s come a long way and it does a lot of great stuff, especially for data science. And see, data science typically needs pretty raw data in massive volumes, sometimes petabyte scale. And if you put that on a warehouse platform and pay for capacity, that’s a lot of money. That’s outrageous. And so for those kinds of data volumes, the lake is more affordable. Elasticity makes the lake scale up to massive volumes and the really heavy data science workloads that go with it. So I think we’re very lucky to have the lake, but it doesn’t replace the warehouse. It actually is a complement to it. And that lets me just segue into data lake architecture. So the data lake’s kind of like a warehouse. It’s a concept that consists of a collection of storage instances of various data assets.

Philip Russom: 24:02

And these assets are stored in a near exact or even exact copy of the source format that they came from. And these copies of data in the lake are in addition to the originating data stores. And so why would the lake be set up this way? Well, for the data science that I was just describing, you do need data from a very wide range of stuff, especially for machine learning. For machine learning, to do auto ML, to train an ML analytic model, you need data from many different sources to represent many different business entities and many different business processes… you know, many different kinds of things. So the lake’s a great place to bring diverse data together for analytics that really thrive best with diverse data. We can make the same argument about statistical analysis, right?

Philip Russom: 24:56

So the lake turns out to be really good, in a lot of ways better than the warehouse for this kind of thing, because the lake scales up to massive data volumes better than the average warehouse will. So those are some reasons for building data lakes. You know, the earliest data lakes were sort of thrown together without much thought for architectural design. And a lot of them became what’s called a data swamp. And a data swamp is where you just did not curate the lake. You allowed anybody to dump data in there. You did not document the data with metadata and other semantics. So nowadays we’ve learned from those lessons. We realize the lake does need design. It does need an architecture, and it does need things that keep it from being the swamp, namely data curation and lots of good metadata.

Philip Russom: 25:51

And you know, the data lake, again, is an architecture that organizes storage, kinda like the warehouse does. Right? But because data lakes are being built with a lot of new tools and new platforms, almost all of them cloud based, by the way, and many of them also open source, we’ve got a whole new vocabulary. So the data units within a lake are just different from what you’ve done on-premises. So if you’re gonna do data management on cloud, that’s great, but there will be some learning and training to go with that, so that you learn how data lakes are organized by data zones, data buckets, tenants, folders, object store objects. It’s just different from what you’ve been doing on-premises, but it’s not exactly an alien world.

Philip Russom: 26:40

You can figure it out. So be prepared to learn those new pieces of the architecture. Data lakes typically have a number of data zones within them. But the biggest, and the one that gets used the most, is the data landing zone. So this is where data lands when it enters the data architecture, right? And again, we’re assuming the data lake is part of a larger data architecture, so data coming into that architecture comes in and lands in the lake first, and then from there, data gets triaged and routed. Real-time data can be captured as well. But typically that data starts in the landing area. And so that landing area can be treated as a kind of archive of this source data that’s brought in. And we store the data in the landing area exactly in its arrival condition.

Philip Russom: 27:33

We’re not trying to improve it, because we’re gonna improve that data as we pull from it and create new data sets for a variety of use cases. And it’s a great idea. Really, the big idea behind the lake is not its volume and stuff like that. The big idea behind the lake is to capture source data and keep it in its arrival state, because once we’ve changed it for improvements, it’s really a different data set. So a lot of times you can’t tell the future. You don’t know what reports and analyses you’re gonna need in the future, but if you keep all this source data in its original condition, you can go back to that source and address pretty much any kind of reporting or analytics as it shows up in the future.

Philip Russom: 28:16

So that’s one of the big ideas really behind the data lake. And of course that area becomes an important source for data scientists and other analytics. And they can just keep going back to that source with all kinds of analytic use cases and reporting. A lot of people are using self-service access to that data as well. So you get the idea: the landing zone is a very big, very flexible, and very important piece of the data lake. And it is an architectural component. And then from data landing, data gets refined and staged, because we need to refine data to get it ready for a wide variety of use cases. And all of this refinement process happens once the data comes into the lake through the landing zone.
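The landing zone pattern described above (keep source data in its arrival state, and refine copies of it into new data sets) can be sketched in Python as follows. This is a minimal illustration, not something from the episode: the `LandingZone` class, the batch name, and the `refine_for_reporting` rules are hypothetical, and a real lake would typically sit on cloud object storage instead of an in-memory dictionary.

```python
import copy


class LandingZone:
    """Hypothetical landing zone: stores each arriving batch unchanged."""

    def __init__(self):
        self._batches = {}

    def land(self, name, records):
        """Store a batch exactly as it arrived; refuse to overwrite an existing batch."""
        if name in self._batches:
            raise ValueError(f"batch {name!r} already landed")
        self._batches[name] = copy.deepcopy(records)  # keep our own untouched copy

    def read(self, name):
        """Hand out a copy so downstream refinement cannot alter the arrival state."""
        return copy.deepcopy(self._batches[name])


def refine_for_reporting(records):
    """One example refinement: drop rows with missing amounts, normalize region names."""
    return [
        {**r, "region": r["region"].strip().lower()}
        for r in records
        if r["amount"] is not None
    ]


zone = LandingZone()
zone.land("orders_2024_01", [
    {"region": " East ", "amount": 120.0},
    {"region": "WEST", "amount": None},
])

report_ready = refine_for_reporting(zone.read("orders_2024_01"))
print(report_ready)                  # refined data set for one use case
print(zone.read("orders_2024_01"))   # landed data is still in its arrival condition
```

Because the refined list is a separate data set, a later use case with different rules (for example, one that wants the rows with missing amounts) can go back to the same landed batch.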

Scott Geissler: 28:59

Fascinating. So Philip, when I listen to your description of the lake’s landing zone, and also how data, you know, gets refined in the data lake, it reminds me of the kind of processing we’d expect from the old data integration tools. Given that, don’t data warehouses have a long history of data landing and repeated passes for processing? I guess what I’m trying to say is it sounds like a data lake is redundant with data integration and data warehousing. Is that the case, and how do data professionals reconcile or rationalize the redundant functionality?

Philip Russom: 29:41

Yeah. So you noticed, did you? Yeah, there’s a lot of overlap between the lake and the warehouse, isn’t there? A point I’m gonna make when I get to the lakehouse is that if you’re doing data landing and staging on a lake, and then you’re doing landing and staging on a data warehouse, that is indeed redundant, just like you said, Scott. Okay. And so you’re burning up a lot of storage because of the redundancy, and you’re burning up very expensive data engineering time. If you bought two completely different sets of data integration tools and databases for the two of them, then there are problems there, because that stuff costs money, but also you have to train people on multiple tool sets and track these tool sets and try to keep up with… you get the idea.

Philip Russom: 30:32

So this is one reason why we’re seeing an architectural shift when people put a lake and a warehouse next to each other. So there’s a variety of things and I’ll hit ’em in the next couple of slides, but in a nutshell, one of the reasons we have the lakehouse is to slam together the lake and the warehouse, so that instead of the redundant landing and staging, you’ve got one landing and staging area. There are also a lot of temporary data sets that get made when the lake and the warehouse are kind of independent of each other. So you can get rid of that redundancy as well. So that sets us up. We’re talking about the data lakehouse, so I’m just gonna move on to that. So here’s another Gartner definition. It says that a data lakehouse is a converged data architecture that combines and unifies the architectures and capabilities of a data warehouse and a data lake.

Philip Russom: 31:24

It is usually deployed on a single platform and that platform is usually cloud based. So this setup enables data and analytics leaders to reap the leading benefit of the data lakehouse: the reduction of architectural redundancies, just like we were talking about a minute ago. And so let’s talk about some of the architectural ramifications with this kind of lakehouse. And you know, one of the benefits is if you do it right, the lakehouse unifies the data lake and the warehouse architectures, and you can reduce redundancies, but you can also make it easier and faster to do things like data ingestion, data improvement, use by the warehouse and other analytic practices. Because the data refinement process, when stretched over lake and warehouse, tends to be kind of overly complicated. You can simplify that, make it faster, more agile through the lakehouse.

Philip Russom: 32:18

So that’s one benefit we’re seeing there. But there are also some issues. One is that a lakehouse still has a lake and a warehouse in it, right? And so I talked earlier about how some lakehouses, I’m sorry, some data lakes, not the lakehouse, but some data lakes are designed for data science. And that’s a pretty narrow scope. They can be very powerful for data science, but that’s different from what I call the enterprise data lake. So one of the trends with the data lake is toward a very big data lake that supports a long list of use cases in both analytics and operations, as well as information lifecycle management issues like data archiving, et cetera. So typically the lake that’s in a lakehouse tends to be a more narrowly defined, kind of stripped-down lake, because the lake is only there to support the warehouse. See what I mean? The lake in a lakehouse is rarely an enterprise-scope lake. It’s really a data warehouse-scope lake with that scope stretched out to also cover data science. So that’s one ramification for this. And then also just beware, we talked about the swamp earlier, right? So the lake and the lakehouse, if you don’t take care of it, if you don’t curate it, give it proper metadata, it can also become a swamp.

Scott Geissler: 33:43

Philip, what is a data swamp? I mean, clarify that for me, because I’m laughing at the terminology here.

Philip Russom: 33:50

Yeah. What makes it a swamp is that the lake is really not very usable. And if you don’t describe data with metadata, catalogs, et cetera, you can’t find the data. And also, if you do find the data, then you’ll find that, wow, the same data is in here three times, which version do I use? And so there becomes a trust issue, and if users don’t trust data, they don’t use it. So there’s no value, right? Also, with curation, it’s important to have a human being be a curator, kind of a gatekeeper for the lake, to keep it from becoming a swamp. So one of the things a curator does is say, No, you cannot just dump data in my lake anytime you want. You have to talk to me, tell me what you want to bring in.

Philip Russom: 34:35

And I’ll make sure that it’s okay. And quite often the curator will say, well, the data you want to bring into my lake, it’s already here. You have colleagues who brought this in for a similar analytics project or something like that. And so let’s see if we can sort of upgrade the data that’s already here, so you don’t bring in a redundant copy. So what’s the problem with redundant copies of data in the lake? Well, you have a lot of data scientists and other analysts who will scan data really broadly. And they may not realize that some of these data sets they’re scanning are actually multiple, but slightly different, copies of the same data set. And the problem with that in analytics is that it will skew the outcome of your analytics.

Philip Russom: 35:25

For example, if you’re doing statistics, because of the redundant data, certain business entities and events and stuff will occur far more often than they did in the real world. And so your statistic based on that redundancy is not gonna be representative of what’s happening in your organization. See what I mean? And similarly with machine learning, as you’re training analytic models, if certain business entities and processes are overly represented, then the model will be skewed towards a world that really doesn’t exist. So you get the idea: with the lake, it can be a lack of documentation, or it could be a matter of redundant data.
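
The skew described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the order records, the `order_id` key, and the helper names `region_share` and `dedupe` are all hypothetical and do not come from the discussion.

```python
from collections import Counter

# Hypothetical order events; the field names and values are illustrative only.
orders = [
    {"order_id": 1, "region": "east"},
    {"order_id": 2, "region": "west"},
    {"order_id": 3, "region": "west"},
    {"order_id": 4, "region": "east"},
]

# Assume a second, slightly different copy of the "east" orders was also
# landed in the lake (same business events, one extra column).
east_copy = [dict(o, source="copy_2") for o in orders if o["region"] == "east"]
lake_scan = orders + east_copy  # what a broad scan of the lake would read


def region_share(records):
    """Return each region's share of the records that were scanned."""
    counts = Counter(r["region"] for r in records)
    total = sum(counts.values())
    return {region: n / total for region, n in counts.items()}


def dedupe(records, key="order_id"):
    """Keep the first record seen for each business key."""
    seen = {}
    for r in records:
        seen.setdefault(r[key], r)
    return list(seen.values())


skewed_share = region_share(lake_scan)          # east is over-represented
true_share = region_share(dedupe(lake_scan))    # east is back to half
```

In this toy data, "east" accounts for half of the real orders but two thirds of the scanned records, which is the kind of over-representation that would carry through to a statistic or a trained model.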

Scott Geissler: 36:06

Interesting. Now from a unification perspective, Philip, obviously the best of all worlds is the lakehouse, right? Unifying the warehouse with the lake and creating the lakehouse. So how do you do that? How do I unify a lake with a warehouse to create that lakehouse?

Philip Russom: 36:27

Yeah, some of it can be similar to what we talked about in terms of unifying any data architecture. So descriptions of data through metadata and a data catalog tend to unify, to give you a unified view. If you have data pipelines that are moving across multiple data sets, then those tend to create a kind of unification at the integration layer. And then also, we haven’t talked about it today, but there are modeling techniques where, instead of having redundant data in multiple data sets, you may have some data in a data set, but the rest of the data is virtual, pulling from some other data set as opposed to making a redundant copy. And then the lakehouse itself helps to unify by getting rid of redundancies. And we already talked about that at length. So with the lakehouse unifying the lake and warehouse halves, in a lot of ways, it’s the same techniques we would use to unify any large architecture.
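
One way to picture the "virtual" modeling technique mentioned above is a database view, which reads from an existing data set at query time instead of storing a second copy. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in; the table `lake_orders`, the view `warehouse_sales_by_region`, and the sample rows are hypothetical, and a real lakehouse platform would have its own mechanism.

```python
import sqlite3

# In-memory database standing in for a shared platform (an assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical refined data set on the lake side.
conn.execute(
    "CREATE TABLE lake_orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO lake_orders VALUES (?, ?, ?)",
    [(1, "east", 100.0), (2, "west", 250.0), (3, "west", 50.0)],
)

# The warehouse-side structure is a view: it pulls from lake_orders
# when queried, so no redundant copy of the rows is made.
conn.execute(
    """
    CREATE VIEW warehouse_sales_by_region AS
    SELECT region, SUM(amount) AS total_amount
    FROM lake_orders
    GROUP BY region
    """
)


def totals(connection):
    """Read the virtual warehouse data set as a {region: total} mapping."""
    rows = connection.execute(
        "SELECT region, total_amount FROM warehouse_sales_by_region ORDER BY region"
    ).fetchall()
    return dict(rows)


before = totals(conn)
conn.execute("INSERT INTO lake_orders VALUES (4, 'east', 25.0)")
after = totals(conn)  # the view reflects the new lake row with no copy step
```

Because the view holds no data of its own, the new row shows up in the warehouse-side totals immediately, and there is no second copy to keep in sync.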

Philip Russom: 37:34

All right, well, let’s look at some platform issues, more of them, with the data lakehouse. And I wanna talk about this really important question that comes up with the lakehouse, and this strikes at the way some people define the data lakehouse. And the question is, do I deploy my data lake and my data warehouse on one instance of one data platform brand? Or do I deploy the lake and warehouse on different data platforms? And this is one of the hot debates going on right now with the lakehouse. A lot of people say, yeah, I think the lakehouse will be beneficial to us, but we have to deal with this architectural question, and you could kind of bring it down a layer and say, well, it’s more of a deployment issue than architecture, but the two are related.

Philip Russom: 38:21

And so a lot of people actually define the lakehouse as single platform, and they will defend that position in every debate and argument. And it’s because, I think they have a good idea behind it, which is that they think the lakehouse benefits are more likely to happen if the lake and the warehouse are on one platform as opposed to distributed across two or more. But let’s be honest—sometimes it’s really because there’s a vendor and they want all your data on their platform and no other platform. So I do see vendors being a little sneaky there, about over promoting the lakehouse. But you know, there are other people who prefer to deploy on optimized platforms and they will put their data lake on a platform that was built specifically for data lakes.

Philip Russom: 39:10

And then they’ll put the data warehouse on a platform specifically optimized for data warehouses. And, you know, if you take technical requirements for your use cases really seriously, it does lead you inevitably to have multiple platforms. And there’s a long history of thinking that way. Are you getting the idea here? It’s really an argument: Do we put the lakehouse on one platform because we think we’re more likely to get the unification and fewer redundancies, or do we put it on two different platforms because that way we’ll have the best possible lake platform and the best possible warehouse platform? And we’ll just live with the redundancy as well as the additional data engineering and data integration you have to do to move data between a distributed lake and warehouse.

Philip Russom: 39:57

So these are important questions people are grappling with at the moment, right? And this is a big challenge in the world of vendor platforms and open source platforms, because you know, there really aren’t that many platforms that are equally good at the lake and the warehouse. See the catch here? So if you’re gonna do a lakehouse on some platforms, you’re gonna have a killer lake, but not such a good warehouse, or you’re gonna have a killer warehouse, but not an enterprise scope lake. See what I’m saying? So some people are willing to make a sacrifice on one side of the balance there. So that’s another thing.

Scott Geissler: 40:35

Philip, in your expert opinion, what’s your expert recommendation concerning deployment on one data platform versus two or more?

Philip Russom: 40:47

Yeah. So you’re putting me on the spot here, so I have to make a decision, and you know, I personally have a standard recommendation, which is, I think in the long run, you’re gonna get more out of two optimized platforms because you’ll just be less limited. So instead of having, you know, half of your lakehouse hamstrung, like I described a minute ago, right? Instead of half of the thing being hamstrung from the beginning, with two optimized platforms, you know, if you choose those wisely, then you’re gonna be able to build the best possible lakes and the best possible warehouses without compromise on either half. But I also know, sometimes it’s not an ideal world, is it? Sometimes we can’t just say, forget about our existing arsenal of tools, if we had all the time and money and we could choose any platforms, you know, we would go with two optimized tools.

Philip Russom: 41:47

So the reality is some organizations have already made commitments, and so sometimes existing commitments will decide where the lake goes and equally where the warehouse goes. So in the real world that does happen. But I do see, you know, I do see a little of everything out there because, especially with my last couple of employers, I was talking to users constantly. So I’ve seen a little of everything. So I’ve seen the lakehouse work either way. I’ve seen the lakehouse work very well on one platform. I’ve seen it work equally well on two or more platforms. So both are certainly possible. And I haven’t really talked about this, but the lakehouse is quite new, it’s only shown up in the last couple of years. The data lake itself is about 12 years old. Warehouses are like 25 to 30 years old. Right. So the lakehouse is very much the newcomer. So we’re still sort of sorting out the best practices and that sort of thing. But I would say the lakehouse is here to stay regardless of how you’re gonna define it. So you might as well embrace it. And if you’re gonna do that, if you can get away with it, go with two optimized platforms, because I think in the long run you’ll have a better lake and a better warehouse.

Scott Geissler: 42:56

Well, that’s great. Thank you, Philip. That’s awesome advice. Well, we’ve come to the end of our time. I must say my head is spinning a little bit. You know, there’s really a lot to modern data architectures, especially those for data warehouses, data lakes, and data lakehouses, and we’re lucky to have had Philip come on here and describe the differences in each for us. We really appreciate you, Philip, and appreciate your time and look forward to our next opportunity to speak. Thank you so much.

Philip Russom: 43:26

Well, it’s been a pleasure to see you again, Scott, working with you again. Thank you very much, and also my special thanks to all the people at ActualTech Media who helped to set this up.

Scott Geissler: 43:35

Thank you. Appreciate it, Philip.
