What Is a Data Lake?

Author

Ted Cuzzillo

Industry Analyst and Journalist, datadoodle.substack.com

A data lake is a centralized repository where you store structured or unstructured data without having to transform it first. In other words, you simply store it as-is and use it for data analysis. You can store any type or volume of data in full fidelity, whether the data comes from on-premises, cloud, or edge-computing systems. This provides tremendous flexibility in storage on a scalable and secure platform.

A data lake allows you to process data in real time or in batch mode and to analyze that data using SQL, Python, R, or another language, as well as third-party data or analytics applications.
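As a rough illustration of "store as-is, analyze later," the sketch below lands raw bytes unchanged in a directory tree and only afterward loads one file into SQL for analysis. It is a minimal example under stated assumptions, not a description of any vendor's product: a local temporary directory stands in for the lake's object storage, and the `raw/<source>/<filename>` layout, the `land_raw` and `total_sales_by_region` helpers, and the sample data are all hypothetical.

```python
import csv
import json
import sqlite3
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, source: str, filename: str, payload: bytes) -> Path:
    """Write the payload unchanged under <lake_root>/raw/<source>/<filename>."""
    target = lake_root / "raw" / source / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing, cleansing, or schema check
    return target


def total_sales_by_region(csv_path: Path) -> dict[str, float]:
    """Load a raw CSV into an in-memory SQLite table and aggregate with SQL."""
    with csv_path.open(newline="") as handle:
        rows = [(r["region"], float(r["amount"])) for r in csv.DictReader(handle)]
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        result = conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
        ).fetchall()
    finally:
        conn.close()
    return dict(result)


def demo() -> dict[str, float]:
    """Land one structured and one semi-structured file, then query the CSV."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        raw_csv = b"region,amount\nwest,10.5\neast,4.0\nwest,2.5\n"
        raw_json = json.dumps({"ticket": 17, "note": "asked about refunds"}).encode()
        csv_path = land_raw(lake, "pos", "sales.csv", raw_csv)
        land_raw(lake, "support", "ticket-17.json", raw_json)
        # The stored bytes are identical to what the source system produced.
        assert csv_path.read_bytes() == raw_csv
        return total_sales_by_region(csv_path)


if __name__ == "__main__":
    print(demo())
```

The structure (a table with typed columns) is imposed only at analysis time. The support ticket sits in the same store untouched until someone needs it.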

The Benefits of a Data Lake

There are several key reasons organizations choose data lakes.

Lower TCO

Storing data in a data lake typically costs significantly less than storing it in a database, which often requires complex infrastructure and filtering. With a data lake, you pay mainly for the storage space you use. For organizations that store massive amounts of data, this can reduce the total cost of ownership (TCO).

Companies don’t need specialized hardware for a data lake, and data sets don’t need to be indexed and prepped before storage.

Simplified Data Management

When organizations deploy data lakes, they can eliminate the data silos that often exist. With a central repository, companies can also avoid the challenge of having to move data between data warehouses and data centers.

Used as a separate layer, a data lake can mitigate the costly egress fees – fees paid to transmit data out of a cloud platform – required to move data from one cloud provider to another for processing.

Flexibility

Traditional data warehouse platforms are schema-based, meaning that your data has to be stored in a specific format. When you set up your database, you have to decide its structure. Before data is added, it has to be cleansed and standardized. If new data types or formats are added, you may have to rebuild the database to accommodate them. With a data lake, you have flexibility because you can store any type or format of raw data.

Data that doesn’t fit squarely into a database slot, such as social media posts, customer support notes, images, sensor data, and other unstructured data, can still be available for data scientists.
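The contrast with a schema-based warehouse is often described as "schema on read": records are stored raw, and a structure is applied only when someone reads them. The sketch below is one simple way to picture that, assuming raw events stored as JSON lines. The sample events, the `ORDER_SCHEMA` mapping, and the `read_with_schema` helper are hypothetical names invented for this illustration.

```python
import json

# Raw events as they might arrive: fields and types vary from record to record.
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.99", "channel": "web"}',
    '{"user": "bo", "amount": 5}',
    '{"user": "cy", "comment": "love the new app"}',
]

# Hypothetical read-time schema: field name -> (converter, default value).
ORDER_SCHEMA = {
    "user": (str, ""),
    "amount": (float, 0.0),
    "channel": (str, "unknown"),
}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project each record onto the given schema."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: convert(record[name]) if name in record else default
            for name, (convert, default) in schema.items()
        }


orders = list(read_with_schema(RAW_EVENTS, ORDER_SCHEMA))
```

Nothing was rejected or rebuilt when the third record arrived with an unexpected `comment` field. A different consumer could read the same raw lines with a different schema that keeps it.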

Data Democracy

A data lake makes data available across an enterprise to any authorized user. This allows companies to build a more data-centric culture and provide middle managers and frontline employees the data access they need to make better decisions.

Accelerate Data Analytics

Data warehouses generally rely on SQL, which may be fine for simple analytics. However, more advanced data analytics may require more flexibility in how data is accessed and processed. Data lakes allow for more options. For example, you can store data from multiple sources in multiple forms as raw data for data scientists.

Data is then prepared for applications such as artificial intelligence and machine learning, predictive analytics, and data mining.

Challenges with Data Lakes

While there are plenty of benefits, data lakes also present challenges that organizations need to overcome.

Data lakes make it easy to save everything. While storage is more cost-efficient in a data lake, it’s still an expense. Further, constantly evolving and generally complex legislation around the world is adding new challenges associated with storing certain kinds of potentially sensitive data. Organizations need to put parameters on what should be saved. This requires data governance policies.

Data Governance

While data doesn’t require cleaning or transforming to be stored in a data lake, an organization still needs strong governance to ensure data quality. This includes:

  • Policies and standards
  • Roles and authentication
  • Data processes
  • Data management

Data governance covers every aspect of storing and securing enterprise assets to ensure quality and accountability.
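To make the idea of governance policies a little more concrete, here is a minimal sketch of a check that could run before a data set is admitted to the lake. The rules it enforces (every data set needs an owner, a known classification, and a positive retention period, and sensitive fields require the "restricted" classification) are assumptions chosen for illustration, as are the `DatasetEntry` and `policy_violations` names. Real policies would come from an organization's own standards and legal requirements.

```python
from dataclasses import dataclass, field

# Hypothetical policy values; a real organization would define its own.
ALLOWED_CLASSIFICATIONS = {"public", "internal", "restricted"}
SENSITIVE_FIELDS = {"email", "ssn", "date_of_birth"}


@dataclass
class DatasetEntry:
    """Catalog metadata recorded alongside a raw data set."""
    name: str
    owner: str
    classification: str
    retention_days: int
    fields: list[str] = field(default_factory=list)


def policy_violations(entry: DatasetEntry) -> list[str]:
    """Return a list of policy problems; an empty list means the entry passes."""
    problems = []
    if not entry.owner:
        problems.append("missing owner")
    if entry.classification not in ALLOWED_CLASSIFICATIONS:
        problems.append(f"unknown classification: {entry.classification}")
    if entry.retention_days <= 0:
        problems.append("retention period must be positive")
    sensitive = SENSITIVE_FIELDS.intersection(entry.fields)
    if sensitive and entry.classification != "restricted":
        problems.append(
            "sensitive fields require 'restricted': " + ", ".join(sorted(sensitive))
        )
    return problems


clickstream = DatasetEntry("clickstream", "web-team", "internal", 365, ["url", "ts"])
crm_export = DatasetEntry("crm_export", "", "internal", 0, ["email", "name"])
```

Here `policy_violations(clickstream)` returns an empty list, while `policy_violations(crm_export)` reports three problems. The data itself is never cleaned or transformed. Only the metadata around it is checked.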

Major Data Lake Providers

The major data lake providers include the leading players in the cloud service provider (CSP) space, such as:

  • Amazon Web Services (AWS)
  • Microsoft Azure
  • Google Cloud Platform
  • Hewlett Packard Enterprise (HPE)
  • IBM Cloud Computing
  • Oracle
  • Snowflake

As you can see, data lakes have significant advantages in modern environments. But knowing their strengths and weaknesses before investing a dime is crucial to getting the most out of them. Make sure they’re a good fit for your environment and business objectives first.

About the Author

Ted Cuzzillo is an industry analyst and journalist with more than 20 years’ experience explaining, analyzing, and researching the often fraught interface between data technology and those who use it. To paraphrase Lance Armstrong, who says “It’s not about the bike,” Ted says, “It’s not about the computer.” His current research focuses on business analysts, including the tools they use, the roles they play, and their careers. His weblog and podcast are at datadoodle.substack.com, and his geographic home is the San Francisco Bay Area.
