What is a data lake?

One doesn't go far into a conversation about “modern BI” these days without someone inevitably bringing up the topic of the data lake. Some in the business intelligence (BI) world are excited about this relatively new style of data repository because of its big data capabilities, while others lambast it for its drawbacks, most commonly the lack of data governance.

As today’s BI and analytics landscape is increasingly making use of the data lake, it’s a good idea to familiarize yourself with the concept and its advantages and disadvantages.

What Is a Data Lake?

Like a data warehouse, a data lake is a repository for data. As far as data storage goes though, that’s about where the similarities stop. Data within a data warehouse is processed and structured. A data lake holds structured data too, but it also holds semi-structured, unstructured, and raw data.

There are various data sources in a variety of structures that can feed into a data lake in the same way that rivers, streams, and other tributaries feed into an actual lake. And just like a real lake, the unstructured nature of a few sources can turn the lake into a chaotic swamp. When analyzing data, users can take small samples or dive in and explore as much of the data as they want.

A data lake attempts to solve the problem of data silos. Instead of dozens of strictly managed, separate data collections, a data lake pools everything together. This promotes increased use and sharing of data. It also cuts the costs of server licensing.

How Is a Data Lake Different From a Data Warehouse?

Beyond the general structural differences, a data lake and a data warehouse differ in several other ways, including the data types they support, speed, usability, and flexibility.

Data Types

In a data warehouse, data is carefully considered and structured before being pulled in. This is known as a “schema on write” approach to data storage. A data lake, however, takes all data in its original form. That includes data that would be useful to analyze today, data that might be useful in the future, and data that may never be analyzed at all.

Every data type is supported, including non-traditional data types like text, images, social media content, and web server logs that you won't find in a data warehouse. This is possible because, as I mentioned above, a data lake maintains data in its raw format and only transforms it when it is ready to be analyzed. This approach is known as “schema on read.”
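To make the contrast between “schema on write” and “schema on read” concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular warehouse or lake product works: the sample events, the column names, and the function names are all hypothetical. The warehouse-style loader shapes and validates each record before storing it and rejects anything that doesn't fit. The lake-style loader stores every line untouched and applies a schema only when a question is asked.

```python
import json

# Hypothetical raw events as they might arrive from different sources.
RAW_EVENTS = [
    '{"user": "ann", "amount": "19.99", "channel": "web"}',
    '{"user": "bob", "amount": 5, "note": "gift card"}',
    "GET /index.html 200",  # a web server log line, not JSON at all
]


def load_schema_on_write(lines):
    """Warehouse style: shape and validate each record BEFORE storing it."""
    table, rejected = [], []
    for line in lines:
        try:
            record = json.loads(line)
            row = {
                "user": str(record["user"]),
                "amount": float(record["amount"]),
                "channel": str(record["channel"]),
            }
        except (ValueError, KeyError, TypeError):
            # Anything that does not match the fixed schema never gets in.
            rejected.append(line)
            continue
        table.append(row)
    return table, rejected


def load_schema_on_read(lines):
    """Lake style: keep every line exactly as it arrived."""
    return list(lines)


def total_amount(lake):
    """Apply a schema only at analysis time, skipping lines that don't fit."""
    total = 0.0
    for line in lake:
        try:
            total += float(json.loads(line)["amount"])
        except (ValueError, KeyError, TypeError):
            continue
    return total


if __name__ == "__main__":
    table, rejected = load_schema_on_write(RAW_EVENTS)
    print(len(table), "rows stored,", len(rejected), "rejected")
    lake = load_schema_on_read(RAW_EVENTS)
    print(len(lake), "raw lines kept; total amount:", total_amount(lake))
```

In this sketch the warehouse-style loader drops the second event because it lacks a "channel" field, even though its amount is perfectly usable. The lake keeps it, so a later analysis that only cares about amounts can still include it.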

A data warehouse only stores data. Data that needs to be analyzed is taken from the cubes on top of the data warehouse that process it in a highly structured format. A data lake, however, processes data in its raw format. Whichever form it comes in is how it will be analyzed before it goes out.

Speed

Processing, cleansing, and transforming data for a data warehouse solution design takes time. Because this step is eliminated in a data lake, users have instant access to the data they want to analyze. Information Designers can quickly configure, re-configure, and otherwise experiment with data on the fly for powerful ad-hoc purposes.

This type of agility isn’t for everyone though. Not everyone wants or has the proper skills to get their hands dirty with data exploration. And the very nature of raw data means that data governance is essentially non-existent. Data governance is the responsibility of the users, who should employ tactics such as a closed-loop system or sandbox analytics. Without this, the data lake risks becoming a mess of disconnected silos and unusable data.

Usability

Data warehouses are extremely powerful. In principle, they're designed to make it easy to link data across various dimensions. However, they can also be extremely cumbersome. Among the various types of users who utilize BI on a daily basis, only the highly technical Information Designers can get under the hood and make changes to a data warehouse.

A data lake, however, is much more agile. Information Designers can fully immerse themselves in the large and varied data sets they need, while non-technical Business Users can pick and choose from the more structured data sources within the data lake. The structured data is easily ordered and processed within the data lake, resulting in an output of analyzed data that users can quickly sift through to gain insight.

Flexibility

By definition, a data warehouse is highly structured. While this makes it a powerful storage option, it makes changes within the data warehouse difficult. Therefore, the biggest benefit of the de-normalized data warehouse is also its flaw. Any work done within a data warehouse falls to a highly skilled Data Scientist or Information Designer. Ad-hoc analytics are impossible with just a traditional data warehouse structure, as any new data has to first be folded into an appropriate cube.

That’s why the increasing demand for self-service BI makes a data lake highly attractive. Users are empowered to utilize and experiment with data outside the data warehouse and don’t have to wait for IT to find time for their requests. That’s not to say the flexibility of the ungoverned data lake doesn’t come with a toll. Don’t forget that unstructured data can quickly lead to chaos for those who don’t know what they’re doing, and even for those who do.

Ready to Dive In?

At TARGIT, we’ve been proponents of self-service BI since its inception. After all, what good is BI if it doesn’t put the power of data discovery in the hands of every decision-maker?

When considering data lakes and data warehouses, it doesn’t have to be an either/or decision. Why not go bimodal and harness the power of both?

A data lake is a low-cost alternative for data storage for companies that want to utilize external data. The data lake can pull directly from hundreds, if not thousands, of external data sources and serve as a dumping ground until that data is pulled into the front-end business intelligence system. This makes the process significantly faster. Data lakes also encourage self-service data discovery. All of this, combined with the structure and security of a data warehouse, makes for unrivaled access to actionable insight.

Newer tools, such as TARGIT’s Data Discovery module, have emerged that make it possible to bridge the gap between the data warehouse and outside data sets such as the data lake. With this, users can blend data outside the data warehouse with data within it, making it possible to experiment with and prototype data outside the data warehouse.

Data Discovery comes with native connections to dozens of data sources, including Hadoop’s data lake, that can be combined with each other and with data inside the data warehouse in just a few clicks.
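For readers who want to see what blending looks like in principle, here is a generic sketch in Python. It does not use or represent TARGIT’s Data Discovery module or its connectors. An in-memory SQLite table stands in for the data warehouse, a few raw JSON lines stand in for files in a data lake, and every table, field, and function name is hypothetical. The idea is simply to aggregate the raw data at read time and attach it to the governed rows from the warehouse.

```python
import json
import sqlite3


def build_warehouse():
    """Stand-in for a warehouse: a small, strictly typed sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product_id TEXT, revenue REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("A1", 1200.0), ("B2", 450.0)],
    )
    return conn


# Stand-in for raw files in a lake: one JSON document per line.
LAKE_REVIEWS = [
    '{"product_id": "A1", "stars": 5}',
    '{"product_id": "A1", "stars": 3}',
    '{"product_id": "C3", "stars": 4}',  # product unknown to the warehouse
]


def blend(conn, raw_reviews):
    """Attach the average review score from the lake to each warehouse row."""
    stars_by_product = {}
    for line in raw_reviews:
        review = json.loads(line)
        stars_by_product.setdefault(review["product_id"], []).append(review["stars"])

    blended = []
    query = "SELECT product_id, revenue FROM sales ORDER BY product_id"
    for product_id, revenue in conn.execute(query):
        ratings = stars_by_product.get(product_id, [])
        avg_stars = sum(ratings) / len(ratings) if ratings else None
        blended.append(
            {"product_id": product_id, "revenue": revenue, "avg_stars": avg_stars}
        )
    return blended


if __name__ == "__main__":
    for row in blend(build_warehouse(), LAKE_REVIEWS):
        print(row)
```

The warehouse rows drive the result here, so the review for the unknown product "C3" is left out and product "B2" gets no score. A real prototype would need to decide how to treat such mismatches, which is exactly the kind of governance question raised earlier.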