When it comes to Business Intelligence (BI) products, most of the reporting tools in the space presume that your organization has a data warehouse in place.
A data warehouse is most commonly a relational database with data ingested from transactional systems (also known as online transaction processing, or OLTP, systems) on a regular cadence or even in real time.
The process of ingesting data into a data warehouse is called extract, transform, and load (ETL). ETL processes were traditionally run daily, but modern data demands typically require them to run more frequently—in some cases, even streaming real-time data into the data warehouse.
Business analysts, data engineers, and organizational decision-makers access this data using BI tools like Power BI, Qlik, and Tableau, along with other analytical applications that make use of the data.
Data in a data warehouse is commonly organized into tables within a single database. At the core of most data warehouses is a “star schema,” in which data is organized into “facts” and “dimensions.”
Facts represent quantitative pieces of information, like a sale or a website visit. These fact tables have foreign key relationships to dimension tables. Dimensions are related to the facts and describe the objects in a fact table. A sample star schema is shown in Figure 1.
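The fact/dimension relationship can be sketched with an in-memory SQLite database. The table names, columns, and sample rows below are illustrative, not from any particular warehouse:

```python
import sqlite3

# A minimal star schema sketch: one fact table with foreign keys
# into two dimension tables. Names and values are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, calendar_date TEXT);
    CREATE TABLE fact_sales  (
        product_key INTEGER REFERENCES dim_product(product_key),
        date_key    INTEGER REFERENCES dim_date(date_key),
        sale_amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO dim_date    VALUES (20240101, '2024-01-01');
    INSERT INTO fact_sales  VALUES (1, 20240101, 10.0),
                                   (2, 20240101, 24.5),
                                   (1, 20240101, 10.0);
""")

# A typical dimensional query: aggregate the fact table, and use a
# dimension to describe the result.
rows = conn.execute("""
    SELECT p.product_name, SUM(f.sale_amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
print(rows)  # [('Gadget', 24.5), ('Widget', 20.0)]
```

Note how the query needs only a single join and a simple aggregate—exactly the shape of query the dimensional model is optimized for.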
This data model (also known as a dimensional model) is very efficient for queries that read data—most data warehouses also optimize for the bulk loading of fact data.
Data warehouses are known as online analytical processing (OLAP) data structures, as opposed to OLTP databases. This structure is a trade-off: the denormalized design stores redundant data (keys are stored in both fact and dimension tables), but queries against the dimensional model are simpler (fewer joins), and many metrics are simple aggregates (sum, average, maximum) of columns in the fact tables.
Dimensions, which are loaded less often, are also designed to track changes in data over time—for example, changes to the names of places, customers, or products. This concept, called “slowly changing dimensions,” allows queries to analyze data historically while accounting for changes in descriptions.
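One common approach (often called a Type 2 slowly changing dimension) adds a new row with validity dates instead of overwriting a changed attribute, so historical facts still join to the description that was current at the time. A minimal sketch, with made-up names and dates:

```python
from datetime import date

# Two rows for the same customer: the city changed mid-2022, so a new
# row was added rather than updating the old one. All values illustrative.
customer_dim = [
    {"customer_key": 1, "customer_id": "C100", "city": "Austin",
     "valid_from": date(2020, 1, 1), "valid_to": date(2022, 6, 30)},
    {"customer_key": 2, "customer_id": "C100", "city": "Denver",
     "valid_from": date(2022, 7, 1), "valid_to": date(9999, 12, 31)},
]

def dimension_row_as_of(customer_id, as_of):
    """Return the dimension row that was current on a given date."""
    for row in customer_dim:
        if (row["customer_id"] == customer_id
                and row["valid_from"] <= as_of <= row["valid_to"]):
            return row
    return None

# A sale from 2021 is described by the old city; a 2023 sale by the new one.
print(dimension_row_as_of("C100", date(2021, 3, 15))["city"])  # Austin
print(dimension_row_as_of("C100", date(2023, 3, 15))["city"])  # Denver
```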
While there are other types of schemas used in data warehouses (for example, the snowflake schema), star schemas make up most data warehouse workloads.
Since by its nature the data warehouse aggregates data from a variety of systems, its databases can become very large. Most data warehouses are terabytes in size, and it’s common for them to reach 50 TB or more. This can require a great deal of processing power, so many data warehouse systems use scale-out approaches to improve query performance. On-premises systems include Teradata and Netezza, while cloud offerings include Amazon Redshift, Azure Synapse Analytics, and Snowflake.
These systems all distribute fact data across multiple compute nodes, allowing aggregations to take advantage of massively parallel processing (MPP). While the MPP products draw a lot of attention and marketing dollars, most data warehouses run within a single server.
Terms like database, data warehouse, data lake, and data mart are often used interchangeably, and many products in the space support all of them. Here’s a breakdown of each one.
Most data warehouses are built on top of relational database management systems (RDBMSs), which are commonly referred to as database servers. While some NoSQL solutions are not built on an RDBMS, most data warehouses operate as a database on an RDBMS.
Another source of confusion is using the word “database” to refer only to OLTP systems. This is inaccurate—if your data store has tables and runs on an RDBMS, it is indeed a database.
This distinction gets a little bit more confusing because most modern data warehouse systems support ingesting data from a data lake.
A data lake is typically an object store (think Amazon S3 or Azure Blob Storage) into which substantial amounts of files get written. Unlike a data warehouse, a data lake does not have predefined schemas.
Some data warehousing systems are built around a data lake—Databricks is a good example. Other data warehouses support ingesting data from the data lake through an ELT process. Typically, “external” tables (metadata-only objects that don’t store any data) are created in the data warehouse, and from those external tables, data is ingested into regular tables within the data warehouse.
A data mart is also a dimensional model, but it represents a subset of the total data warehouse. Typically, this data is related to a single line of business, department, or geography. Since a data mart deals with a smaller data volume, it allows the team to run more in-depth queries over the data without impacting other users. The main difference between a data mart and data warehouse is that the scope and size of the data is much smaller in a data mart.
The biggest advantage of having a data warehouse is that it puts all your business’s data in a single data store, providing a single version of the truth.
This is important because reporting against operational systems is challenging in terms of both performance and data quality. While real-time reporting has its uses, reporting queries can block sales transactions, or show pending transactions that may be cancelled. Beyond those challenges, OLTP schemas are designed to optimize writing data rather than reading it.
Having a simplified star schema is good for both query performance and organization—and they are well understood by business intelligence tools, which makes things easier for business analysts.
The structure also provides very good query performance, even as data volumes ramp up. This has been helped by the adoption of column storage for most fact tables, but even older versions of data warehouses performed efficiently.
Data warehouses also allow for including data quality and master data services as part of the data ingestion process, which results in better data quality. All of this means your business users can get faster, better answers, which can lead to actionable insights and a healthier bottom line.
Data warehouses large and small are at the center of most business intelligence architectures. By offloading reporting from operational systems into a read-optimized system that’s easy for analysts and data scientists to query, your business can make better decisions. In addition, the performance optimizations and scalability available through MPP data warehouses mean that no matter how much data you have, the data warehouse can meet your performance needs.
Artificial intelligence (AI) and machine learning (ML) are constantly being used interchangeably in the tech landscape. But are they the same thing?
The answer is no. They both combine machines with the study of cognitive science in an attempt to mimic how the human brain performs thought, learning, and mental calculations. However, they have distinct qualities that make them two separate concepts. Let’s take a look.
AI is the field of computer science that aims to study and create machines that can exhibit human behavior independently, without human control.
ML is a process that allows machines to learn from data without explicit directions, using algorithms. It essentially involves an algorithm—a formula representing the relationship between certain variables—and a model, which is a program that can find patterns or make predictions on new, incoming data.
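That algorithm/model distinction can be made concrete with a tiny sketch. Here the algorithm is ordinary least squares for a single variable, and the model is the fitted function used on new inputs; the data points are made up:

```python
# Made-up training data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]

def fit_line(xs, ys):
    """The 'algorithm': estimate slope and intercept by least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(xs, ys)

def model(x):
    """The 'model': a program that makes predictions on new, incoming data."""
    return slope * x + intercept

print(model(5.0))  # roughly 10 for data near y = 2x
```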
AI is a field of computer science, the same way cybersecurity, software engineering, human-computer interaction, data science, and many more are fields of computer science. ML, on the other hand, is a process in which computers learn, and is a subset of the AI umbrella.
“Weak AI” is AI trained to perform very specific tasks, and is the most commonly used AI in our everyday products. Amazon Alexa, IBM Watson, and Tesla Autonomous Driving Software all fall under this type.
“Strong AI” is mostly theoretical. It possesses the same (or greater) computational capabilities as a human, with self-awareness and the ability to plan for the future. An example of strong AI would be the fictional program Jarvis in Marvel’s Iron Man.
There are many different learning methods for ML algorithms, but they all boil down to three overarching methods.
Supervised learning is when models are trained on labeled input data; the model’s outputs are compared to the target (or desired) outcomes to determine how well the model performed.
Supervised learning has been used to detect spam. Given emails labeled as “spam” or “not spam,” it learns what would be considered spam. When given new emails, the model will then label them as spam or not and will be scored based on accuracy.
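A toy version of that spam filter can be sketched in a few lines: train on labeled examples, then score new emails by which label their words were seen under more often. A real filter would use something like naive Bayes over a large corpus; the tiny training set here is purely illustrative:

```python
from collections import Counter

# Labeled training examples (the supervised part). All text made up.
training = [
    ("win money now", "spam"),
    ("free prize win", "spam"),
    ("meeting agenda attached", "not spam"),
    ("lunch next week", "not spam"),
]

# Count how often each word appears under each label.
word_counts = {"spam": Counter(), "not spam": Counter()}
for text, label in training:
    word_counts[label].update(text.split())

def classify(text):
    """Label a new email by which class its words were seen in more."""
    scores = {label: sum(counts[w] for w in text.split())
              for label, counts in word_counts.items()}
    return max(scores, key=scores.get)

print(classify("win a free prize"))        # spam
print(classify("agenda for the meeting"))  # not spam
```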
Unsupervised learning is used to describe a relationship within data. Instead of aiming for a target, unsupervised learning intakes unlabeled data and determines what relationship all the data elements may have.
Clustering is a form of unsupervised learning. When given data, the model will cluster or group data based on similarities it determines. An example of clustering is its use in grouping retail products. Given product description data, the unsupervised learning model will group the data based on features it determines on its own. It could end up grouping products by color or size or some other feature entirely.
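The mechanics of clustering can be shown with a minimal k-means sketch in one dimension—say, grouping products by weight. The data, the feature, and the cluster count are all made up for illustration:

```python
# Made-up product weights: two natural groups, light and heavy.
weights = [1.0, 1.2, 0.9, 10.0, 11.5, 9.8]

def kmeans_1d(values, k=2, iterations=10):
    centers = values[:k]  # naive initialization: first k values
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to its cluster's mean.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

light, heavy = kmeans_1d(weights)
print(sorted(light))  # [0.9, 1.0, 1.2]
print(sorted(heavy))  # [9.8, 10.0, 11.5]
```

Note that no labels were provided—the two groups emerge from the data itself, which is the defining property of unsupervised learning.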
Reinforcement learning doesn’t train on input data at all. Instead, it’s given a goal to achieve and actions it’s allowed to perform. The feedback received will tell the algorithm if it has reached the goal or not.
You can see an example of this in a game where a computer agent is given the rules of a game, a point system, and a goal to get the highest score. Through its trial and error of moves, it will eventually learn how to achieve the highest score possible.
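The trial-and-error loop above can be sketched with Q-learning, a standard reinforcement learning algorithm, on a toy environment: an agent on a five-cell corridor learns, purely from a reward signal, to walk right toward the goal. The environment, rewards, and hyperparameters are all made up:

```python
import random

random.seed(0)
n_states, goal = 5, 4
actions = [-1, +1]                      # step left, step right
q = [[0.0, 0.0] for _ in range(n_states)]  # estimated value of each action
alpha, gamma, epsilon = 0.5, 0.9, 0.2   # learning rate, discount, exploration

for episode in range(200):
    state = 0
    while state != goal:
        # Explore sometimes; otherwise act greedily on current estimates.
        if random.random() < epsilon:
            a = random.randrange(2)
        else:
            a = 0 if q[state][0] >= q[state][1] else 1
        next_state = min(max(state + actions[a], 0), n_states - 1)
        reward = 1.0 if next_state == goal else 0.0
        # Q-learning update: nudge toward reward plus discounted future value.
        q[state][a] += alpha * (reward + gamma * max(q[next_state]) - q[state][a])
        state = next_state

# After training, the greedy policy in every state is "step right".
policy = ["right" if q[s][1] > q[s][0] else "left" for s in range(goal)]
print(policy)
```

No input data was ever provided—only a goal, allowed actions, and feedback, exactly as the paragraph above describes.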
Within ML, specific subsets focus on different learning designs. The two most well-known subsets are neural networks and deep learning.
Neural networks are a branch of ML whose design mimics how human brain neurons process and pass on information. They’re made up of layers of nodes, where each node holds a weight or value and passes data to nodes in the next layer with little human interference.
They’re usually used to solve problems and detect patterns on unstructured, non-linear, or otherwise complex data.
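The layered-nodes mechanics can be shown with a tiny feed-forward sketch. The weights here are hand-picked (not learned) to compute XOR—a classic non-linear problem a single layer can’t solve—purely to illustrate how values flow through layers:

```python
def relu(x):
    """A common activation function: pass positives, zero out negatives."""
    return max(0.0, x)

def forward(x1, x2):
    # Hidden layer: two nodes, each a weighted sum passed through ReLU.
    h1 = relu(1.0 * x1 + 1.0 * x2 + 0.0)
    h2 = relu(1.0 * x1 + 1.0 * x2 - 1.0)
    # Output layer: one node combining the hidden activations.
    return 1.0 * h1 - 2.0 * h2

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, forward(a, b))  # XOR: 0, 1, 1, 0
```

In a real network, training (such as backpropagation) would find these weights automatically rather than having them hand-set.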
Deep learning is what we call the group of algorithms that use neural networks to solve problems. Specifically, it uses a neural network with at least three layers. In basic ML, humans prepare the data in some way before feeding it to the model.
Deep learning, however, takes in mostly unprocessed data and learns for itself which features are important in making correct predictions.
An amazing example of deep learning is its use in AI-powered chatbots. These chatbots use a combination of natural language processing (NLP) and deep learning to understand human language.
NLP is a branch of AI devoted to teaching machines how to understand text and voice, and to respond on their own, ultimately mimicking human conversation. For AI-powered chatbots, deep learning can be used to assist in NLP goals by helping NLP systems learn the meaning of words in real time.
The main but subtle difference between AI and ML is mostly hierarchy. AI is a study and field of computer science that encompasses any technology that attempts to mimic or surpass human intelligence. ML is a subset of the AI field, and describes a process of how machines can learn patterns and make predictions on data.
The traditional data analytics architecture was pretty straightforward—organizations had a number of operational databases, which were nearly always relational databases, with the occasional file share.
An extract, transform, and load (ETL) process would sweep data from those systems as part of a batch process. The batch would typically run overnight, but in some cases more frequently—for instance, every four hours. The end result of that ETL process would be a star schema consisting of fact and dimension tables in a data warehouse, where business users ran predefined reports.
The data warehouse would be another relational database, and most of the reports would be defined as stored procedures within that database.
Traditional data warehouse architecture is still good—in fact, star schemas are in the middle of a number of very modern data analytics architectures. However, a couple of factors have dramatically increased the volume of data—mobile phones and social media being the two biggest technical factors.
Whether it’s ingesting data from mobile applications or using social media streams for sentiment analysis, both the volume and the velocity of inbound data changed dramatically in the 2010s. The popularity of Big Data led to the adoption of systems like Hadoop, and the subsequent development of Apache Spark, which has become a mainstay of the modern analytics architecture.
By contrast, traditional ETL processes struggled to manage the volume of these sources, and businesses also demanded closer to real-time reporting—it was no longer enough to see yesterday’s sales.
The other factor that changed everything was the broad adoption of cloud computing. One of the earliest features in the public cloud was object-based storage—in the form of AWS S3 and Azure Blob Storage—which allows files to be exposed as HTTP endpoints and nearly infinite amounts of data to be stored.
Object storage is at the center of most modern “data lake” architectures, as it allows data to be ingested quickly at large scale. Unlike the traditional ETL process, data in object storage is typically subject to “schema on read,” or extract-load-transform (ELT) processes, in which the transformation happens as the data is ingested into its persistent store. This is typically a cloud data warehouse, which comes in a number of forms.
As data volumes grow, scaling a data warehouse out is one of the better approaches for supporting those volumes—memory is a big constraint, and there are limited amounts of memory that will fit into a single server.
Before the advent of the cloud, scale-out—or, as they are specifically known, massively parallel processing (MPP) data warehouses—involved buying and licensing an expensive set of hardware from Oracle, Teradata, Microsoft, or other vendors in the space.
Now the cloud offers the ability to rent that hardware at small, large, or even changing scale. Services like Amazon Redshift, Snowflake, and Azure Synapse Analytics all make that possible. And there are other vendors with similar offerings built on non-traditional engines—Databricks, for example, is built on top of Spark.
It’s not so much that these services are cheap—the top SKU of Azure Synapse Analytics costs $300,000 per month, although the cost of entry is much lower—but that they’re more cost effective.
This is especially true of upfront costs, which can be orders of magnitude less with the cloud. Instead of buying a rack of servers and storage, you can pay a cloud vendor a few hundred dollars a month and get started with something that would have taken millions of dollars and multiple months, or even years, to get started with in the pre-cloud world. On top of that are the (often) hundreds of thousands of dollars of services costs typically associated with an on-premises data warehouse implementation.
Another key technology change that allowed for large data volumes while simultaneously improving overall performance is the proliferation of columnstore storage for data warehouse workloads.
While the implementations differ slightly between databases, the technology basics are the same. Typically, a database stores data on pages (depending on the RDBMS, these are somewhere between 4 KB and 1 MB, though generally smaller). Each page holds rows of the table across all columns, which allows for easy single-row updates and inserts.
Since these pages contain data from many different columns, when you apply compression to tables and indexes, the compression rate is typically around 30%. Also, both the number of columns and the choice of those columns have to be carefully considered for indexes in order to balance read and write performance. This makes traditional database indexing both an art and a science.
However, data warehouses don’t typically do a lot of single-row operations—inserts and updates take place in batch operations. Columnstore indexes store their data grouped by columns. Since values within a column have much more in common (for example, the number of customers with a state name of “California”), this allows for much greater levels of compression.
Additionally, each column effectively becomes its own index. This means columns not referenced in a query are simply not read, reducing the overall IO impact of any given query and allowing much more efficient usage of memory. This concept even applies to data lakes, as Parquet files use a similar storage construct.
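The compression intuition can be demonstrated in a few lines: values within a single column (here, a state name) repeat far more than values across a whole row, so even simple run-length encoding collapses them. The table and values are made up; real columnstores use more sophisticated encodings, but the principle is the same:

```python
from itertools import groupby

# A tiny row-oriented table: (id, name, state). Values illustrative.
rows = [
    (1, "Ann",  "California"),
    (2, "Bob",  "California"),
    (3, "Cara", "California"),
    (4, "Dan",  "Texas"),
    (5, "Eve",  "Texas"),
]

# Column storage pivots the rows so each column is stored contiguously.
state_column = [state for _, _, state in rows]

def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(v, len(list(group))) for v, group in groupby(values)]

print(run_length_encode(state_column))
# Five values stored as just two runs: ('California', 3), ('Texas', 2)
```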
There are a number of aspects beyond cloud data warehouse and object storage that make up most modern analytics platforms (Figure 1).
One key difference is that ETL developers have largely been replaced by data engineers, who perform a similar function of distilling raw data into useful information. These engineers, though, are expected to support more data sources and have access to APIs.
The other role that has changed the way data is consumed in enterprises is the data scientist. They consume datasets that are typically produced by data engineers, and perform experiments with various machine learning techniques.
Data volumes continue to increase over time, for a number of reasons. These include more disparate data sources, more sensors for data collection (for instance, Internet of Things (IoT) devices), as well as longer data retention times.
The adoption of both columnstore and massively parallel processing systems, and the lower cost of entry for organizations of all sizes, has led to the wholesale adoption of modern data analytics technologies.
Beyond the data warehouse, Spark allows more modern data paradigms like streaming and machine learning.
The broad adoption of the public cloud, and the fact that the volume and velocity of data under analysis continues to grow across all industries, means that these modern analytics paradigms will continue to grow and evolve.
The general trend of this evolution has been that even as system complexity and power have continued to grow, deployment has become easier. That doesn’t often happen in the IT world.
Self-service business intelligence (SSBI) has long been the holy grail of data warehouse architectures. Many business intelligence software vendors like Tableau, Qlik, and Microsoft (Power BI) like to make SSBI appear to be as easy as getting data into a data model and handing it off to business analysts.
The reality is that delivering a data set that allows business users to build their own reports and dashboards requires a great deal of planning, architecture, and governance.
Figure 1 provides a high-level view of SSBI design. There are many variations on this theme, but at its core, business intelligence gets data from source systems that provide operational functionality.
It then uses a process of extraction, transformation, and loading to ingest that data into a reporting-optimized pattern in a data warehouse (which is traditionally another database itself), called a star schema.
The star schema is made up of facts and dimensions—a fact table, for example, would house individual sales data containing keys referencing products, dates, and geography information. These keys can be joined to dimension tables further describing those properties.
While getting this data into a data warehouse allows BI developers to easily build reports on request without impacting source systems, data in a data warehouse isn’t usually at a point where business users can start using it.
For that to happen, a semantic layer is created, which helps translate esoteric column names like ProductID and EnglishProdNm into terms like Product SKU and Product Name that business users can more easily understand.
This layer is also used to help define key metrics and calculations for important business questions, which allows a consistent definition for those questions. When the semantic layer is correctly defined, it allows business users to quickly and easily drag-and-drop data elements into a reporting tool and be assured that the data is correct.
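At its simplest, that translation is just a mapping from physical column names to business-friendly names, applied whenever data is exposed to analysts. The column names follow the examples in the text; the mapping itself is a hypothetical sketch:

```python
# The semantic layer as a rename map: physical name -> business name.
semantic_layer = {
    "ProductID": "Product SKU",
    "EnglishProdNm": "Product Name",
}

def to_business_view(record):
    """Rename physical columns to their business-friendly equivalents."""
    return {semantic_layer.get(col, col): value
            for col, value in record.items()}

raw = {"ProductID": "SKU-042", "EnglishProdNm": "Widget"}
print(to_business_view(raw))
# {'Product SKU': 'SKU-042', 'Product Name': 'Widget'}
```

Real semantic layers (in tools like Power BI or Tableau) also carry relationships, metric definitions, and calculations, not just renames—but the renaming idea is the core of it.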
Another goal of the semantic layer is reducing complexity within the data to simplify reporting for business users. Those SSBI users shouldn’t have to understand the difference between inner and outer joins or the basics of SQL syntax.
That simplification can be done through things like dimensional modeling, ensuring data consistency, and making sure the SSBI users can’t accidentally bring down the data warehouse with a couple of bad reports.
And those users shouldn’t have to be concerned with the complexities of your dimensional model. You should work to simplify the presentation of your physical data sources within the self-service model to make life easier for them.
While having a semantic layer and handing off reports to customers is critical, the most important part of any reporting effort is data governance. Data governance includes rules about how data is sourced, cleansed, managed, distributed, and used in your BI systems.
Data governance can also include a master data list. This helps resolve data conflicts of the type that can happen when, for example, a customer address is entered two different ways in two different systems.
It can be hard to kick off a data governance effort, especially if the company is undergoing a business intelligence project. Don’t let those challenges deter you, though—the value of governance is in ensuring that data can be trusted.
Having trusted data means better data and easier reporting. Having quality data that’s consistent will help increase the adoption of your SSBI effort, because analysts will be more apt to use those tools, instead of relying on their own legacy Excel spreadsheets complete with esoteric formulas that aren’t centrally managed.
Beyond the benefits for your business intelligence reporting, data governance has many benefits around security and metadata management that can help your organization become more data focused.
Building all the best tools doesn’t mean that your users will immediately find them intuitive and easy to use. The most commonly used business intelligence tool is still Excel—many business and finance users are most comfortable using spreadsheets, and will take some coaxing to use more modern BI tools.
Your users need training to get there, of course. But it goes beyond just learning new tools they’re using—they need to learn how the data works, so they can start building and expanding their own models as they get more comfortable working with their own data.
Training users isn’t as simple as running a one-time weeklong training class—you need to mentor users and ensure they’re getting the benefits of the tools and processes in place.
Another consideration is that most SSBI tools provide metadata on usage. It’s important to understand how, and how often, data is being used within the organization. This facilitates gauging adoption and focusing mentoring efforts where utilization is lower.
SSBI vendors frequently sell the ease of use of their tools and platforms without mentioning the process complexity that’s involved in building those data models.
It’s important for both IT and business organizations to understand the significant effort level involved in a successful SSBI project. Beyond just implementing a data warehouse and a semantic layer, you need to ensure data is consistent and correct throughout your systems, which also means undergoing the frequently painful process of a data governance project.
While data governance is hard, it pays infinite dividends that go far beyond the impact of making business intelligence much easier. But you won’t get there unless you ensure that your users have the training they need to be successful, and are using the tools as intended.