A marketing executive walked into her organization’s IT group and told the first person she saw that she needed help. All she wanted, she said, was to export her WordPress site’s subscriber data into an Excel spreadsheet. Simple.

That task wasn’t simple at all, not with the scripts someone in her business unit had his young son install. The IT group’s manager tells the story and adds in a surprising burst, “I hate shadow IT!” 

Business units resort to shadow IT when their organization’s central IT departments say there’s no budget, no time, and, frankly, no interest. “Wait,” says IT.

But these desperate units can’t wait—they’ve got businesses to run. So semi-skilled techs within the group do what they can, which is plenty, thanks to the wave of ever easier-to-use solutions.

That might be fine, except for the inevitable disruption, such as the marketing executive’s WordPress site. Even worse, it exposes entire organizations to ransomware’s hot breath.

What could possibly resolve this dilemma?

A slew of academic research has pondered shadow IT, and most researchers seem to come out of the “I hate shadow IT!” corner. Judging from their tone, they view shadow IT systems like a basement nest of rats. Solutions amount to IT’s equivalents of poison, traps, and predators.

A friendlier tone came from a recent Gartner Inc. webinar, “CIO Strategy: Manage the Opportunities & Risks of Democratized Business-Led IT.” One slide read, “CIOs can no longer ‘own’ the entire organizational technology estate.” Correct. “And there is danger should they try to!” Agreed. “Half of business unit leaders with technical capacity think they have a better understanding of requirements they need than the IT organization.” Probably so, and perhaps for good reason.

The Rebel

One veteran of data management consulting comes out of shadow IT’s corner swinging hard. “How can you be angry at shadow IT?” said Evan Levy, partner with IntegralData and previously of SAS Analytics. For more than 30 years, he’s been one of the IT industry’s most sought-after data management consultants.

“You can be angry at people breaking certain types of data guidelines,” he said, “but that’s probably because you haven’t invested in governance. You haven’t invested in frameworks for people to be more self-sufficient.”

The crux of the problem, he said, is that IT departments have often failed to create new business models that respond to evolving needs. IT is usually so focused on the old model that it ends up not supporting the business.

Business units are justified in telling IT, “I’ve got a business responsibility to address. I’m not here to conform to your standards. You need to conform to mine!”

In Levy’s view, IT has always been in the role of putting itself out of business. Its original mandate, he said, was to reduce costs by automating business processes. It succeeded. But in the decades since, things evolved away from mass automation and into custom business process enablement. So, while IT still runs the big automation systems, newer technologies and the business needs they serve have outrun IT.

Central IT isn’t even structured to serve those newer needs, he said. It’s typically managed by budgets, not customer initiatives.

“IT opened up Pandora’s box when they couldn’t respond to the needs that users had,” Levy explained. They can’t do a one-off report or load a one-off data source for a one-off user.

One particular peril of the shadow IT vs. central IT struggle is the rising chance of weakened security. Business unit techs are often naive about it. The IT department is, of course, shocked. But sometimes even IT departments themselves fail.

The On-Call Consultancy

At least one veteran consultant who straddles the shadows and central IT’s bright lights finds that both sides in many organizations need improvement.

Dan Murray, director of strategic innovation at tech consultancy InterWorks, said that old-school thinking is still pervasive. The IT group, especially in midsize organizations, may have three people, and they’re all the kinds of guys who could fix the printer 10 years ago. “You’ve got to stay up with tech. It moves.”

Even worse, some central IT departments fail to appreciate security’s urgency. Among the common myths he hears is that the company is too small to be a target.

“You think you’re a small backwater?” said Murray. “You’re the ideal target. If I was in that business, I’d be going after you first because you’d be easy pickings, and you’d probably pay. And we’d be sensitive to how much you could afford.”

If central IT is close to outliving its era, and business units increasingly find themselves forced to serve themselves, the happy resolution points away from any central IT group and away from shadow IT. Instead, IT requirements might be met best by a group of consultants who can help as needed.

The IT manager who “hates shadow IT” and the marketing executive with the WordPress site are far from unique. Everyone seems to hate shadow IT, but each side hates it for different reasons. Without central IT’s help, business units resort to do-it-yourself solutions outside their expertise.

And with those shadow systems in place, central IT has a harder time producing one-size-fits-all solutions.

The best scenario is a transformation of IT into a service bureau. Another solid option is to hire an agency, similar to what Murray describes. Whichever way a company decides to go, the key is to shine the light on all shadow IT in your organization, and remove every trace of it.

The global big data analytics market is expected to reach $68.09 billion in annual revenue by 2025. It has become one of the most compelling technology trends reshaping business processes and operations. The category is now a crucial resource across the public and private sectors, including industries such as healthcare.

Organizations can now track and analyze business data in real time, thanks to cloud-based applications. This allows for quicker turnaround time for adjustments. As we move into 2023, now is the time to anticipate, adapt, and scale your approach to big data by preparing for the coming year’s top trends in data and analytics.

Trend #1: Digital Transformation

Digital transformation pushes technology forward from every direction as we move deeper into the Internet of Things (IoT), machine learning, and big data. With the number of IoT devices expected to grow from 9.7 billion in 2020 to 29.4 billion in 2030, it’s easy to see why big data will play a pivotal role.

Artificial intelligence (AI) will take on a bigger share of that processing, making sense of the vast amounts of data we’ve accumulated in years past. We’ll continue to find new applications that project insights about how we can grow in the future.

Trend #2: Business Intelligence

Business intelligence (BI) will continue moving into every industry, affecting both strategic and tactical business decisions. BI leverages software and services to turn data into actionable insights about the state of the business.

With the global BI and analytics software market expected to grow to $18 billion by 2025, growth may come from unexpected places. Collaborative BI can help teams unearth insights and make decisions without ever leaving the BI platform, making data more accessible and giving non-technical users an easy way to connect with the results.

Trend #3: Cloud Technology

More than 70% of companies have migrated at least some of their workloads to the public cloud, with even greater cloud adoption predicted in the next few years. A McKinsey survey found that companies averaged 23% over budget on cloud spending, and estimated that 30% of their outlays were wasted.

Cloud-native technologies are the only sustainable solution for our ever-changing business and IT infrastructure. They’re complicated, though, requiring standardized, self-service tools that let even non-technical users better collect, analyze, and interpret data and usage.

Trend #4: Data as a Service (DaaS)

Data as a Service (DaaS) has reached a point where even the smallest players can easily get into the game and generate revenue from it. With growth predicted to approach $11 billion in revenue by 2023, it’s expected to allow extreme niches to find value in their data.

If you have a company whose data could be of value to others, this could become a nice revenue stream. Now is the time to rethink what you’ve offered in the past, and find new ways to be relevant in the coming world.

Trend #5: Health and Wellness

More businesses are taking an active role in health and wellness. Examples include medical practices that are finding new ways to communicate with patients, and companies’ efforts toward identifying strategies to help keep workers healthy.

This trend can lessen the burden on our current medical system. The U.S. National Health Expenditure grew to $4.1 trillion in 2020, or $12,530 per person, accounting for 19.7% of GDP. By continuing to accumulate medical data from across the planet, companies can use that data in new ways to identify cures faster than ever before.

Trend #6: Driverless Technology

While autonomous driving has been talked about for years, we’re finally getting closer to making it a reality. Waymo, Alphabet’s autonomous vehicle unit, now operates self-driving ride-hailing services in both Phoenix and San Francisco. Walmart has run an autonomous vehicle program in Arkansas since 2020, registering more than 70,000 miles.

As businesses continue using data derived from these early adopters, the only place to go is up. Big data will continue to help further define commuter transportation management programs, which should shed light on how we’ll continue to move in the future.

Data Is at the Heart of It All

Data analytics is the heart of all future development.

As businesses race to keep pace with the significant developments in digital transformation, one thing is clear: business intelligence will play a key role in future enterprise growth.

Businesses, big and small, that harness new technologies as they enter the market, utilizing essential self-service analytics tools and big data, will be better equipped to predict trends, identify business growth opportunities, and enhance profitability.

A data lake is a centralized repository where you store structured or unstructured data without having to transform it first. In other words, you simply store it as-is and use it for data analysis. You can store any type or volume of data in full fidelity, whether the data comes from on-premises, cloud, or edge-computing systems. This provides tremendous flexibility in storage on a scalable and secure platform.

A data lake allows you to process data in real-time or batch mode and analyze it using SQL, Python, R, or other languages, as well as third-party data and analytics applications.
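
As a rough illustration of that “store as-is, analyze later” idea, here is a minimal sketch that lands raw JSON events in an S3-style object store and pulls them back for analysis with pandas. The bucket name, key layout, and event fields are hypothetical, and the snippet assumes boto3 credentials are already configured.

```python
import json

import boto3
import pandas as pd

s3 = boto3.client("s3")
BUCKET = "example-data-lake"   # hypothetical bucket name

# Land raw events exactly as they arrive: no cleansing, no schema enforcement.
raw_events = [
    {"user_id": 1, "action": "page_view", "ts": "2023-01-05T10:00:00Z"},
    {"user_id": 2, "action": "purchase", "ts": "2023-01-05T10:02:00Z", "amount": 49.99},
]
s3.put_object(
    Bucket=BUCKET,
    Key="raw/web/events-2023-01-05.json",
    Body="\n".join(json.dumps(e) for e in raw_events).encode("utf-8"),
)

# Later, an analyst pulls the raw file back and explores it directly in Python.
obj = s3.get_object(Bucket=BUCKET, Key="raw/web/events-2023-01-05.json")
df = pd.read_json(obj["Body"], lines=True)   # newline-delimited JSON
print(df.groupby("action").size())
```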

The Benefits of a Data Lake

There are several key reasons organizations choose data lakes.

Lower TCO

The cost of storing data in a data lake is typically far lower than in a database, which often requires complex infrastructure and filtering. With a data lake, you only pay for the storage space. For organizations that store massive amounts of data, this can significantly reduce the total cost of ownership (TCO).

Companies don’t need specialized hardware for a data lake, and data sets don’t need to be indexed and prepped before storage.

Simplified Data Management

When organizations deploy data lakes, they can eliminate the data silos that often exist. With a central repository, companies can also avoid the challenge of having to move data between data warehouses and data centers.

Used as a separate layer, a data lake can mitigate the costly egress fees – fees paid to transmit data out of a cloud platform – required to move data from one cloud provider to another for processing.

Flexibility

Traditional data warehouse platforms are schema-based, meaning that your data has to be stored in a specific format. When you set up your database, you have to decide its structure. Before data is added, it has to be cleansed and standardized. If new data types or formats are added, you may have to rebuild the database to accommodate them. With a data lake, you have flexibility because you can store any type or format of raw data.

Data that doesn’t fit squarely into a database slot, such as social media posts, customer support notes, images, sensor data, and other unstructured data, can still be available for data scientists.

Data Democracy

A data lake makes data available across an enterprise to any authorized user. This allows companies to build a more data-centric culture and provide middle managers and frontline employees the data access they need to make better decisions.

Accelerate Data Analytics

Data warehouses generally rely on SQL, which may be fine for simple analytics. However, more advanced data analytics may require more flexibility in assessing data. Data lakes allow for more options. For example, you can store data from multiple sources in multiple forms as raw data for data scientists.

Data is then prepared for applications such as artificial intelligence and machine learning, predictive analytics, and data mining.

Challenges with Data Lakes

While there are plenty of benefits, challenges also exist regarding data lakes that organizations need to overcome.

Data lakes make it easy to save everything. While storage is more cost-efficient in a data lake, it’s still an expense. Further, constantly evolving and generally complex legislation around the world is adding new challenges associated with storing certain kinds of potentially sensitive data. Organizations need to put parameters on what should be saved. This requires data governance policies.

Data Governance

While data doesn’t require cleaning or transforming to be stored in a data lake, an organization still needs strong governance to ensure data quality. This includes:

  • Policies and standards
  • Roles and authentication
  • Data processes
  • Data management

Data governance covers every aspect of storing and securing enterprise assets to ensure quality and accountability.

Major Data Lake Providers

The list of major data lake providers includes the major players in the cloud service provider (CSP) space:

  • Amazon Web Services (AWS)
  • Microsoft Azure
  • Google Cloud Platform
  • Hewlett Packard Enterprise (HPE)
  • IBM Cloud Computing
  • Oracle
  • Snowflake

As you can see, data lakes have significant advantages in modern environments. But knowing their strengths and weaknesses before investing a dime is crucial to getting the most out of them. Make sure they’re a good fit for your environment and business objectives first.

When it comes to Business Intelligence (BI) products, most of the reporting tools in the space presume that your organization has a data warehouse in place.

A data warehouse is most commonly a relational database with data ingested from transactional systems (also known as online transaction processing, or OLTP, systems) on a regular cadence or even in real time.

The action to ingest the data into a data warehouse is called an extract, transform, and load (ETL) process. ETL processes were traditionally run daily, but modern data demands typically require them to be run more frequently—in some cases, even streaming real-time data into the data warehouse.
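
As a hedged, self-contained sketch of that ETL flow, the example below uses sqlite3 and pandas so it can run anywhere; the table names and columns are invented, and a real pipeline would extract from the OLTP system and load a warehouse platform rather than in-memory SQLite databases.

```python
import sqlite3

import pandas as pd

# --- Extract: pull one day's orders from a (toy) OLTP system ---
oltp = sqlite3.connect(":memory:")
oltp.executescript("""
    CREATE TABLE orders (order_id INT, customer_id INT, order_date TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 10, '2023-03-01', 25.00),
                              (2, 11, '2023-03-01', 75.50),
                              (3, 10, '2023-03-02', 12.25);
""")
orders = pd.read_sql("SELECT * FROM orders WHERE order_date = '2023-03-01'", oltp)

# --- Transform: shape the rows into a fact-table-friendly form ---
orders["order_date"] = pd.to_datetime(orders["order_date"])
fact_sales = orders.rename(columns={"amount": "sales_amount"})

# --- Load: append the batch into the warehouse fact table ---
warehouse = sqlite3.connect(":memory:")
fact_sales.to_sql("fact_sales", warehouse, if_exists="append", index=False)
print(pd.read_sql("SELECT COUNT(*) AS rows_loaded FROM fact_sales", warehouse))
```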

Business analysts, data engineers, and organizational decision-makers access this data using BI tools like Power BI, Qlik, and Tableau, along with other analytical applications that make use of the data.

The Star Schema

Data in a data warehouse is commonly organized into tables within a single database. At the core of most data warehouses is a “star schema,” in which data is organized into “facts” and “dimensions.”

Facts represent quantitative pieces of information, like a sale or a website visit. These fact tables have foreign key relationships to dimension tables. Dimensions are related to the facts and describe the objects in a fact table. A sample star schema is shown in Figure 1.

Figure 1: A typical star schema in a data warehouse


This data model (also known as a dimensional model) is very efficient for queries that read data—most data warehouses also optimize for the bulk loading of fact data.

Data warehouses are known as online analytical processing (OLAP) data structures, as opposed to OLTP databases. The structure is a trade-off: the denormalized design stores some redundant data (keys are stored in both fact and dimension tables), but queries against the dimensional model are simpler (fewer joins), and many metrics are simple aggregates (sum, average, maximum) of columns in the fact tables.
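
To make that query shape concrete, here is a small sketch of the kind of single-join aggregate a star schema is built for, run with sqlite3 against tiny in-memory tables; the fact and dimension tables are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension: descriptive attributes, one row per product
    CREATE TABLE dim_product (product_key INT PRIMARY KEY, product_name TEXT, category TEXT);
    INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware'),
                                   (3, 'Ebook', 'Digital');

    -- Fact: one row per sale, with a foreign key into the dimension
    CREATE TABLE fact_sales (sale_id INT, product_key INT, quantity INT, amount REAL);
    INSERT INTO fact_sales VALUES (100, 1, 2, 20.0), (101, 2, 1, 35.0),
                                  (102, 3, 5, 45.0), (103, 1, 1, 10.0);
""")

# One join, one simple aggregate: the typical star-schema query shape.
query = """
    SELECT p.category, SUM(f.amount) AS total_sales
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY total_sales DESC
"""
for row in conn.execute(query):
    print(row)   # e.g. ('Hardware', 65.0) then ('Digital', 45.0)
```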

Dimensions, which are loaded less often, are also designed to deal with changes in data over time. Examples include the names of places, customers, or products. This concept, called “slowly changing dimensions,” allows queries to analyze historical data while accounting for changes in those descriptions.
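
A common way to implement this is a “Type 2” slowly changing dimension, where a change closes out the old row and appends a new one with fresh effective dates. The pandas sketch below is illustrative only; the customer-address example and column names are invented.

```python
import pandas as pd

# Current dimension: one active row per customer, tracked with effective dates.
dim_customer = pd.DataFrame([
    {"customer_id": 10, "city": "Austin", "valid_from": "2020-01-01",
     "valid_to": None, "is_current": True},
])

def apply_scd2(dim, customer_id, new_city, change_date):
    """Close the current row and append a new one (Type 2 change)."""
    mask = (dim["customer_id"] == customer_id) & dim["is_current"]
    dim.loc[mask, ["valid_to", "is_current"]] = [change_date, False]
    new_row = {"customer_id": customer_id, "city": new_city,
               "valid_from": change_date, "valid_to": None, "is_current": True}
    return pd.concat([dim, pd.DataFrame([new_row])], ignore_index=True)

# The customer moves; history is preserved so old facts still join to the old city.
dim_customer = apply_scd2(dim_customer, 10, "Denver", "2023-06-01")
print(dim_customer)
```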

While there are other types of schemas used in data warehouses (for example, the snowflake schema), star schemas make up most data warehouse workloads.

Since by its nature the data warehouse aggregates data from a variety of systems, its databases can become very large. Most data warehouses are terabytes in size, and it’s common for them to reach 50 TB or more. This can require a great deal of processing power, so many data warehouse systems use scale-out approaches to improve query performance. On-premises systems include Teradata and Netezza, while cloud offerings include Amazon Redshift, Azure Synapse Analytics, and Snowflake.

These systems all distribute fact data across multiple compute nodes, allowing aggregations to benefit from massively parallel processing (MPP). While the MPP products draw a lot of attention and marketing dollars, most data warehouses still run within a single server.

Data Warehouse vs. Database, Data Lake, and Data Mart

These terms are often used interchangeably, and many products in the space support using all the terms. Here’s a breakdown of each one.

Data Warehouse vs. Database

Most data warehouses are built on top of relational database management systems (RDBMSs), which are commonly referred to as database servers. While some NoSQL solutions are not built on an RDBMS, most data warehouses operate as a database running on an RDBMS.

The other confusion that sometimes comes into play is using “database” to mean OLTP systems only. This is inaccurate: if your data store has tables and runs on an RDBMS, it is indeed a database.

Data Warehouse vs. Data Lake

This distinction gets a little bit more confusing because most modern data warehouse systems support ingesting data from a data lake.

A data lake is typically an object store (think Amazon S3 or Azure Blob Storage) into which substantial amounts of files get written. Unlike a data warehouse, a data lake does not have predefined schemas.

Some data warehousing systems are built around a data lake; Databricks is a good example. Other data warehouses support ingesting data from the data lake through an ELT process. Typically, “external” tables (metadata-only objects that don’t store any data) are created in the data warehouse, and from those external tables, data is ingested into regular tables within the warehouse.

Data Warehouse vs. Data Mart

A data mart is also a dimensional model, but it represents a subset of the total data warehouse. Typically, this data is related to a single line of business, department, or geography. Since a data mart deals with a smaller data volume, it allows the team to run more in-depth queries over the data without impacting other users. The main difference between a data mart and data warehouse is that the scope and size of the data is much smaller in a data mart.

Data Warehouse Benefits

The biggest advantage of having a data warehouse is that it puts all your business’s data in a single data store, providing a single version of the truth.

This is important because reporting against operational systems is challenging in terms of both performance and data quality. While real-time reporting has its uses, reporting queries can block sales transactions or show pending transactions that may later be cancelled. Beyond those challenges, OLTP schemas are designed to optimize writing data rather than reading it.

Having a simplified star schema is good for both query performance and organization, and star schemas are well understood by business intelligence tools, which makes things easier for business analysts.

The structure also provides very good query performance, even as data volumes ramp up. This has been helped by the adoption of column storage for most fact tables, but even older versions of data warehouses performed efficiently.

Data warehouses also allow data quality and master data services to be included as part of the data ingestion process, which results in better data quality. All of this means your business users can get faster, better answers, which can lead to actionable insights and a healthier bottom line.

Better Outcomes

Data warehouses large and small are at the center of most business intelligence architectures. By offloading reporting from operational systems into a read-optimized system that’s easy for analysts and data scientists to query, your business can make better decisions. In addition, the performance optimizations and scalability available through MPP data warehouses mean that no matter how much data you have, the data warehouse can meet your performance needs.

Artificial intelligence (AI) and machine learning (ML) are often used interchangeably in the tech landscape. But are they the same thing?

The answer is no. Both combine machines with the study of cognitive science in an attempt to mimic how the human brain performs thought, learning, and mental calculation. However, they have distinct qualities that make them two separate concepts. Let’s take a look.

Artificial Intelligence vs. Machine Learning

AI is the field of computer science that aims to study and create machines that can exhibit human behavior independently, without human control.

ML is a process that allows machines to learn from data without explicit directions, using algorithms. It essentially involves an algorithm—a formula representing the relationship between certain variables—and a model, which is a program that can find patterns or make predictions on new, incoming data.

AI is a field of computer science, the same way cybersecurity, software engineering, human-computer interaction, data science, and many more are fields of computer science. ML, on the other hand, is a process in which computers learn, and is a subset of the AI umbrella.

Types of AI

Weak AI

“Weak AI” is AI trained to perform very specific tasks, and is the most commonly used AI in our everyday products. Amazon Alexa, IBM Watson, and Tesla Autonomous Driving Software all fall under this type.

Strong AI

“Strong AI” is mostly theoretical. It would have the same (or greater) cognitive capabilities as a human, including self-awareness and the ability to plan for the future. An example of strong AI would be Jarvis, the fictional program in Marvel’s Iron Man.

Machine Learning Methods

There are many different learning methods for ML algorithms, but they all boil down to three overarching approaches.

·      Supervised Learning

In supervised learning, models are trained on labeled input data, and their outputs are compared to the target (or desired) outcomes to determine how well the model performed.

Supervised learning has been used to detect spam. Given emails labeled as “spam” or “not spam,” a model learns what counts as spam. When given new emails, the model labels them as spam or not and is scored on its accuracy.
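
Here is a minimal sketch of that spam example using scikit-learn, assuming a tiny hand-labeled dataset; real spam filters train on far larger corpora and richer features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny labeled training set: 1 = spam, 0 = not spam (illustrative only).
emails = [
    "win a free prize now", "cheap meds limited offer",
    "meeting moved to 3pm", "lunch tomorrow?",
]
labels = [1, 1, 0, 0]

# Turn text into word-count features, then fit a simple classifier.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)

# Score new, unseen emails.
new_emails = ["free prize waiting for you", "can we move the meeting?"]
predictions = model.predict(vectorizer.transform(new_emails))
print(list(zip(new_emails, predictions)))   # expect roughly [1, 0]
```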

·      Unsupervised Learning

Unsupervised learning is used to describe relationships within data. Instead of aiming for a target, it takes in unlabeled data and determines what relationships the data elements may have.

Clustering is a form of unsupervised learning: given data, the model groups records based on similarities it determines on its own. One example is grouping retail products. Given product description data, the model could end up grouping products by color, by size, or by some other feature entirely.
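
As a hedged sketch of that idea, the snippet below clusters a handful of invented product descriptions with scikit-learn; the groupings the model finds depend entirely on the text it is given.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Unlabeled product descriptions (invented for the example).
products = [
    "red cotton t-shirt size small",
    "blue cotton t-shirt size large",
    "stainless steel water bottle 750ml",
    "insulated steel travel mug 400ml",
]

# No labels are provided -- the model groups items by textual similarity.
X = TfidfVectorizer().fit_transform(products)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for product, cluster in zip(products, clusters):
    print(cluster, product)   # apparel and drinkware tend to land in separate clusters
```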

·      Reinforcement Learning

Reinforcement learning doesn’t train on a prepared dataset at all. Instead, the algorithm is given a goal to achieve and a set of actions it’s allowed to perform. The feedback it receives tells it whether or not it has reached the goal.

You can see an example of this in games: a computer agent is given the rules, a point system, and the goal of getting the highest score. Through trial and error, it eventually learns how to achieve the highest score possible.
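
Below is a hedged sketch of tabular Q-learning on a toy “game”: the agent starts at position 0 on a short line and earns a reward only when it reaches the goal at position 4. The environment, rewards, and hyperparameters are invented purely to show the trial-and-error update loop.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                       # step left or right
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration rate

for episode in range(200):
    state = 0
    while state != GOAL:
        # Explore sometimes, otherwise act greedily on current Q-values.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if next_state == GOAL else 0.0
        # Q-learning update: nudge the estimate toward reward + discounted future value.
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state

# After training, the greedy policy should be "move right" (+1) in every state.
print([max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)])
```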

Subsets of Machine Learning

Within ML, specific subsets focus on different learning designs. The two most well-known subsets are neural networks and deep learning.

Neural Networks

Neural networks are a branch of ML whose design mimics how neurons in the human brain process and pass on information. They’re made up of layers of nodes, where each node holds a weight or value and passes data to nodes in the next layer with little human interference.

They’re usually used to solve problems and detect patterns on unstructured, non-linear, or otherwise complex data.
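
To ground the “layers of nodes with weights” picture, here is a toy forward pass through a two-layer network in NumPy. The weights are random, so this shows only how data flows through the layers, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# One input example with 4 features flows into a 3-node hidden layer, then 1 output node.
x = rng.normal(size=(4,))
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)   # hidden layer weights and biases
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # output layer weights and bias

hidden = np.maximum(0, W1 @ x + b1)             # ReLU activation at each hidden node
output = 1 / (1 + np.exp(-(W2 @ hidden + b2)))  # sigmoid squashes the output to (0, 1)

print(hidden, output)
```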

Deep Learning

Deep learning is what we call the group of algorithms that use neural networks to solve problems. Specifically, it uses a neural network with at least three layers. In basic ML, humans prepare the data in some way before feeding it to the model.

Deep learning, however, takes in mostly unprocessed data and learns for itself which features are important in making correct predictions.

An amazing example of deep learning is its use in AI-powered chatbots. These chatbots use a combination of natural language processing (NLP) and deep learning to understand human language.

NLP is a branch of AI devoted to teaching machines how to understand text and voice, and to respond on their own, ultimately mimicking human conversation. For AI-powered chatbots, deep learning can be used to assist in NLP goals by helping NLP systems learn the meaning of words in real time.

Breaking It Down

The main but subtle difference between AI and ML is mostly hierarchy. AI is a study and field of computer science that encompasses any technology that attempts to mimic or surpass human intelligence. ML is a subset of the AI field, and describes a process of how machines can learn patterns and make predictions on data.

The traditional data analytics architecture was pretty straightforward—organizations had a number of operational databases, which were nearly always relational databases, with the occasional file share.

An extract, transform, and load (ETL) process would sweep data from those systems as part of a batch process. The batch would typically run overnight, but in some cases more frequently—for instance, every four hours. The end result of that ETL process would be a star schema consisting of fact and dimension tables in a data warehouse, where business users ran predefined reports.

The data warehouse would be another relational database, and most of the reports would be defined as stored procedures within that database.

What Changed?

Traditional data warehouse architecture is still good; in fact, star schemas sit at the heart of many modern data analytics architectures. However, a couple of factors have dramatically increased the volume of data, with mobile phones and social media being the two biggest technical drivers.

Whether it’s ingesting data from mobile applications or using social media streams for sentiment analysis, both the volume and the velocity of inbound data changed dramatically in the 2010s. The popularity of Big Data led to the adoption of systems like Hadoop, and the subsequent development of Apache Spark, which has become a mainstay of the modern analytics architecture.

Meanwhile, traditional ETL processes struggled to keep up with the volume of these sources, and businesses also demanded closer-to-real-time reporting; it was no longer enough to see yesterday’s sales.

The other factor that changed everything was the broad adoption of cloud computing. One of the earliest features of the public cloud was object-based storage, in the form of AWS S3 and Azure Blob Storage, which exposes files as HTTP endpoints and allows nearly infinite amounts of data to be stored.

Object storage is at the center of most modern “data lake” architectures, as it allows data to be ingested quickly at large scale. Unlike the traditional ETL process, data in object storage is typically subject to “schema on read” or extract-load-transform (ELT) processing, where the transformation happens as the data is ingested into its persisted store. That store is typically a cloud data warehouse, which comes in a number of forms.
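
As a rough sketch of schema-on-read, the snippet below pulls a raw file straight out of object storage and applies structure only at read time; the bucket, paths, and column names are hypothetical, and reading s3:// URLs with pandas assumes the optional s3fs dependency is installed.

```python
import pandas as pd

# Raw clickstream files landed in the lake as-is; no table was defined ahead of time.
events = pd.read_json("s3://example-lake/raw/clickstream/2023-03-01.json", lines=True)

# The "schema" is applied here, at read time: filter rows, cast types, aggregate.
views = events.loc[events["action"] == "page_view"].copy()
views["ts"] = pd.to_datetime(views["ts"])
daily_views = views.groupby(views["ts"].dt.date).size().rename("page_views")

# In an ELT flow, a transformed result like this is what lands in the warehouse.
daily_views.to_csv("daily_page_views.csv")
```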

How the Cloud Makes Everyone an Enterprise

As data volumes grow, scaling a data warehouse out is one of the better approaches to supporting them: memory is a big constraint, and there’s a limit to how much of it will fit into a single server.

Before the advent of the cloud, scale-out data warehouses (known more specifically as massively parallel processing, or MPP, data warehouses) involved buying and licensing an expensive set of hardware from Oracle, Teradata, Microsoft, or other vendors in the space.

Now the cloud offers the ability to rent that hardware at small, large, or even changing scale. Services like Amazon Redshift, Snowflake, and Azure Synapse Analytics all make that possible. And other vendors offer similar services built on non-traditional engines, like Databricks, which is built on top of Spark.

It’s not so much that these services are cheap—the top SKU of Azure Synapse Analytics costs $300,000 per month, although the cost of entry is much lower—but that they’re more cost effective.

This is especially true of upfront costs, which can be orders of magnitude less with the cloud. Instead of buying a rack of servers and storage, you can pay a cloud vendor a few hundred dollars a month and get started with something that would have taken millions of dollars and multiple months, or even years, to get started with in the pre-cloud world. On top of that are the (often) hundreds of thousands of dollars of services costs typically associated with an on-premises data warehouse implementation.

Another key technology change that allowed for large data volumes while simultaneously improving overall performance is the proliferation of columnstore storage for data warehouse workloads.

While the implementations differ slightly between databases, the technology basics are the same. Typically, a database stores data on pages (depending on the RDBMS, these range from 4 KB to 1 MB, generally toward the smaller end). Each page holds data from the table across all columns, which allows for easy single-row updates and inserts.

Since these pages contain data from many different columns, compression applied to tables and indexes typically yields a rate of only around 30%. Also, both the number of columns and the selection of those columns have to be chosen carefully for indexes in order to balance read and write performance. This makes traditional database indexing both an art and a science.

However, data warehouses don’t typically do a lot of single-row operations; inserts and updates take place in batches. Columnstore indexes store their data grouped by columns. Since values within a single column have much more in common (for example, the many customers with a state name of “California”), this allows for much greater levels of compression.

Additionally, each column effectively becomes its own index. This means columns not referenced in a query are simply not read, reducing the overall IO impact of any given query and allowing much more efficient usage of memory. This concept even applies to data lakes, as Parquet files use a similar storage construct.
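
The same column-pruning idea is easy to see with Parquet files from Python: because data is laid out by column, a reader can pull back only the columns a query touches. The file and column names below are invented, and pandas needs pyarrow (or fastparquet) installed for this to run.

```python
import pandas as pd

# Write a small "fact" table to Parquet (columnar storage).
fact_sales = pd.DataFrame({
    "sale_id": range(1, 6),
    "state": ["California", "California", "Texas", "California", "Texas"],
    "amount": [10.0, 24.0, 13.5, 7.25, 99.0],
    "notes": ["", "gift", "", "promo", ""],
})
fact_sales.to_parquet("fact_sales.parquet")

# Read back only the columns the query needs; untouched columns are never scanned.
subset = pd.read_parquet("fact_sales.parquet", columns=["state", "amount"])
print(subset.groupby("state")["amount"].sum())
```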

Building the Modern Analytics Platform

There are a number of components beyond the cloud data warehouse and object storage that make up most modern analytics platforms (Figure 1).

Figure 1: A modern data analytics platform

One key difference is that ETL developers have largely been replaced by data engineers, who perform a similar function of distilling raw data into useful information. These engineers, though, are expected to support more data sources and have access to APIs.

The other role that has changed the way data is consumed in enterprises is the data scientist. They consume datasets that are typically produced by data engineers, and perform experiments with various machine learning techniques.

You Say You Want a Data Revolution?

Data volumes continue to increase over time, for a number of reasons. These include more disparate data sources, more sensors for data collection (for instance, Internet of Things (IoT) devices), as well as longer data retention times.

The adoption of both columnstore and massively parallel processing systems, and the lower cost of entry for organizations of all sizes, has led to the wholesale adoption of modern data analytics technologies.

Beyond the data warehouse, Spark allows more modern data paradigms like streaming and machine learning.

The broad adoption of the public cloud, and the fact that the volume and velocity of data under analysis continues to grow across all industries, means that these modern analytics paradigms will continue to grow and evolve.

The general trend of this evolution has been that even as system complexity and power have continued to grow, deployment has become easier. That doesn’t often happen in the IT world.

Self-service business intelligence (SSBI) has long been the holy grail of data warehouse architectures. Many business intelligence software vendors like Tableau, Qlik, and Microsoft (Power BI) like to make SSBI appear to be as easy as getting data into a data model and handing it off to business analysts.

The reality is that delivering a data set that allows business users to build their own reports and dashboards requires a great deal of planning, architecture, and governance to ensure that users can:

  1. Understand the data values they’re looking at
  2. Manage the created reports and dashboards to ensure their data integrity
  3. Have a consistent understanding of data and its values across the entire company

Building a Self-Service Business Intelligence Architecture

Figure 1: Self-service business intelligence architecture

Figure 1 provides a high-level view of SSBI design. There are many variations on this theme, but at its core, business intelligence gets data from source systems that provide operational functionality.

It then uses a process of extraction, transformation, and loading to ingest that data into a reporting-optimized pattern in a data warehouse (which is traditionally another database itself), called a star schema.

The star schema is made up of facts and dimensions—a fact table, for example, would house individual sales data containing keys referencing products, dates, and geography information. These keys can be joined to dimension tables further describing those properties.

While getting this data into a data warehouse allows BI developers to easily build reports on request without impacting source systems, data in a data warehouse isn’t usually at a point where business users can start using it.

For that to happen, a semantic layer is created, which helps translate esoteric column names like ProductID and EnglishProdNm into terms like Product SKU and Product Name that business users can more easily understand.

This layer is also used to help define key metrics and calculations for important business questions, which allows a consistent definition for those questions. When the semantic layer is correctly defined, it allows business users to quickly and easily drag-and-drop data elements into a reporting tool and be assured that the data is correct.
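
One lightweight way to picture a semantic layer is as a mapping from physical column names to business-friendly names, plus shared metric definitions, as in the hedged sketch below; the column names echo the examples above, and the margin metric is invented for illustration.

```python
import pandas as pd

# Physical warehouse output with esoteric column names (illustrative).
raw = pd.DataFrame({
    "ProductID": ["SKU-1", "SKU-2"],
    "EnglishProdNm": ["Widget", "Gadget"],
    "SalesAmt": [1200.0, 800.0],
    "CostAmt": [700.0, 650.0],
})

# The "semantic layer": friendly names and one agreed-upon metric definition.
FRIENDLY_NAMES = {"ProductID": "Product SKU", "EnglishProdNm": "Product Name",
                  "SalesAmt": "Sales", "CostAmt": "Cost"}

def semantic_view(df):
    view = df.rename(columns=FRIENDLY_NAMES)
    # Defined once here, so every report computes margin the same way.
    view["Gross Margin %"] = (view["Sales"] - view["Cost"]) / view["Sales"] * 100
    return view

print(semantic_view(raw))
```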


Reducing Complexity

Another goal of the semantic layer is reducing complexity within the data to simplify reporting for business users. Those SSBI users shouldn’t have to understand the difference between inner and outer joins or the basics of SQL syntax.

That simplification can be done through things like dimensional modeling, ensuring data consistency, and making sure the SSBI users can’t accidentally bring down the data warehouse with a couple of bad reports.

And those users shouldn’t have to be concerned with the complexities of your dimensional model. You should work to simplify the presentation of your physical data sources within the self-service model to make life easier for them.


Governance

While having a semantic layer and handing off reports to customers is critical, the most important part of any reporting effort is data governance. Data governance includes rules about how data is sourced, cleansed, managed, distributed, and used in your BI systems.

Data governance can also include a master data list. This helps resolve data conflicts of the type that can happen when, for example, a customer address is entered two different ways in two different systems.

It can be hard to kick off a data governance effort, especially if the company is undergoing a business intelligence project. Don’t let those challenges deter you, though—the value of governance is in ensuring that data can be trusted.

Having trusted data means better data and easier reporting. Having quality data that’s consistent will help increase the adoption of your SSBI effort, because analysts will be more apt to use those tools, instead of relying on their own legacy Excel spreadsheets complete with esoteric formulas that aren’t centrally managed.

Beyond the benefits for your business intelligence reporting, data governance has many benefits around security and metadata management that can help your organization become more data focused.


Training and Process

Building all the best tools doesn’t mean that your users will immediately find them intuitive and easy to use. The most commonly used business intelligence tool is still Excel—many business and finance users are most comfortable using spreadsheets, and will take some coaxing to use more modern BI tools.

Your users need training to get there, of course. But it goes beyond just learning new tools they’re using—they need to learn how the data works, so they can start building and expanding their own models as they get more comfortable working with their own data.

Training users isn’t as simple as running a one-time weeklong training class—you need to mentor users and ensure they’re getting the benefits of the tools and processes in place.

Another consideration is that most SSBI tools provide metadata on usage. It’s important to understand how, and how often, data is being used within the organization. This facilitates gauging adoption and focusing mentoring efforts where utilization is lower.


Worth the Effort

SSBI vendors frequently sell the ease of use of their tools and platforms without mentioning the process complexity that’s involved in building those data models.

It’s important for both IT and business organizations to understand the significant effort level involved in a successful SSBI project. Beyond just implementing a data warehouse and a semantic layer, you need to ensure data is consistent and correct throughout your systems, which also means undergoing the frequently painful process of a data governance project.

While data governance is hard, it pays dividends that go far beyond making business intelligence easier. But you won’t get there unless you ensure that your users have the training they need to be successful, and are using the tools as intended.