Last Saturday, March 11th, I attended the Power BI User Days, where PBIG offered a program of technical sessions, practical cases, and developments around Power BI. After the keynote, there were four rounds of one-hour sessions across five parallel tracks.
I attended the following sessions, all fascinating:
- A Citizen Developers Datawarehouse in Dataverse
- Advanced Power BI Refresh with Azure Data Factory (ADF)/Synapse
- One Power BI Dashboard to Rule them All
- PowerApps explained for Data Analysts
During this day, I also stumbled upon a subject I did not know about: Data Virtualization. As someone who loves Information Technology, I found it very interesting. In this post, I would like to share the basics of Data Virtualization based on my research, which is only the tip of the iceberg, so be prepared for more extensive posts about Data Virtualization soon.
What is Data Virtualization?
Wikipedia describes Data Virtualization as a data management method that allows an application to retrieve data from multiple sources and manipulate it without physically integrating the sources into a single repository.
It provides an abstraction layer that allows data to be accessed as if stored in one location, regardless of where it resides. This additional layer makes it easier for organizations to access and use data from disparate sources, such as databases, Data Warehouses, cloud-based applications, and even big data platforms.
Data Virtualization technology creates a virtualized layer between data consumers (applications and users) and data sources. The virtualized layer acts as a mediator that translates requests for data into the appropriate format and sends them to the relevant data sources. The data is retrieved, transformed, and returned to the application as needed.
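To make the mediator idea concrete, here is a minimal sketch in Python. Everything in it (the class, the table names, the inline sample data) is invented for illustration and does not represent any specific product's API:

```python
# Minimal sketch of the mediator idea: one query interface in front of
# several sources. All names and data here are purely illustrative.
from typing import Callable, Dict, List

class VirtualizationLayer:
    """Routes logical table names to whichever source actually holds them."""

    def __init__(self) -> None:
        self._sources: Dict[str, Callable[[], List[dict]]] = {}

    def register(self, table: str, fetch: Callable[[], List[dict]]) -> None:
        # fetch() hides the source-specific protocol (SQL, REST, file, ...)
        self._sources[table] = fetch

    def query(self, table: str) -> List[dict]:
        # Consumers only know the logical table name, never the source.
        return self._sources[table]()

layer = VirtualizationLayer()
layer.register("customers", lambda: [{"id": 1, "name": "Alice"}])  # e.g. CRM
layer.register("orders", lambda: [{"id": 10, "customer_id": 1}])   # e.g. ERP
print(layer.query("customers"))  # [{'id': 1, 'name': 'Alice'}]
```

A real platform adds query translation, push-down optimization, and security on top of this routing, but the consumer-facing contract stays the same: ask for data by name, not by location.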
One of the critical benefits of Data Virtualization is that it enables organizations to create a unified view of their data, even when that data is spread across multiple sources. The unified view makes it easier to analyze and derive insights from the data and to share it with others within the organization.
Data Virtualization allows you to make split-second decisions based on real-time data from across your organization, while reducing complexity and costs and increasing flexibility. That may sound like a pipe dream to many business managers, but this form of data utilization is already possible today. For instance, you can create an integrated customer view by combining real-time data from your ERP system, CRM solution, and webshop for Next Best Action analyses in seconds.
Data Virtualization is a hot topic and a technological development that puts the traditional, ETL-based (Extract, Transform, Load) approach to Business Intelligence and Data Warehousing in a different light. Gartner's research shows that Data Virtualization is not a passing trend but a tool for organizations to become more data-driven.
Five value factors drive the usefulness of Data Virtualization:
- Direct accessibility of data
- Integration of data sources in just a few clicks
- Real-time availability of data
- Data Virtualization is affordable
- Possibility of working Agile
I will discuss each factor in the following sections of this post.
Direct accessibility of data
In a Data Virtualization platform, data access, modeling, and usage come together in one place. One centralized platform gives you instant insight into which data is used for which purpose and who has access to which data, and it makes data directly traceable. It offers managed, or governed, self-service data integration.
You can achieve direct accessibility of data with Data Virtualization by providing a single, unified view of the data to applications and users, regardless of where the data is stored.
You can accomplish this by using a Data Virtualization platform that provides a layer of abstraction between data consumers (such as applications and users) and data sources. A Data Virtualization platform should be able to integrate with a wide range of data sources, including databases, Data Warehouses, cloud-based applications, and big data platforms.
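From the consumer's side, such a layer often looks like one ordinary SQL database. As a hypothetical example (the endpoint, credentials, and view name below are placeholders, not a real product), querying a virtual view from Python could be as simple as:

```python
# Hypothetical consumer code: the virtualization platform exposes a standard
# SQL endpoint, so ordinary database tooling works against the virtual views.
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; a real platform documents its own endpoint.
engine = create_engine("postgresql://user:secret@dv-platform:5432/virtual_db")

# One query against one virtual view; whether the rows come from the CRM,
# the Data Warehouse, or a REST API is the platform's concern, not ours.
df = pd.read_sql("SELECT * FROM customer_360 WHERE country = 'NL'", engine)
print(df.head())
```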
Once you have set up the platform, applications and users can access the data directly through the virtualization layer without worrying about the underlying data sources. This virtualization layer can make it easier for developers and analysts to access the data they need for their applications and analyses without having to write complex code or query different data sources separately. When you create a Data Virtualization platform, it is vital to keep an eye on the features below:
Support for a wide range of data sources
Your Data Virtualization platform should be able to integrate with a wide range of data sources, including both structured and unstructured data, and provide a unified view of the data to applications and users.
Real-time access
Real-time access to data on your platform is necessary so that applications and users can access the latest data without any delays.
Security and governance
Data security is not the sexiest part of building a platform, but it is essential. You should provide robust security and governance features to ensure that data is accessed only by authorized users and that the platform complies with data privacy regulations.
Scalability
The amount of available data in organizations is growing at an unprecedented rate. IDC forecasts that the volume of data created and replicated each year will reach 175 zettabytes (ZB) by 2025, up from 33 ZB in 2018, a compound annual growth rate (CAGR) of roughly 27%.
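As a quick sanity check, the growth rate follows directly from compounding the two figures over the seven years:

```python
# CAGR implied by 33 ZB (2018) growing to 175 ZB (2025), i.e. over 7 years
cagr = (175 / 33) ** (1 / 7) - 1
print(f"{cagr:.1%}")  # -> 26.9%
```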
Several factors drive data growth, including the increasing use of digital technologies such as social media, mobile devices, and the Internet of Things (IoT). These technologies generate vast amounts of data that organizations can capture, store, and analyze to provide insights into customer behavior, market trends, and operational performance.

The growth of data presents both opportunities and challenges for organizations. On the one hand, it provides a rich source of insights that organizations can use to improve business performance and drive innovation. On the other hand, managing and analyzing such large amounts of data can be overwhelming, leading to issues such as data silos, data quality problems, and difficulty in accessing and integrating data from different sources.

Data Virtualization can enable organizations to access and use their data effectively, regardless of where it resides. This places a critical condition on your Data Virtualization platform: it should be able to handle these large volumes of data and process requests quickly and efficiently, even as the amount of data and the number of users grow.
Integration of data sources in just a few clicks
A Data Virtualization solution opens up a world of data by allowing you to integrate data from operational systems, NoSQL solutions, APIs, and Big Data with data from your Data Warehouse in just a few clicks, whereas an ETL solution often forces you to buy expensive modules for this. To achieve this, use a Data Virtualization platform that provides an intuitive user interface for data integration. Key features of such a platform should be the following (a configuration sketch follows the list):
- Drag-and-drop interface: your platform should provide a drag-and-drop interface that allows users to easily connect to and integrate data sources without writing complex code or scripts.
- Pre-built connectors: a user-friendly Data Virtualization platform should come with pre-built connectors that enable users to connect to various data sources, including databases, Data Warehouses, cloud-based applications, and big data platforms.
- Automated mapping: the platform should provide automated mapping of data structures and metadata across different data sources, making it easy to integrate data from disparate sources.
- Real-time updates: the platform should be able to provide real-time updates of data changes across different data sources, ensuring that the integrated data is always up-to-date.
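To give an impression of what "a few clicks" might translate to under the hood, here is a hypothetical, declarative source registration. The connector names and fields are invented for illustration; each platform has its own format:

```python
# Illustrative only: sources registered declaratively instead of hand-coded
# pipelines. Connector names and options are invented for this sketch.
sources = {
    "crm":       {"connector": "salesforce", "objects": ["Account", "Contact"]},
    "warehouse": {"connector": "snowflake",  "schema": "SALES"},
    "webshop":   {"connector": "rest_api",   "base_url": "https://shop.example.com/api"},
}

# A platform would introspect each source's metadata and propose mappings
# automatically; this dict just shows the shape of such a mapping result.
mapping = {
    "customer_360": {
        "crm.Account.Id": "customer_id",
        "warehouse.SALES.CUSTOMERS.CUST_ID": "customer_id",
        "webshop.orders.customer": "customer_id",
    }
}
```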
By providing an easy-to-use interface for data integration, Data Virtualization can help organizations streamline the process of integrating data from different sources, reducing the time and effort required to build data pipelines and workflows. This interface can enable organizations to access and use their data more effectively, leading to better decision-making and improved business outcomes.
Real-time availability of data
Real-time is the new norm. Because you access a source system directly from a virtual layer, you enable real-time reporting, analysis, and decision-making, unlike a Data Warehouse, where you often have to deal with batch processing.
You can achieve real-time availability with Data Virtualization by combining caching, federation, and replication techniques.
Data Caching
Caching stores frequently accessed data in memory or on disk for quick retrieval, reducing the latency of fetching data from the source. Data caching can provide fast, near-real-time access to constantly changing data, such as stock prices or weather data.
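A common pattern behind this is cache-aside with a time-to-live: serve repeated requests from memory and only hit the source when the cached copy is too old. A minimal sketch, with all names invented for illustration:

```python
# Cache-aside with a TTL: repeated reads are served from memory; the source
# is only queried when the cached value has expired. Purely illustrative.
import time
from typing import Any, Callable, Dict, Tuple

_cache: Dict[str, Tuple[float, Any]] = {}

def cached_fetch(key: str, fetch: Callable[[], Any], ttl_seconds: float = 60.0) -> Any:
    now = time.monotonic()
    hit = _cache.get(key)
    if hit is not None and now - hit[0] < ttl_seconds:
        return hit[1]            # still fresh: skip the source entirely
    value = fetch()              # fall through to the real source
    _cache[key] = (now, value)
    return value

# Example: a 5-second TTL keeps fast-moving data reasonably fresh.
price = cached_fetch("stock:CONTOSO", lambda: 412.5, ttl_seconds=5)
```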
Data Federation
Data federation combines data from multiple sources on the fly without physically moving or replicating data. Data federation can provide real-time access to data from disparate sources, such as Data Warehouses, cloud-based applications, and big data platforms.
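The integrated customer view mentioned earlier is a textbook federation case. Here is a toy sketch with pandas, where fetch_crm and fetch_erp are stand-ins for live, source-specific calls:

```python
# On-the-fly federation: combine two live sources at query time without
# landing the data anywhere first. The fetch functions are stand-ins.
import pandas as pd

def fetch_crm() -> pd.DataFrame:
    return pd.DataFrame({"customer_id": [1, 2], "segment": ["gold", "silver"]})

def fetch_erp() -> pd.DataFrame:
    return pd.DataFrame({"customer_id": [1, 2], "open_orders": [3, 0]})

# The integrated customer view, assembled at query time:
customer_view = fetch_crm().merge(fetch_erp(), on="customer_id")
print(customer_view)
```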
Data Replication
Data replication involves copying data from one source to another, either in real time or periodically, to ensure that the data is always available. Data replication can help you create a data backup or provide real-time access to constantly changing data, such as customer transactions or inventory levels.
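In its simplest form, replication is a loop that copies changed rows to a replica on a schedule. Real systems use change-data-capture rather than full scans; this toy version only shows the shape:

```python
# Toy periodic replication: copy changed rows from source to replica.
# Real systems use change-data-capture; this only illustrates the idea.
import time

def replicate_once(source: dict, replica: dict) -> None:
    for key, row in source.items():
        if replica.get(key) != row:   # naive change detection
            replica[key] = dict(row)  # copy the changed row

source = {"sku-1": {"stock": 5}}
replica: dict = {}
for _ in range(2):                    # stand-in for a scheduler
    replicate_once(source, replica)
    time.sleep(0.1)
print(replica)                        # {'sku-1': {'stock': 5}}
```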
Data Virtualization is affordable
Data Virtualization solutions are cheaper to purchase and maintain than traditional ETL solutions. Creating a virtual layer decreases the cost of ETL, data replication, and physical data storage. To keep Data Virtualization affordable, take the following considerations into account:
Open source solutions
Several open source Data Virtualization solutions are free to use and can provide many of the same features and benefits as commercial solutions. These solutions may require more technical expertise to set up and maintain, but they can be cost-effective if your organization has a limited budget.
Cloud-based solutions
Cloud-based Data Virtualization solutions can be more affordable than on-premise solutions, as they typically require less up-front investment in hardware and infrastructure. Cloud providers also offer pay-as-you-go pricing models, allowing organizations to pay only for the resources they use.
Hybrid approaches
Organizations can use a hybrid approach to Data Virtualization, combining on-premise and cloud-based solutions to balance performance, security, and cost considerations. For example, you can use on-premise Data Virtualization for sensitive data that your organization needs to keep in-house and cloud-based Data Virtualization for less sensitive data that you can store in the cloud.
Scalability considerations
You should also consider the scalability of Data Virtualization solutions when evaluating their affordability. Scalability refers to the ability of the solution to handle increasing volumes of data and users over time. Choosing a scalable solution can help organizations avoid costly and disruptive migrations to new systems in the future.
Total cost of ownership (TCO) considerations
Every organization is managed on a financial foundation: budgets. Consider the total cost of ownership of a Data Virtualization solution over its lifecycle, taking into account factors such as licensing fees, maintenance and support costs, and the training and expertise required. Choosing a solution with a low TCO can help an organization maximize the value of its Data Virtualization investment.
Possibility of working Agile
Because a Data Virtualization environment significantly reduces complexity, you can work in a more Agile way. Mainly by automating manual operations, Data Virtualization lets you deliver new insights up to five times faster while reducing development and maintenance costs by 50%. If you want to take an Agile approach to Data Virtualization, you can follow the path below:
Identify business needs and requirements
Begin by identifying the business needs and requirements that Data Virtualization can help address. The identification process can involve working closely with business stakeholders to understand their pain points, data needs, and desired outcomes.
Build a cross-functional team
Form a cross-functional team that includes data experts, business stakeholders, and IT professionals. This team should work together to define user stories, prioritize tasks, and manage sprints.
Start with a minimum viable product (MVP)
Begin by developing a minimum viable product (MVP) that addresses the most critical business needs and requirements. An MVP can help ensure alignment with business priorities and the quick delivery of business value.
Adopt an iterative approach
Adopt an iterative approach to development, using sprints to develop and test new features and functionality. This approach enables organizations to respond quickly to feedback and adjust as needed.
Leverage automation and pre-built connectors
To accelerate development and reduce errors, leverage automation and pre-built connectors, including using tools for data integration, mapping, and quality management.
Focus on data governance
Ensure that data governance is a vital part of the development process. Establish clear policies and standards for data management and ensure that data quality is a top priority.
Measure and optimize
Finally, measure the success of the Data Virtualization solution over time and make continuous improvements based on feedback and performance metrics.
Final Thoughts
When you start with Data Virtualization, it is essential to build your foundation on a robust, scalable Data Virtualization platform that can handle large volumes of data and process requests in real time. The platform should also be able to integrate with a wide range of data sources and provide a unified view of the data to applications and users.
Shout out to my friends at Bottleneck-it for inviting me and my colleague Danny de Wijs to the event. I had a great and very inspiring day at the Jaarbeurs in Utrecht, with many new insights into how Power BI, Azure Data Factory (ADF), Dataverse, and Power Apps all complement each other. I will make sure to integrate these new insights and inspiration into the blogs I'm going to post soon. Feel free to contact me if you have questions or any additional advice/tips about this subject. If you want to be kept in the loop when I upload a new post, subscribe, and you will receive a notification by e-mail.