In September 2021, Microsoft launched a beta version of a cloud-native data governance service called Microsoft Purview. The tool promised users the ability to govern their on-premises, multicloud and SaaS data. It’s been 2 years since that date, and I have been wondering if this “new” and promising solution, although now aged, can still be called promising and what the service provides.
Together with some friends from Sii, we conducted a PoC for our client. I want to share some of our findings with you.
You can create a Microsoft Purview account in the Azure Portal site by simply creating a new resource and searching for Purview in the Marketplace. During the creation process, you will be asked to specify the Resource Group and Purview Account name. Additionally, you can configure the network connection and the platform size of your account.
For testing purposes, “4 capacity units” is more than sufficient in terms of platform size. On the network tab, it would be good to provide a “Private endpoint” for security reasons, but “All networks” set up in a non-production environment is fine.
To visualize the data in the organization, we have to create a few collections and subcollections in a tree structure to depict the layout of the architecture holding our data. Below is an example of what we came up with for the PoC.
As you can see, at the top we have the name of the account we created, then a collection called “SII”, and under that we have several subcollections showing various subsidiaries of the company. Each subsidiary has an application running and a department that uses it. In our example “Dział_1” in “Jednostka_1” is based on Power BI reports, so we created a data source that will scan all Power Bi reports on that tenant to depict what reports hold what data. For a full list of data sources that can be created directly from the purview and scan, follow the link.
That same department in ”Jednostka_2” gets its data from a SQL database, so this is the data source we created and scanned to find out what assets(data) are stored there.
For the last subsidiary, we import assets directly from the Excel template using a custom application in Python that we are building. This is a solution for importing information on custom flat file reports and outputs that are not covered by the scans of data sources. We will discuss both scanning and template import topics in more detail later.
One way to automatically obtain data information from the environment is to add a data source to a particular collection and run a scan on it. In our example, we have created a Power BI data source that can be used to scan the same tenant where we have our purview account on for resources. The result of the scan in this case will show not only Power Bi reports, but also workspaces, datasets and dashboards.
In the details of a given asset, you will be able to see under which collection path and in which hierarchy in the Power Bi data source it is located. It will also be possible to check the lineage of the data and even the schema, if the asset can be displayed as a table.
Detailed instructions for creating this configuration you can find at this link.
Bulk import of assets
Since one of the most common things, you can find in any IT environment is an Excel-based report or a flat file of data that someone uses in their daily work, there must be a way to introduce this information and data structure into the purview. To do this, some coding in Python is required.
While doing this, you will find that underneath purview actually uses apacheatlas to import assets. The coding is fairly simple, and a tutorial on how to do this with examples can be found here.
For our PoC, we built a small application for loading Excel templates using apacheatlas and a graphical interface using customtkinter. The application shows what environment you are connected to, allows you to browse your PC for files/templates to upload and perform the upload itself.
What is really worth noting at this point is that the Excel template that must be used in this process requires a significant amount of work from the uploader. We need to take the report apart and know all the details about it to fill in the template, such as:
- source type,
- column names,
- column types,
- qualified names,
- column mappings etc.
To build the lineage, we either need to already have all the components (source system, target system, loading process) in Purview, or create them using the template.
The end result is decent, as we get a nice visualization, as in the example below. We can trace our “PESEL” field, which we have in the archive system on the right, to the SAP Hana View, which gets that value from the SAP table called “SII_02.” Performance, unlike the load of even a fairly complex structure of 14 assets, takes only a few seconds which is an adventage.
If you are interested in running some tests on your own, I’ve attached a simple code sample with two import files that can be used for bulk import into Microsoft Purview. The example does not include a graphical interface.
As you can see, we were able to use Microsoft Purview, build out the collections, create automated scans of data sources, import assets from flat files and trace data lineage in the tool. Now we can try to answer some important questions.
What we can do with purview? Where we can use it for?
Creating a master data catalogue, documenting what data we use in a report or showing via a scan to a controller that we indeed do not have personal or confidential data exposed in our system.
Is this tool easy to use for someone who wants to start creating a data catalog?
Unfortunately, the entry threshold is high for new Purview users. Setting up the scan requires a very good knowledge of Azure Cloud on the administrative side, and even the bulk import requires a basic knowledge of Python. Once all that is done, the tool is simple and easy, but creating a data catalog with this this solution is a very technical task.
Is it a production-ready product?
This service has been in beta for 2 years for many reasons.
- Automated scanning of data sources requires the participation of a technical person to set up even on the same tenant. Between different tenants with additional network security would require a whole team.
- Manual entry requires coding, so you need a developer or engineer to do this. Even at that, not all functions exist to easily work with collections and asset placement.
- An intelligent asset import (or import wizard) for unsupported data sources, such as an Excel report, does not exist.
- Data lineage even between scanned sources must be adjusted manually.
- Almost all major data sources are available in Purview, but they are not complete. For example: you can have tables from all SAP types, but the views and columns that you need to create for them are not available.
Would you recommend using Purview?
I think it is worth knowing about this tool and checking what updates are available for it. It is a good solution for managing data in the cloud, especially in large organizations, and its value will increase.
Nevertheless, I would not recommend it at this stage, as the amount of work required to get it up and running outweighs the benefits it brings.