Microsoft Purview Data Quality is a comprehensive solution that empowers governance domain and data owners to assess and oversee the quality of their data ecosystem, facilitating targeted actions for improvement. In today's AI-driven landscape, the reliability of data directly impacts the accuracy of AI-driven insights and recommendations. Without trustworthy data, there's a risk of eroding trust in AI systems and hindering their adoption.
Poor data quality or incompatible data structures can hamper business processes and decision-making capabilities. Microsoft Purview Data Quality addresses these challenges by offering users the ability to evaluate data quality using no-code/low-code rules, including out-of-the-box (OOB) rules and AI-generated rules. These rules are applied at the column level and aggregated to provide scores at the levels of data assets, data products, and governance domains, ensuring end-to-end visibility of data quality within each domain.
Microsoft Purview Data Quality also incorporates AI-powered data profiling capabilities, recommending columns for profiling while allowing human intervention to refine these recommendations. This iterative process not only enhances the accuracy of data profiling but also contributes to the continuous improvement of the underlying AI models.
By applying Microsoft Purview Data Quality, organizations can effectively measure, monitor, and enhance the quality of their data assets, bolstering the reliability of AI-driven insights and fostering confidence in AI-based decision-making processes.
Fabric data estate in OneLake, including shortcut and mirrored data estates. Data quality scanning is supported only for Lakehouse Delta tables and Parquet files.
Mirrored data estate: Azure Cosmos DB, Snowflake, Azure SQL
Shortcut data estate: AWS S3, GCS, ADLS Gen2, and Dataverse
Azure Synapse serverless and data warehouse
Azure Databricks Unity Catalog
Snowflake
Google BigQuery (private preview)
Important
Data quality for Parquet files is designed to support:
A directory with Parquet part files. For example: ./Sales/{Parquet Part Files}. The fully qualified name must follow https://(storage account).dfs.core.windows.net/(container)/path/path2/{SparkPartitions}. Make sure the directory/sub-directory structure contains no {n} patterns; it must be a direct FQN leading to {SparkPartitions}.
A directory with partitioned Parquet files, partitioned by columns within the dataset, such as sales data partitioned by year and month. For example: ./Sales/{Year=2018}/{Month=Dec}/{Parquet Part Files}.
Both of these essential scenarios, which present a consistent Parquet dataset schema, are supported. Limitation: arbitrary hierarchies of directories with Parquet files are not supported. We advise customers to present data in one of the two structures described above.
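As an illustration, the two supported layouts can be checked with a short script. This is a hypothetical sketch (the helper name and path-checking logic are not part of Purview); it simply tests whether every Parquet file sits either directly under the dataset root or under column-partition directories of the form key=value:

```python
import re
from pathlib import PurePosixPath

# Matches a Hive-style partition directory such as "Year=2018" or "Month=Dec"
PARTITION_DIR = re.compile(r"^[^=/]+=[^=/]+$")

def is_supported_layout(root: str, parquet_paths: list[str]) -> bool:
    """Hypothetical check for the two supported layouts:
       (1) ./Sales/{part files}                     - part files directly under the root
       (2) ./Sales/Year=2018/Month=Dec/{part files} - column-partitioned directories
    """
    root_depth = len(PurePosixPath(root).parts)
    for p in parquet_paths:
        rel = PurePosixPath(p).parts[root_depth:]
        dirs, fname = rel[:-1], rel[-1]
        if not fname.endswith(".parquet"):
            return False
        # Every intermediate directory must be a key=value partition folder;
        # arbitrary nested directories are not supported.
        if not all(PARTITION_DIR.match(d) for d in dirs):
            return False
    return True

print(is_supported_layout("Sales", ["Sales/part-0001.parquet"]))                       # layout (1)
print(is_supported_layout("Sales", ["Sales/Year=2018/Month=Dec/part-0001.parquet"]))   # layout (2)
print(is_supported_layout("Sales", ["Sales/region/part-0001.parquet"]))                # unsupported
```

The same rule applies at any partition depth: each directory level between the root and the part files must encode exactly one column value.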
Currently, Microsoft Purview can run data quality scans only with Managed Identity as the authentication option. Data Quality services run on Apache Spark 3.4 and Delta Lake 2.4.
Out-of-the-box rules to measure six industry-standard data quality dimensions (completeness, consistency, conformity, accuracy, freshness, and uniqueness)
Custom rule creation features, including a number of out-of-the-box functions and expression values
Auto-generated rules with an AI-integrated experience
Data quality score at the rule level (the quality score for a rule applied to a column)
Data quality score for data assets, data products, and governance domains (one governance domain can have many data products; one data product can have many data assets; one data asset can have many columns)
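The hierarchy described above (rule scores roll up to columns, columns to data assets, assets to data products, products to governance domains) can be sketched as a simple roll-up. The averaging used here is an illustrative assumption, not Purview's published scoring formula, and all score values are hypothetical:

```python
# Illustrative roll-up of data quality scores through the hierarchy:
# rule scores -> column -> data asset -> data product -> governance domain.
# NOTE: simple averaging is an assumption for illustration only; Purview's
# actual aggregation formula may differ.

def average(scores: list[float]) -> float:
    return sum(scores) / len(scores)

# Rule-level scores per column of one data asset (hypothetical values)
asset_columns = {
    "customer_id": [100.0, 95.0],   # e.g. uniqueness + completeness rules
    "email":       [80.0],          # e.g. conformity rule
}

column_scores = {col: average(rules) for col, rules in asset_columns.items()}
asset_score = average(list(column_scores.values()))

# One data product may have many assets; one domain many products.
product_score = average([asset_score, 90.0])   # second asset score assumed
domain_score = average([product_score, 85.0])  # second product score assumed

print(f"asset={asset_score:.1f} product={product_score:.1f} domain={domain_score:.1f}")
```

The point of the sketch is the direction of aggregation: a single bad rule score surfaces at every level above it, which is what gives domain owners end-to-end visibility.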
One of the key features of Purview Data Quality is its ability to apply data quality rules to the logical construct of critical data elements (CDEs), which then propagate down to the physical data elements that comprise them. By defining data quality rules at the CDE level, organizations can establish specific criteria and thresholds that CDEs must meet to maintain their quality.
An actions center for data quality, with actions to address data quality anomaly states, including diagnostic queries that help a data quality steward zero in on the specific data to fix for each anomaly state.