Wednesday, 27 June 2012

Data Quality Services Server/Client in SQL Server 2012

DQS Server is installed as part of the SQL Server Setup. It includes database for storing the knowledgebase information and the policies and rules that are applied to business data during cleansing and matching activities. It also includes the DQS engine, which is what applies all the information and rules in the databases.

DQS Client is a stand-alone client that can be used by people in different roles to clean data, maintain the knowledge bases or configure DQS Server. You have to have sufficient permissions, either on the SQL Server as a whole or as a specific DQS role, in order to open the DQS Client and connect to the appropriate DQS Server.

DQS Client allows users and administrators to perform key duties: knowledge base management, data quality projects and administration.

Knowledge Base Management

As described in the previous post, the concept of a knowledge base is core to DQS. Within the knowledge base, you have a number of domains. A domain is a representation of a type of data that is used by the business, such as a customer name, an address, or a contact phone number. For each of these domains, there are certain values which can be trusted and certain values that are invalid. The example I used in the previous post was of counties. County would be a domain, with values like Nottinghamshire and Berkshire as trusted values because they are known to be real counties. You can include synonyms, so that the value of Notts is known to be referring to the same thing as Nottinghamshire.

Within the Knowledge Base Management part of the DQS Client, you can create and manage domains, set the trusted values and give examples of invalid values. Each knowledge base contains one or more managed domain.

Defining your knowledge base and all the domains, values and rules within it can be done manually through the DQS Client. It can also be done through the Knowledge Discovery process to automatically generic rules that can then be approved and included in the knowledge base.

Data Quality Projects

Once you’ve built your knowledge base and created the rules which are applicable to your data, you need to use them. The Data Quality Projects section of the client tool lets you apply your knowledge base for either cleansing of data or data matching.

Cleansing projects apply rules to find data that is incomplete, inaccurate or not in line with your policies. DQS applies the rules from your knowledge base where applicable and can come up with suggestions if there are some values which are not in the domain, but which are similar to domain values. These suggestions allow you to catch typos without having to think of a rule for every single mistake that might possibly be made.

Matching projects pull in the data and look for records which are similar. DQS applies the matching rules defined in the client and, using things like synonyms and similarity of phrases, highlights duplicate records. It can combine records where there is a definite match or highlight ones where there is a possible match, including a confidence score based on the similarity of the records and the specifics of the rules.

Both of these projects then allow you to export the cleaned data so that it can be used by your systems.

Administration

This section of the tool lets you track on-going activities within DQS and update the configuration settings.

A data steward can use this part of the tool to view what processes are in progress within DQS. That person can then choose to stop processes if required.
This is also the place to define the settings for using reference data. The idea of reference data is that you can pull in information from an external source, for example the Azure data marketplace, and use that as a comparison to clean up internal data (e.g. matching addresses to postcodes). You would use the administration part of the DQS Client to configure these connections.
You can also change some settings in this section, for things like logging and notifications.

So you can see that the DQS Client tool is a vital component of a data quality project because this is where so much of DQS functionality is implemented from. 

 Installing Data Quality Services

Data Quality Services is an option in the main install, but when you go to run the Data Quality Client, you’ll get a message stating that DQS is not installed. As far as I can tell, DQS needs to be installed manually, by running the DQSInstaller.exe, which you can find in the SQL Server Binn directory. This will create two SQL Server databases:


Data Quality Client

Once DQS is configured, you’ll be in a position to use the Data Quality Client, which is a windows application, available in 32 or 64 bit. Once you connect, you’ll get the following screen:


The knowledge base is the key to how DQS works, being utilised to determine the data quality of your source data. You can create your own knowledge base, using your own data, or even cloud-based data. For this example, I’m going to use the built in knowledge base called DQS Data.
As an example, I’ve created some data that I want to cleanse. It’s adventure works country data that I’ve put into a separate table that I’ve called dbo.Country.
My task now is is to clean this data, which I can do by creating a data quality project:


I’ve called my project Countries, and I’ve picked the built-in DQS Data knowledge base, which I noticed contains reference data for countries. The activity that I’ve selected is Cleansing, and then I’ve clicked Create.

DQS will then prompt for mapping the source data to one of the domains in the Knowledge Base. Here I’ve mapping my country name to the Country domain from the knowledge base:


After clicking the Start button on the next screen, the cleaning process starts, which gives the following results:

I put a couple of typos into my source data, such as changing Algeria to ‘Algerian’ and Albania to ‘Albana’. These were picked up by DQS, along with a few others, but a user has the opportunity to approve or reject, via the radio buttons shown above. I chose to approve the first two, and then clicked next. In the final screen, DQS allows you to output the cleaned data to a separate table, as shown below:


If you go into the table that DQS has created, you will see that there is a Name_Status column, which holds the cleaning status on a per record basis:


This can be used to update the source data, if required, and therefore address the data quality issues.


No comments:

Post a Comment