
Wednesday, 27 June 2012

DQS Cleansing Transform in SSIS 2012

Data Quality Services (DQS) is a new product in SQL Server 2012 (code-named Denali) which provides data cleansing capabilities. This post describes how to make use of the DQS Cleansing Transformation, which is part of the default set of SSIS data flow components.

The DQS Cleansing transform can be found in the Data Flow Toolbox.


The first thing we need to set in the DQS Cleansing Transformation Editor is the Data Quality Connection Manager. This will point to our DQS installation residing on a SQL instance.



Once the connection manager has been created, you select the Knowledge Base (KB) you want to use. The current CTP comes with a default “DQS Data” KB, which contains domains that you’d use for address validation (at TechEd they mentioned that the final release may include additional domains out of the box). Selecting the Knowledge Base you want to use will bring up its list of domains.


There are two types of domains in this list; regular Domains (ex. City, State, Zip), and Composite Domains (also called CDs), which are made up of two or more regular domains. Composite Domains allow you to validate multiple fields as a single unit. For example, the “Company – US” composite domain in the default DQS Data KB is made up of Company Name, Address Line, City, State and Country. This lets you validate that “Microsoft Corporation” (Company Name) exists at “One Redmond Way” (Address Line), “Redmond” (City), “WA” (State), “USA” (Country) (given that you have the appropriate CD rules in your knowledge base, or the CD is attached to an online reference data provider). DQS would flag the entry as incorrect if you had “Seattle” as the City – even though Seattle is a valid city name, that’s not where the Microsoft office is.

There are two ways to make use of Composite Domains in SSIS:
  1. A single (string) column – for this to work, all values must appear in the same order as the domains do. So using the “Company – US” example above, your column values would need to look like this: Microsoft Corporation, One Redmond Way, Redmond, WA, USA (see the T-SQL sketch after this list)
  2. Multiple columns – if you map a column to each domain of a composite domain, the row will be cleansed using the composite domain logic. If you have not mapped each domain from the composite domain, the columns will be cleansed individually.
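
As a hedged illustration of the single-column option, the query below builds one string column whose parts follow the composite domain's order. The table and column names (dbo.Customers and its columns) are hypothetical; only the ordering requirement comes from DQS.

    -- Hypothetical source table. The concatenation order must match the
    -- composite domain's order: Company Name, Address Line, City, State,
    -- Country for the "Company - US" composite domain.
    SELECT CompanyName + ', ' + AddressLine + ', ' + City + ', '
           + StateCode + ', ' + CountryCode AS CompanyUS
    FROM dbo.Customers;
    -- Map the single CompanyUS column to the composite domain on the
    -- transform's Mapping tab.
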
The Mapping tab allows you to select the columns you want to cleanse, and map them to domains in your knowledge base.



Note that the Domain drop-down will automatically filter out domains with incompatible data types (for example, it won’t show domains with a String data type if you are using a DT_I4 column).
The Advanced tab has a number of different options – most of which control the columns that will be included in the output.


  • Standardize output: Standardizes the output according to the domain settings (which can be defined in the DQS client application). There are two kinds of standardization: formatting the output to Upper/Lower/Capitalize, and correcting to the leading value (relevant in the case of synonyms). You can see how this is defined in the client, under Domain Management | Domain Properties.
  • Confidence: A score that is given to any correction or suggestion. This score reflects to what extent the DQS server (or the relevant Reference Data Source) has confidence in the correction/suggestion.
  • Reason: If the output is different from the original value, this field explains why. For example, the value can be Invalid because of a domain rule, or Corrected by the DQS cleansing algorithm, standardized to the leading value, etc.
  • Appended Data (additional data received from the reference data provider): This setting is used when there are domains attached to a Reference Data Source (RDS). In this case, the RDS sometimes returns additional information – not only values associated with the mapped domains. For example, when sending an address, it can also return Lat/Long. The Appended Data field includes this additional information.
  • Appended Data Schema: If the RDS returned additional information, this field contains the schema describing how to interpret that data.
  • Encrypt connection: Determines whether the connection to SQL Server will be encrypted (using SSL). (Note: this setting will most likely be moved to the Connection Manager in the future.)


After mapping one of my input columns (StateProvinceName) to one of the domains (State), I check the Confidence and Reason boxes on the Advanced tab, and click OK to save the changes. If I connect the path to another transform and look at its metadata (right click on the path, Edit … and click the Metadata tab), I can see that four columns were added for each input column I chose to cleanse – Corrected_<column>, Status_<column>, Confidence_<column> and Reason_<column>. The first two show up by default – the last two are there because of the options I selected on the Advanced tab.

The Corrected column will contain the value for the field, whether or not it was corrected by DQS. This is most likely the field you’ll want to use later on in your data flow.

The Status column will contain the result of the correction operation. Possible values are:

  • Correct: The value was already correct, and was not modified.
  • Invalid: The value was marked as invalid for this domain.
  • Corrected: The value was incorrect, but DQS was able to correct it. The Corrected column will contain the modified value.
  • Unknown: The value wasn’t in the current domain, and did not match any domain rules. DQS is unsure whether or not it is valid.
  • Suggestion: The value wasn’t an exact match, but DQS has provided a suggestion. If you include the Confidence field, you could automatically accept rows above a certain confidence level, and redirect others to a separate table for later review (see the sketch below).
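
As a hedged sketch of that accept/redirect pattern: assuming the cleansed rows have been landed in a staging table (dbo.State_Cleansed is a hypothetical name) carrying the Corrected_/Status_/Confidence_ columns described above, queries like the following could split automatic accepts from rows needing review. The 0.8 threshold is an arbitrary example, not a DQS default.

    -- Hypothetical staging table holding the DQS Cleansing output columns.
    -- Auto-accept anything marked Correct or Corrected, plus Suggestions
    -- above an (arbitrary) confidence threshold.
    SELECT Corrected_StateProvinceName
    FROM dbo.State_Cleansed
    WHERE Status_StateProvinceName IN ('Correct', 'Corrected')
       OR (Status_StateProvinceName = 'Suggestion'
           AND Confidence_StateProvinceName >= 0.8);

    -- Everything else goes to a review queue for a data steward.
    SELECT *
    FROM dbo.State_Cleansed
    WHERE Status_StateProvinceName IN ('Invalid', 'Unknown')
       OR (Status_StateProvinceName = 'Suggestion'
           AND Confidence_StateProvinceName < 0.8);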


Runtime

Some things to note about the DQS Cleansing transform:
  • It is implemented as an Asynchronous component
  • It does not expose an Advanced Editor (although this might change for the final release)
  • It sends incoming data to the DQS server for cleansing
    • When using an RDS, this data may in turn be sent to the SQL Azure Data Market provider


Data Quality Services Server/Client in SQL Server 2012

DQS Server is installed as part of the SQL Server Setup. It includes databases for storing the knowledge base information and the policies and rules that are applied to business data during cleansing and matching activities. It also includes the DQS engine, which is what applies all the information and rules in the databases.

DQS Client is a stand-alone client that can be used by people in different roles to clean data, maintain the knowledge bases or configure DQS Server. You have to have sufficient permissions, either on the SQL Server as a whole or as a specific DQS role, in order to open the DQS Client and connect to the appropriate DQS Server.

DQS Client allows users and administrators to perform key duties: knowledge base management, data quality projects and administration.

Knowledge Base Management

As described in the previous post, the concept of a knowledge base is core to DQS. Within the knowledge base, you have a number of domains. A domain is a representation of a type of data that is used by the business, such as a customer name, an address, or a contact phone number. For each of these domains, there are certain values which can be trusted and certain values that are invalid. The example I used in the previous post was of counties. County would be a domain, with values like Nottinghamshire and Berkshire as trusted values because they are known to be real counties. You can include synonyms, so that the value of Notts is known to be referring to the same thing as Nottinghamshire.

Within the Knowledge Base Management part of the DQS Client, you can create and manage domains, set the trusted values and give examples of invalid values. Each knowledge base contains one or more managed domains.

Defining your knowledge base and all the domains, values and rules within it can be done manually through the DQS Client. It can also be done through the Knowledge Discovery process, which automatically generates rules that can then be approved and included in the knowledge base.

Data Quality Projects

Once you’ve built your knowledge base and created the rules which are applicable to your data, you need to use them. The Data Quality Projects section of the client tool lets you apply your knowledge base for either cleansing of data or data matching.

Cleansing projects apply rules to find data that is incomplete, inaccurate or not in line with your policies. DQS applies the rules from your knowledge base where applicable and can come up with suggestions if there are some values which are not in the domain, but which are similar to domain values. These suggestions allow you to catch typos without having to think of a rule for every single mistake that might possibly be made.

Matching projects pull in the data and look for records which are similar. DQS applies the matching rules defined in the client and, using things like synonyms and similarity of phrases, highlights duplicate records. It can combine records where there is a definite match or highlight ones where there is a possible match, including a confidence score based on the similarity of the records and the specifics of the rules.

Both of these projects then allow you to export the cleaned data so that it can be used by your systems.

Administration

This section of the tool lets you track on-going activities within DQS and update the configuration settings.

A data steward can use this part of the tool to view what processes are in progress within DQS. That person can then choose to stop processes if required.
This is also the place to define the settings for using reference data. The idea of reference data is that you can pull in information from an external source, for example the Azure data marketplace, and use that as a comparison to clean up internal data (e.g. matching addresses to postcodes). You would use the administration part of the DQS Client to configure these connections.
You can also change some settings in this section, for things like logging and notifications.

So you can see that the DQS Client tool is a vital component of a data quality project, because this is where so much of the DQS functionality is driven from.

Installing Data Quality Services

Data Quality Services is an option in the main install, but when you go to run the Data Quality Client, you’ll get a message stating that DQS is not installed. As far as I can tell, DQS needs to be installed manually by running DQSInstaller.exe, which you can find in the SQL Server Binn directory. This will create two SQL Server databases:
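
For reference, on a default SQL Server 2012 instance the installer typically sits at the path below; the exact folder is an assumption on my part and will vary with your instance name and install drive.

    REM Default-instance path (an assumption); adjust MSSQL11.MSSQLSERVER
    REM to match your instance.
    "C:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Binn\DQSInstaller.exe"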


Data Quality Client

Once DQS is configured, you’ll be in a position to use the Data Quality Client, which is a Windows application, available in 32- or 64-bit versions. Once you connect, you’ll get the following screen:


The knowledge base is the key to how DQS works, being utilised to determine the data quality of your source data. You can create your own knowledge base, using your own data, or even cloud-based data. For this example, I’m going to use the built-in knowledge base called DQS Data.
As an example, I’ve created some data that I want to cleanse. It’s Adventure Works country data that I’ve put into a separate table that I’ve called dbo.Country.
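
As a rough sketch of how that test table could have been built (assuming the AdventureWorks2012 sample database, and that Person.CountryRegion is the source of the country names – both assumptions on my part), including the deliberate typos described below:

    -- Copy the Adventure Works country names into a standalone table.
    -- AdventureWorks2012.Person.CountryRegion is an assumed source.
    SELECT [Name]
    INTO dbo.Country
    FROM AdventureWorks2012.Person.CountryRegion;

    -- Introduce the deliberate typos mentioned later in this post.
    UPDATE dbo.Country SET [Name] = 'Algerian' WHERE [Name] = 'Algeria';
    UPDATE dbo.Country SET [Name] = 'Albana' WHERE [Name] = 'Albania';
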
My task now is to clean this data, which I can do by creating a data quality project:


I’ve called my project Countries, and I’ve picked the built-in DQS Data knowledge base, which I noticed contains reference data for countries. The activity that I’ve selected is Cleansing, and then I’ve clicked Create.

DQS will then prompt for mapping the source data to one of the domains in the Knowledge Base. Here I’ve mapped my country name to the Country domain from the knowledge base:


After clicking the Start button on the next screen, the cleaning process starts, which gives the following results:

I put a couple of typos into my source data, such as changing Algeria to ‘Algerian’ and Albania to ‘Albana’. These were picked up by DQS, along with a few others, but a user has the opportunity to approve or reject, via the radio buttons shown above. I chose to approve the first two, and then clicked Next. In the final screen, DQS allows you to output the cleaned data to a separate table, as shown below:


If you go into the table that DQS has created, you will see that there is a Name_Status column, which holds the cleaning status on a per record basis:


This can be used to update the source data, if required, and therefore address the data quality issues.
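
A minimal sketch of that update, assuming the export went to a table called dbo.Country_Cleansed and that, alongside Name_Status, it holds the original value as Name_Source and the cleansed value as Name_Output (these names are assumptions; check what the export actually created):

    -- Hypothetical export table and column names; adjust to match yours.
    UPDATE c
    SET    c.[Name] = x.Name_Output
    FROM   dbo.Country AS c
    JOIN   dbo.Country_Cleansed AS x
           ON x.Name_Source = c.[Name]
    WHERE  x.Name_Status = 'Corrected';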


Data Quality in SQL Server 2012

Data Quality Services is a new set of functionality in SQL Server 2012, based around the concept of knowledge-driven quality.


The idea behind this concept is that there is a big difference between what’s invalid according to a computer system and what’s invalid according to common sense. A system might have a field for a person’s age, and consider a value valid as long as it is given in numerical form as an integer. The system wouldn’t have a problem with an age listed as 222. A human being looking at this would spot it as an anomaly and guess that the person entering the age mistyped 22. Another example could be counties, stored as string values. A human being would know that Nottinghamshire is a valid county but Narnia isn’t. An automated system wouldn’t spot the mistake.

This is where a knowledge-driven solution comes in. The idea is to take these pieces of knowledge that we think of as common sense and store them in such a way that the business applications and databases can check values against a knowledge base. This knowledge base might include a list of counties so you can have the systems check all values entered in the County field of a database and look for anomalies.

You then have various rules for how these anomalies are treated. You might create automatic rules, for example to change the abbreviation Notts to the full Nottinghamshire so that you can have consistent values without needing a human being to make any changes. You might also have people, known as data stewards, who are alerted to suspected anomalies so that they can check whether values need to be fixed.

A data quality project tends to have two elements to it. One is an initial fix to clean up bad data. This is known as a data cleansing project. As the name implies, the end goal is to have a set of clean data. The tools look through the data, transforming values to match a standard, flagging outlying values that might be anomalies and suggesting changes that could be made. It also hunts for possible duplicates through data matching, applying policies to look for entries in the database that might refer to the same thing. For example, the entries for Jess Meats and Jessica Meats might be flagged up as a possible match. After the cleansing and matching, the output is delivered as a set of data that has been cured of as many inaccuracies as possible.

The second part of a data quality project is what happens next to keep the data clean. As with Master Data Management, this isn’t a fix-once act. It’s very easy for data quality issues to creep back in after the cleansing has taken place, so an implementation of Data Quality Services needs to bear in mind what should happen next. The processes and policies need to be defined to ensure that the data quality knowledge base is used in future to maintain the quality of the data. It’s also important to identify the data stewards who will be responsible for fixing any problems the knowledge base flags.

It’s also important to think of the knowledge base as an on-going project. Things change. Data changes. The knowledge base should therefore also change. The set of knowledge and rules within the knowledge base can grow over time, bringing more control and accuracy to your data. As more data passes through the knowledge base, it becomes more tuned to picking out anomalies and better at identifying what the correct value should be.

A Data Quality Services project should include both the plan for how to clean the data initially and how to maintain quality moving forward.



Before either of these can start, however, you need to define what you want your data to look like. A common example is a male/female field. This seems like it should be simple, but one system might store this as male or female, another might have m or f, another might have 0 or 1, yet another might have 0 and 1 again, but with the digits referring to the opposite gender. If you’re merging systems, moving data between systems or pulling together data from multiple systems for reporting, you can have chaos. A key part of a data quality project is working out what you want to be the correct value. Once you’ve done that, you can start applying the rules to change the other values so you end up with consistency across your whole data set.
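
As a minimal sketch of applying such a rule once the canonical values are agreed (the table, column and the ‘M’/‘F’ standard are all hypothetical here):

    -- Normalize one source system's gender codes to the agreed standard.
    -- Each source needs its own mapping, since (as noted above) '0'/'1'
    -- can mean opposite things in different systems.
    UPDATE dbo.CustomerImport_SystemA
    SET Gender = CASE Gender
                     WHEN 'male'   THEN 'M'
                     WHEN 'female' THEN 'F'
                     WHEN '0'      THEN 'M'  -- System A's convention (assumed)
                     WHEN '1'      THEN 'F'
                     ELSE Gender             -- leave anomalies for a data steward
                 END;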

So when you’re starting to work with Data Quality Services, first take a look at your existing data and decide what you’d like it to look like. Then you can do your data cleansing and data matching to give yourself a clean and accurate set of data to start with. Then you need to hook your knowledge base into your processes to ensure data quality moving forward.