Wednesday 27 June 2012

DQS Cleansing Transform in SSIS 2012

Data Quality Services is a new SQL Server product in Denali which provides data cleansing capabilities. This post describes how to make use of the DQS Cleansing Transformation which is part of the default set of SSIS data flow components.

The DQS Cleansing transform can be found in the Data Flow Toolbox.


The first thing we need to set in the DQS Cleansing Transformation Editor is the Data Quality Connection Manager. This will point to our DQS installation residing on a SQL instance.



Once the connection manager has been created, you select the Knowledge Base (KB) you want to use. The current CTP comes with a default “DQS Data” KB, which contains domains that you’d use for address validation (at TechEd they mentioned that the final release may include additional domains out of the box). Selecting the Knowledge Base you want to use will bring up its list of domains.


There are two types of domains in this list; regular Domains (ex. City, State, Zip), and Composite Domains (also called CDs), which are made up of two or more regular domains. Composite Domains allow you to validate multiple fields as a single unit. For example, the “Company – US” composite domain in the default DQS Data KB is made up of Company Name, Address Line, City, State and Country. This lets you validate that “Microsoft Corporation” (Company Name) exists at “One Redmond Way” (Address Line), “Redmond” (City), “WA” (State), “USA” (Country) (given that you have the appropriate CD rules in your knowledge base, or the CD is attached to an online reference data provider). DQS would flag the entry as incorrect if you had “Seattle” as the City – even though Seattle is a valid city name, that’s not where the Microsoft office is.

There are two ways to make use of Composite Domains in SSIS:
  1. A single (string) column – for this to work, all values must appear in the same order as the domains do. So using the “Company – US” example above, your column values would need to look like this: Microsoft Corporation, One Redmond Way, Redmond, WA, USA
  2. Multiple columns – If you map a column to each domain of a composite domain, the row will be cleansed using the composite domain logic. If you have not mapped each domain from the composite domain, the columns will be cleansed individually
The Mapping tab allows you to select the columns you want to cleanse, and map them to domains in your knowledge base.



Note that the Domain drop down will automatically filter out columns with incompatible data types (for example, it won’t show domains with a String data type if you are using a DT_I4 column).
The Advanced tab has number of different options – most of which control the columns that will be included in the output.


OptionDescription
Standardize outputEnables to standardize output according to domain settings (that can be defined in the DQS client application). There are two kinds of standardization. First is formatting output to Upper/Lower/Capitalize. Second is correcting to leading value (this is relevant in case of synonyms). You can see how this is defined in the client, Domain Management | Domain Properties tab.
ConfidenceA score that is given to any correction or suggestion. This score reflects to what extent the DQS server (or the relevant Reference Data Source) has confidence in the correction/suggestion.
ReasonIn case that the output is different than the original value, this field explains why. For example, it can be Invalid because of domain rule. Or, it can be Corrected because of utilizing DQS Cleansing algorithm, standardized to leading value, etc.
Appended Data
(additional data received from the reference data provider)
This setting is used when there are domains attached to a Reference Data Source (RDS). In this case, sometimes the RDS returns additional information – not only values associated with the mapped domains. For example, when sending address, it can return also Lat/Long. The Appended Data field includes this additional information.
Appended Data SchemaIf the RDS returned additional information, the schema field consists of the schema of how to interpret this data.
Encrypt connectionThis determines whether the connection to SQL Server will be encrypted (using SSL). (Note, this setting will most likely be moved to the Connection Manager in the future)


After mapping one of my input columns (StateProvinceName) to one of the domains (State), I check the Confidence and Reason boxes on the Advanced tab, and click OK to save the changes. If I connect the path to another transform and look at it’s metadata (right click on the path, Edit … and click the Metadata tab), I can see that four columns were added for each input column I chose to cleanse – Corrected_<column>, Status_<column>, Confidence_<column> and Reason_<column>. The first two show up by default – the last two are there because of the options I selected on the Advanced tab.

The Corrected column will contain the value for the field, whether or not it was corrected by DQS. This is most likely the field you’ll want to use later on in your data flow.

The Status column will contain the result of the correction operation. Possible values are

StatusDescription
CorrectThe value was already correct, and was not modified
InvalidThe value was marked as invalid for this domain
CorrectedThe value was incorrect, but DQS was able to correct it. The Corrected column will contain the modified value.
UnknownThe value wasn’t in the current domain, and did not match any domain rules. DQS is unsure whether or not it is valid.
SuggestionThe value wasn’t an exact match, but DQS has provided a suggestion. If you include the Confidence field, you could automatically accept rows above a certain confidence level, and redirect others to a separate table for later review.


Runtime

Some things to note about the DQS Cleansing transform:
  • It is implemented as an Asynchronous component
  • It does not expose an Advanced Editor (although this might change for the final release)
  • It sends incoming data to the DQS server for cleansing
    • When using an RDS, this data may be in turn sent to the SQL Azure Data Market provider


No comments:

Post a Comment