Data Quality Services is a new set of functionality in SQL Server 2012, built around the concept of knowledge-driven data quality.
The idea behind this concept is that there is a big difference between what’s invalid according to a computer system and what’s invalid according to common sense. A system might have a field for a person’s age and consider a value valid as long as it’s entered as an integer, so it wouldn’t have a problem with an age listed as 222. A human being looking at that would spot the anomaly and guess that whoever entered the age mistyped 22. Another example is counties stored as string values: a human being knows that Nottinghamshire is a valid county and Narnia isn’t, but an automated system wouldn’t spot the mistake.
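To make that gap concrete, here’s a tiny sketch in Python (nothing to do with DQS itself, purely an illustration): a type check happily accepts an age of 222, while a common-sense rule about plausible ages flags it. The 0–120 range is my own assumption for the example.

```python
# Illustrative only: the gap between "valid to the system" and "valid to a human".

def passes_type_check(value: str) -> bool:
    """What a typical system validates: is the value an integer?"""
    return value.isdigit()

def passes_domain_rule(value: str) -> bool:
    """What a human would apply: is the age actually plausible? (assumed 0-120)"""
    return value.isdigit() and 0 <= int(value) <= 120

for age in ["22", "222", "abc"]:
    print(age, passes_type_check(age), passes_domain_rule(age))
# "222" passes the type check but fails the common-sense rule.
```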
This is where a knowledge-driven solution comes in. The idea is to take the pieces of knowledge we think of as common sense and store them in such a way that business applications and databases can check values against a knowledge base. That knowledge base might include a list of counties, so the system can check every value entered in a County field and flag anomalies.
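As a rough illustration of the idea (in practice this lives in a DQS knowledge base rather than hand-rolled code), the sketch below checks County values against a reference list and reports anything it doesn’t recognise. The county names and records are made-up sample data.

```python
# Sketch of a knowledge-base lookup: flag county values not in the reference list.

known_counties = {"Nottinghamshire", "Derbyshire", "Leicestershire"}

records = [
    {"name": "Jess Meats", "county": "Nottinghamshire"},
    {"name": "A. Example", "county": "Narnia"},
]

anomalies = [r for r in records if r["county"] not in known_counties]
for r in anomalies:
    print(f"Suspect county value: {r['county']!r} for {r['name']}")
```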
You then have various rules for how these anomalies are treated. You might create automatic rules, for example expanding the abbreviation Notts to the full Nottinghamshire, so that values stay consistent without a human being needing to make any changes. You might also have people, known as data stewards, who are alerted to suspected anomalies so that they can check whether values need to be fixed.
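Conceptually, the automatic rules behave something like the sketch below. In DQS these would be defined as synonyms and term-based relations in the knowledge base rather than written as code, and the mappings here are invented for illustration.

```python
# Sketch of correction rules: auto-correct known abbreviations, accept known
# values, and route anything else to a data steward for review.

corrections = {"Notts": "Nottinghamshire", "Derbys": "Derbyshire"}
known_counties = {"Nottinghamshire", "Derbyshire", "Leicestershire"}

def cleanse_county(value: str):
    """Return (value, status) for a single county value."""
    if value in corrections:
        return corrections[value], "auto-corrected"
    if value in known_counties:
        return value, "valid"
    return value, "for data steward review"

for county in ["Notts", "Nottinghamshire", "Narnia"]:
    print(county, "->", cleanse_county(county))
```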
A data quality project tends to have two elements. The first is an initial fix to clean up bad data, known as a data cleansing project. As the name implies, the end goal is a set of clean data. The tools look through the data, transforming values to match a standard, flagging outlying values that might be anomalies, and suggesting changes that could be made. They also hunt for possible duplicates through data matching, applying policies to look for entries in the database that might refer to the same thing. For example, the entries for Jess Meats and Jessica Meats might be flagged up as a possible match. After the cleansing and matching, the output is delivered as a set of data that has been cured of as many inaccuracies as possible.
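The matching step is essentially fuzzy comparison. The sketch below uses simple string similarity from Python’s standard library to show the idea; DQS has its own matching policies and scoring, so the 0.8 threshold and the names here are illustrative assumptions, not how DQS works internally.

```python
# Sketch of duplicate detection: compare every pair of names and flag
# pairs whose similarity crosses an (arbitrary) threshold.

from difflib import SequenceMatcher
from itertools import combinations

names = ["Jess Meats", "Jessica Meats", "John Smith"]

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for a, b in combinations(names, 2):
    score = similarity(a, b)
    if score >= 0.8:  # illustrative threshold for a "possible match"
        print(f"Possible duplicate: {a!r} / {b!r} (score {score:.2f})")
```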
The second part of a data quality project is what happens next to keep the data clean. As with Master Data Management, this isn’t a fix-once exercise. It’s very easy for data quality issues to creep back in after the cleansing has taken place, so an implementation of Data Quality Services needs to consider what should happen next. Processes and policies need to be defined to ensure that the data quality knowledge base is used in future to maintain the quality of the data. It’s also important to identify the data stewards who will be responsible for fixing any problems the knowledge base flags.
It’s also important to think of the knowledge base as an ongoing project. Things change. Data changes. The knowledge base should therefore also change. The set of knowledge and rules within the knowledge base can grow over time, bringing more control and accuracy to your data. As more data passes through the knowledge base, it becomes better tuned to picking out anomalies and better at identifying what the correct value should be.
A Data Quality Services project should include both a plan for how to clean the data initially and a plan for how to maintain quality moving forward.
Before either of these can start, however, you need to define what you want your data to look like. A common example is a male/female field. This seems like it should be simple, but one system might store the values as male or female, another might have m or f, another might have 0 or 1, and yet another might have 0 and 1 again but with the digits referring to the opposite gender. If you’re merging systems, moving data between systems, or pulling together data from multiple systems for reporting, you can have chaos. A key part of a data quality project is working out what the correct value should be. Once you’ve done that, you can start applying the rules to change the other values so you end up with consistency across your whole data set.
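Here’s a sketch of what that standardisation might look like, assuming four made-up source systems with the encodings described above. The point is simply that every source value gets mapped to one agreed canonical form.

```python
# Sketch of standardising a gender field drawn from several source systems.
# The per-system encodings are invented to mirror the examples in the text.

canonical = {"M", "F"}

system_mappings = {
    "crm":     {"male": "M", "female": "F"},
    "hr":      {"m": "M", "f": "F"},
    "billing": {"0": "M", "1": "F"},   # this system happens to use 0 = male
    "legacy":  {"0": "F", "1": "M"},   # this one uses the opposite encoding
}

def standardise(system: str, value: str) -> str:
    mapping = system_mappings.get(system, {})
    result = mapping.get(str(value).strip().lower())
    return result if result in canonical else "UNKNOWN"

print(standardise("billing", "0"))  # M
print(standardise("legacy", "0"))   # F
print(standardise("hr", "F"))       # F
```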
So when you’re starting to work with Data Quality Services, first take a look at your existing data and decide what you’d like it to look like. Then do your data cleansing and data matching to give yourself a clean and accurate set of data to start with. Finally, hook your knowledge base into your processes to ensure data quality moving forward.