Thursday, June 2, 2011

A quick review of SolidQ's latest e-book, on Data Quality and Master Data Management

The folks at SolidQ are providing a free e-book:

Data Quality and Master Data Management with Microsoft SQL Server 2008 R2
Dejan Sarka, Davide Mauri

I had a look through the book and I'm providing this tiny synopsis/review. The book can be found here.

Other than marketing blurbs, I haven't really been exposed to MDM until I read the book. So, I am starting more-or-less from scratch.

Chapter 1, "Master Data Management" seems to be a decent primer. It presents a background on relational and dimensional models, and where they and MDM fits in the universe.

Chapter 2, "Master Data Services Concepts and Architecture", discusses MDS in detail. I think that this gave me a good feel for how MDS is supposed to work.

Nonetheless, I think that most of that could be gleaned from other sources. The hidden gems here are in the second half of the book, which is comprised of chapters 3 and 4, "Data Quality and SQL Server 2008 R2 Tools" and "Identity Mapping and De-duplicating", respectively. These chapters are dense with information and technique-heavy, discussing various algorithms. T-SQL, CLR and SSIS tactics to find and quantify data quality issues are presented. The math and queries can become fairly complex, and the authors state that the book is not for newbies. but I think that they are eye-opening with respect to what can be done.

My biggest complaints are:
I wanted more depth on integrating MDS into new and existing LOB applications, particularly with identity mapping between the MDS and (possibly a multitude) of LOB applications. If I have two LOB applications, each with their own surrogate key for customer data, how do I manage keeping everything updated?

I would  like to see more information on updating information that is in the MDS. For example, when a client changes addresses, do I have to manually go in and tweak things through the MDS web interface? Do I have to go through the whole import process again? What if I have 40 million customers and I get thousands of changes a day? Is there a programmatic interface for doing this? Is there some way to replicate changes back to LOB apps? Do I remove the customer-address-update page from my LOB apps and rely entirely on MDS? How would CSRs submit change to this address data?

Another thing that I am not 100% clear on is what attributes belong in the MDS and what do not. Obviously, a customer's address does belong in the MDS but something like a customer account balance  would not. If we are talking about products, then inventory levels do not belong in the MDS, but what about something like shelf location in a warehouse? Do parts lists make for good hierarchies? Etc. I presume that there even finer shades of grey than the rough examples I am giving here.

Generally, I think that the books is definitely worth reading (at twice the price!) for anyone who is working with MDS or who has a data cleansing project of any kind in front of them.