Monday, November 25, 2013

Reference vs Master

Had an interview this morning with some industry analysts who were researching Enterprise Architecture subjects. Per usual, the subject of Master Data Management came up. As sometimes happens, I verbalized something quite important that I previously hadn't had a chance to write down.

In the conversation, we went through the last eleven (11) master data management initiatives I've been involved with and looked for the common thread. In every engagement where I was called in because they were failing, and in every engagement where we collectively failed, it was because Master Data was attempted before Reference Data. In every case where the engagement went smoothly, or where we were able to get things back on track, it was because we prioritized a single case of Reference Data and then iterated.

This seems glaringly obvious in retrospect. Prior to this impromptu postmortem, I hadn't realized our collective experiences painted such a clear picture of how not to screw something up.

If you are thinking about Master Data Management, you first need to build good Reference data. The underlying reason is that most people, even architects, don't fully understand and recognize the difference. Because we don't always separate these types of data, we don't always prioritize properly; we end up chasing a moving, complicated target, and that increases the likelihood of challenges leading to failure.

Reference data is non-volatile, exists independent of business process and interactions, and is globally identifiable. Master data is slow-changing and can exist independent of a business process. A geographic location (lat/long) is reference data, whereas a physical address is master data. An address, like many kinds of master data, is built from reference data. For example, an address may include a geographic location. The name of the building might come from master data but the name of the region or country would come from reference data. Why is a country name reference data but the name of the building is not? Because of the global identification. If the globally identified data comes from an outside source, managed externally, then it is reference data. Understanding which components of your data constructs are reference instead of master is the first, most crucial step.
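To make that distinction concrete, here is a minimal sketch in Python. The types, field names, and sample values are my own illustration rather than anything from a real engagement, but they show master data (an address) built partly from reference data (a country code and a lat/long):

```python
from dataclasses import dataclass

# Reference data: non-volatile, globally identifiable, managed outside the
# organization. ISO 3166 country codes are a classic example (tiny sample here).
COUNTRY_NAMES = {"US": "United States", "CA": "Canada", "DE": "Germany"}

@dataclass(frozen=True)
class GeoLocation:
    """A lat/long pair: reference data, globally identifiable and unchanging."""
    latitude: float
    longitude: float

@dataclass
class Address:
    """Master data: slow-changing, and built partly from reference data."""
    building_name: str     # master data, named and managed internally
    street: str            # master data
    country_code: str      # key into externally managed reference data (ISO 3166)
    location: GeoLocation  # reference data component

    @property
    def country_name(self) -> str:
        # Resolved from the reference set, not stored or managed locally.
        return COUNTRY_NAMES[self.country_code]

office = Address("Main Campus", "123 Elm St", "US", GeoLocation(47.61, -122.33))
print(office.country_name)  # -> United States
```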

Why did the prioritization of reference data become a leading indicator of success? Because reference data is non-volatile and can withstand the winds of change within an organization. It gives you an immovable target with quantified, known complexity that can be addressed. And like other forms of data, it needs to be published, consumed, syndicated, replicated, secured, and so forth. Rather than figure out how these functional and technical capabilities must be delivered and iterated within your organization using volatile data of potentially unknown complexity, you can first provide these functions using data constructs of known complexity and fixed definition. Only once you have a tried and tested cadence for the cyclic functions, and proven templates for the iterative functions, can you then embrace more complex and volatile master data.

A real-world anecdote comes from a conversation I had a few short months ago with a colleague struggling to get traction on a Master Data Management engagement. This organization had multiple sources of customer data and, like many organizations, had determined they needed a master customer record. But how do you wrangle twenty-seven systems (no joke, they have 27 customer relationship management systems!) and their dependent systems into all using a single master? To start with, they all have different formats, fields, and definitions. They had many different ways to identify a master record, and many allowed multiple records for a single customer for various purposes. When you expanded to include the downstream systems that relied on those CRM databases, the number exploded to over a hundred. No one wanted to take on the challenge, and I don't blame them.

We made a plan, which they followed, and they are now on their sixth successful iteration.

Rather than try to consolidate entire records straight away, we just started with customer name. They created an extremely simple reference data set of all customer names by pulling extracts from all systems. Every entry was given a unique key that matched the key from the system it came from. This list (10+ million records) was run through de-duplication and data quality software. What resulted was a single list of unique names with unique identifiers. Each unique identifier had alternate keys attached for as many of the source systems as had references for that customer. Then we published it at a fixed location and made it available in several ways (SQL, XML, CSV, etc.).
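The shape of that result is easier to see in code than in prose. The sketch below is my own simplification: the function names and field names are invented, and the case-folding "matching" is a toy stand-in for the commercial de-duplication and data quality software they actually used. What it does show is the crosswalk idea, where one master identifier carries alternate keys back to every source system:

```python
import csv
import uuid
from collections import defaultdict

def normalize(name: str) -> str:
    """Toy stand-in for the de-duplication / data quality software; real
    matching is far more sophisticated than case-folding and whitespace trims."""
    return " ".join(name.upper().split())

def build_customer_name_master(extracts: dict) -> list:
    """extracts maps a source system id (e.g. 'CRM_07') to rows pulled from
    that system, each with a native 'source_key' and a 'customer_name'."""
    master = {}
    for system_id, rows in extracts.items():
        for row in rows:
            match_key = normalize(row["customer_name"])
            entry = master.setdefault(match_key, {
                "master_id": str(uuid.uuid4()),       # single unique identifier
                "customer_name": row["customer_name"],
                "alternate_keys": defaultdict(list),  # source system -> native keys
            })
            entry["alternate_keys"][system_id].append(row["source_key"])
    return list(master.values())

def publish_csv(master: list, path: str) -> None:
    """Publish at a fixed location; SQL, XML, and other formats follow the same idea."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["master_id", "customer_name", "system_id", "source_key"])
        for entry in master:
            for system_id, keys in entry["alternate_keys"].items():
                for source_key in keys:
                    writer.writerow([entry["master_id"], entry["customer_name"],
                                     system_id, source_key])
```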

Each system then undertook an exercise, at its own pace, of reconciling its data with the master list. Some used replication and pulled the master list nightly, simply making the single list their new source. Others used a combination of queries and manual entry to reconcile. By calendar week 8, 30% of the systems were using the customer name master list as a fully integrated data source.
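A consuming system's side of that nightly pull might look something like the sketch below. Again, this is hypothetical and assumes the CSV layout from the earlier sketch; the point is only that each system can map its own native keys onto the master identifiers without changing anything else about itself:

```python
import csv

def load_crosswalk(master_csv_path: str, my_system_id: str) -> dict:
    """Nightly reconciliation sketch: map this system's native customer keys to
    the master identifiers in the published customer name master list.
    Assumes the hypothetical CSV layout from the publishing sketch above."""
    crosswalk = {}
    with open(master_csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            if row["system_id"] == my_system_id:
                crosswalk[row["source_key"]] = row["master_id"]
    return crosswalk

# A consuming system might run this on a schedule, treat the result as its new
# source for customer names, and queue any local keys missing from the
# crosswalk for manual reconciliation.
```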

It only took two weeks for the data mastering crew to clean and prep the customer name master, which they then handed off to the support team, who helped all those systems consume that data. By calendar week 6, the data mastering crew handed over an address master to the support team. By calendar week 12, 30% of the systems were using the address master lists as fully integrated data sources.

By calendar week 8, the data mastering crew handed over employee and organization master lists to the support team. By calendar week 12, they had handed off office, contact information, and account.

The organization is 5 months in, and they have 90% integration across all systems for customer, address, employee, and organization. Along the way they have established and iterated patterns and processes for replicating data, auditing compliance, publishing secure feeds, and publishing secure subsets of data by organization. They even decommissioned two smaller systems just by cleaning and securing access to a single list of lists. There are 11 more planned for decommissioning over the next 6 months.

They got their footing by focusing on reference data. Every time they take on a new data set, they start with the reference data. Separating it out lets them decouple the application from the data just enough to make progress. Once they get the pattern in place, they iterate to add complexity and features and expand the data set. It's methodical, direct, and mostly non-threatening.

At the time, I didn't have a codified reason for why starting with reference data was so important. Now I've written down why it was good advice.