Rare x Rare x Rare in Customer Data Management


I once had an operations management professor who asked the class how often we would expect a product to be defective if it was made of 10 components, each of which had a 1% defect rate, if a single component failure would result in the entire product not working.

The math is pretty simple:

99% x 99% x 99% x 99% x 99% x 99% x 99% x 99% x 99% x 99% = 90.4%
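For anyone who wants to check the professor's arithmetic, here is a minimal Python sketch. It assumes the component failures are independent of one another, and the function name `working_rate` is just an illustrative label, not anything from the class.

```python
def working_rate(component_defect_rate: float, n_components: int) -> float:
    """Chance that every component works, assuming independent failures."""
    return (1.0 - component_defect_rate) ** n_components


# Ten components, each with a 1% defect rate
rate = working_rate(0.01, 10)
print(f"{rate:.1%}")  # 90.4%
```

Raising the component count makes the point even more sharply, since every additional part multiplies in another 99%.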

Only 90.4% of the finished products would work? That doesn’t seem good at all! Considering that there are very few manufactured products — especially electronic ones — that have only ten critical parts, it was an eye-opening insight (albeit obvious in hindsight).

The point the professor was making was that there are many cases where “99% perfect” really isn’t good enough when that one part is considered in a larger context.

Of late, I’ve had a few run-ins with the opposite insight. Stick with me — it’ll be fun!

For starters, customer data management processes are not automated manufacturing processes. Customers are people, and people are messy!

In a manufacturing environment, a key way to drive quality is to remove as much variability as possible by strictly controlling the environment. Customers (people) are none too keen about being “strictly controlled.” From a pure (read: manufacturing) customer data perspective, what we’d like is:

  • To have every human assigned a unique ID
  • To have every human log into a system once a week and update all sorts of meta data about themselves:
    • Who they are related to and how (using those people’s unique IDs)
    • What products they own
    • How old they are
    • How much they weigh
    • What their favorite flavor of ice cream is
    • What political party they support
    • …and so on
  • To enter all of this information from drop-down lists so that all of the data is structured
  • To have them be very, very careful when they update this information, maybe even swearing that the data is perfectly accurate, under penalty of severe consequences

Obviously, that ain’t gonna happen.

Our processes to manage customer data are very different from manufacturing processes for one simple reason: the data does not have to be perfect. It has to be good enough for us to effectively interact with our customers.

Here is where a similar example to the one that started this post comes into play. When working on processes that deal with customer data (creating or maintaining it), we all develop use cases and scenarios to ensure that we are keeping the data as accurate as possible. It is exceedingly easy (and awfully tempting) to start working with scenarios that are theoretically possible but not very probable. If we’re not wearing our Hat of Practicality, we will find ourselves developing processes that are so inordinately complex that one of two things happens:

  • We never get the new process implemented because it collapses under its own developmental weight, or
  • We implement it, but it is so complex that it collides with itself and starts generating bad customer data!

Is this sounding theoretical? I’ll illustrate with an example.

A couple of weeks ago, I ran into an issue with a new data cleansing process that we are introducing. The process sends customer name and address data to a third-party service (all over obscenely secure channels and with no more personal information than could easily be found in a phone book or through Yahoo! People). During testing, we came across some unexpected behavior in how the vendor handled hyphenated last names. The initial proposal was to throw out responses for any customer who had a hyphenated last name. Something seemed amiss with that approach.

I thought up the most plausible scenario I could where the returned data would actually be incorrect, and it looked like this (I’ll spare you the details as to why this scenario was the most plausible — just trust me):

  • John Smith marries Mary Jones and they both keep their original surnames
  • They have a son named John Smith-Jones
  • When John Smith-Jones is a teenager, he becomes a customer of the same company of which his dad is a customer
  • When John Smith-Jones graduates from high school, he moves out of the house, while both he and his father remain customers of that same company

In this scenario…the process would be a little broken — in a way that the customer (the father, in this case) would probably understand and would definitely be able to easily correct.

So, here comes the math. Without doing any research beyond my own gut-based estimates from 37 years of experience on planet Earth, I made conservative estimates for all of the variables involved:

  • The percent of all married couples in the U.S. where both parties have kept their original surnames: 1%
  • The percent of all kids in the U.S. with hyphenated surnames: 0.5%
  • The percent of all kids in the U.S. who share the same name as their mother or father: 2%
  • How often a kid in the U.S. is a separate, distinct customer of the same company that his parents are (in the particular space this company is in) at the point that he/she leaves home: 75%

The calculation works just the opposite from the original equation, in that it is an “AND” situation rather than an “OR” situation: all of these factors have to be met for the process to make an erroneous customer data update (as opposed to any one of the components having to be defective for the final product to be defective).

1% x 0.5% x 2% x 75% = 0.000075% (!!!)

If my estimates were accurate, which they almost assuredly were not, then we would make this customer data error roughly once for every one million customers. If you think about it, you realize that the absolute accuracy of the small percentages just doesn’t really matter once those small percentages start multiplying. Let’s say I was off by a factor of four on my estimate of the percent of kids with hyphenated last names, so the formula above should have 2% where it had 0.5%. That ups the likelihood of this data error occurring to 3 times in a million rather than less than one — given the highly non-catastrophic nature of the error, this is still an “almost never” when it comes to looking at the types of other, more critical customer data errors that occur day in and day out.
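The multiplication and the factor-of-four what-if can be sketched in a few lines of Python. This is only an illustration: the variable names are made up for the example, the numbers are the gut-based estimates from this post, and multiplying them together assumes the four factors are independent of one another.

```python
from math import prod


def joint_probability(factors):
    """Chance that all conditions hold at once, assuming they are independent."""
    return prod(factors)


# Gut-based estimates from the post, written as fractions
both_keep_surnames = 0.01   # couples where both kept their original surnames
hyphenated_kid = 0.005      # kids with hyphenated surnames
same_name_as_parent = 0.02  # kids who share a name with a parent
separate_customer = 0.75    # kid is a separate customer when leaving home

base = joint_probability(
    [both_keep_surnames, hyphenated_kid, same_name_as_parent, separate_customer]
)

# What if the hyphenated-surname estimate was off by a factor of four?
revised = joint_probability(
    [both_keep_surnames, 0.02, same_name_as_parent, separate_customer]
)

print(f"{base * 1_000_000:.2f} per million")     # 0.75 per million
print(f"{revised * 1_000_000:.2f} per million")  # 3.00 per million
```

Quadrupling one input quadruples the result, but a handful of errors per million is still a very small number.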

In this case, there was another factor that I could have applied, and that was, for those one million customers, how many would be affected in any given year? 13% is a fair estimate of how many people move each year in the U.S., which means we would need to apply that percentage to the original result…and we’re back to “effectively never” for our likelihood of occurrence.
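To put a rough number on “effectively never,” here is a small sketch that applies the 13% move rate to the joint estimate. It treats moving as independent of the other factors, which is an assumption on my part, and the variable names are illustrative only.

```python
# Joint estimate from the post: 1% x 0.5% x 2% x 75%
error_rate = 0.01 * 0.005 * 0.02 * 0.75

annual_move_rate = 0.13  # the post's estimate of how many people move each year
customers = 1_000_000

errors_per_year = customers * error_rate * annual_move_rate
years_between_errors = 1 / errors_per_year

print(f"{errors_per_year:.2f} expected errors per year")       # 0.10
print(f"{years_between_errors:.1f} years between errors")      # 10.3
```

Under these estimates, a base of one million customers would see this error about once a decade.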

There are a couple of caveats here, and they’re important:

  • I came up with one scenario. If there were four other plausible scenarios that were all equally likely to occur, then I would need to multiply the final result by five. In this case, we’re still talking a very small number, but there may be cases where a particular process gap could cause problems in a long tail’s worth of scenarios and may need to be viewed differently
  • It’s worth vetting the estimates somewhat — not through extensive research, necessarily, but at least by running them by a couple of sharp people to see if they pass the sniff test
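The first caveat can be sketched the same way. The five scenarios below are hypothetical (the original plus four imagined ones of equal likelihood), and the sketch assumes they are independent of each other. It compares the simple "multiply by five" shortcut with the exact chance that at least one scenario occurs.

```python
# Joint estimate for the one scenario worked through in the post
scenario = 0.01 * 0.005 * 0.02 * 0.75

# Hypothetical: four more scenarios, each as likely as the first
scenarios = [scenario] * 5

# Shortcut: just add the probabilities (same as multiplying by five)
simple_sum = sum(scenarios)

# Exact chance that at least one scenario occurs, assuming independence
none_occur = 1.0
for p in scenarios:
    none_occur *= 1.0 - p
exact = 1.0 - none_occur

print(f"{simple_sum * 1_000_000:.2f} per million")  # 3.75 per million
print(f"shortcut overstates by {simple_sum - exact:.1e}")
```

For probabilities this small the shortcut and the exact figure are practically identical, so multiplying by the number of scenarios is a fine sniff test.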

In this example, we were deep into testing — well past the point where code updates could be made without introducing risk to the overall implementation. To me, it was a no-brainer — proceed as planned!

The pushback I’ve received in other, similar situations has been: “Well, yeah, that’s only one person in a million. But…what about that one person?!” THAT gets us to my next post, which will be about Type 1 vs. Type 2 errors and cognitive dissonance when it comes to both knowing that the status quo is bad but also assuming the status quo is right. More on that next time!

Photo by Bill Burris

5 Comments


  1. Tim,
    Great post again, and I think you hit on a key point that marketers can often make a black to white flip – going from ignoring all data problems at all, to a sudden need for absolute perfection. Neither is a good option.

    Looking at the probabilities of an issue and the downside risk is key in understanding whether a potential data problem is a real one.

    As an aside, I love the use of quantum physics probability calculations in your imagery – how perfectly appropriate…

  2. Good post Tim.
    What if that one out of million case is Tim Wilson?
    Is your son having early signs of wanting to blog about data and analytics?

  3. Pingback Type I vs. Type II Errors in Customer Data Management | Gilligan on Data by Tim Wilson

  4. Pingback The Inertia of the Status Quo | Gilligan on Data by Tim Wilson
