Email Address is a Really Bad Key!
At its heart, record matching is about uniqueness. We need to find what makes a record unique and then identify all the other records, in that data set and others, that share the properties that make it unique. The combination of attributes, or columns, that makes a record unique is a unique key for that record in that data set (BTW, a unique key is not necessarily a primary key).
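To make the idea of a unique key concrete, here is a minimal Python sketch that tests whether a combination of columns uniquely identifies every record in a data set. The function name `is_unique_key` and the sample rows are made up for illustration; they are not taken from any real data set or from Golden Record.

```python
from collections import Counter

def is_unique_key(records, columns):
    """Return True if no two records share the same values for `columns`."""
    counts = Counter(tuple(r.get(c) for c in columns) for r in records)
    return all(n == 1 for n in counts.values())

# Hypothetical sample data, invented for this example.
people = [
    {"first": "Ben", "last": "Taub", "city": "Detroit"},
    {"first": "Ben", "last": "Taub", "city": "Houston"},
    {"first": "Ann", "last": "Lee", "city": "Houston"},
]

print(is_unique_key(people, ["first", "last"]))          # False
print(is_unique_key(people, ["first", "last", "city"]))  # True
```

Keep in mind that a combination that is unique in one data set may not be unique in the next one you load.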
Think about person data and take my personal situation as an example. According to this website, there are 129,507 people in the US named Ben. Other websites have different counts, but you get the idea – in most cases first name is probably not a good unique identifier / unique key for a person. Add a last name of Taub and you’ve brought the total way down to 2,993. But the combination of first name + last name is still a really bad unique identifier. To drive home the point, one of the other Ben Taubs financed a major hospital in Houston, Ben Taub Hospital. Once again, this was another Ben Taub, not me (although, for the record, I did once meet people who knew him).
While matching technologies like Golden Record can be used to match any kind of data (products, vehicles, locations…), their major application always seems to come back to matching people. And, when matching people, assuming that we don’t have social security numbers, most folks default to relying on a person’s email address as their unique key. Bad idea.
Here are a few reasons why email address makes a really bad key for most applications…
Why email address makes a really bad key
People share email addresses
I recently completed a hugely rewarding consulting gig with The Henry Ford, a major Detroit museum, educational, and cultural institution (you really should check it out!). One of our goals was to identify patrons but, as we quickly found out, email address was a horrible key. Why? Well it turned out that multiple people shared email addresses. Mothers provided information about other family members but used their own email addresses. Fathers established family accounts for multiple people using their email addresses. Grandparents sometimes even entered the picture.
The point is, we had no way of uniquely identifying individuals using email address alone.
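If you want to check how bad this problem is in your own data, a quick test is to group records by email address and look for addresses attached to more than one name. The sketch below is only an illustration: the field names (`first`, `last`, `email`) and the sample patrons are assumptions, not the actual data or schema from that project.

```python
from collections import defaultdict

def shared_emails(records):
    """Return {email: set of (first, last)} for addresses used by more than one name."""
    people_by_email = defaultdict(set)
    for r in records:
        # Trim and lowercase so "Mom@example.com" and "mom@example.com " group together.
        email = (r.get("email") or "").strip().lower()
        if email:  # skip records with no email address at all
            people_by_email[email].add((r["first"], r["last"]))
    return {e: names for e, names in people_by_email.items() if len(names) > 1}

# Hypothetical patron records, invented for this example.
patrons = [
    {"first": "Pat", "last": "Doe", "email": "Mom@example.com"},
    {"first": "Sam", "last": "Doe", "email": "mom@example.com "},
    {"first": "Lee", "last": "Roe", "email": "lee@example.com"},
    {"first": "Kit", "last": "Poe", "email": ""},
]

shared = shared_emails(patrons)
print(shared)
```

Here the only address that comes back is the one Pat and Sam share. Any address that shows up in the result cannot, by itself, tell you which person you are looking at.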
People have multiple email addresses
I’ll bet you have at least two email addresses. According to this website, the average person actually has just under two. Which one represents you? They both do, of course, but it’s hard to work with two identifiers for one entity.
People change email addresses
I’ll also bet that your email address today is not your first ever email address. If I attached all of your sales records to your email address and you subsequently changed that address, I’d lose the connection between you and your purchase history. In other words, email address is not a great way to identify you. (As I noted in this blog post, Golden Record includes technology to overcome this problem).
Some people still don’t have email addresses
Believe it or not, some folks still don’t have email addresses. More frequently, they won’t share them. How many times have you entered email@example.com into a web form in order to keep yourself off of spam email lists? I know you have. I certainly have.
How does Golden Record find unique people?
Our Golden Record matching approach is totally configurable. So, if you want to match person records across data sets based on email address, you can. But, for most applications, I wouldn’t recommend it. Assuming you don’t already have a good identifier, like a customer number or a social security number, I’d recommend experimenting with a combination of fields. A combination that has worked really well for us in the past is to treat two person records from different sources as a match when they have…
- The same first name, and
- The same last name, and
- The same value for at least one of the following items:
  - Email address
  - Telephone number
  - Mailing address
The trick is to know something about your data and use that knowledge to find a formula that maximizes correct matches while eliminating false matches.
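The rule described above (same first name, same last name, and at least one shared contact detail) can be sketched in a few lines of Python. To be clear, this is not how Golden Record is implemented; it is a simplified illustration, and the field names and the `normalize` helper are assumptions made for the example.

```python
CONTACT_FIELDS = ("email", "phone", "address")  # assumed field names

def normalize(value):
    """Lowercase and collapse whitespace so trivial differences don't block a match."""
    return " ".join(str(value or "").lower().split())

def same(a, b, field):
    """True only if both records have a non-blank, equal value for `field`."""
    va, vb = normalize(a.get(field)), normalize(b.get(field))
    return va != "" and va == vb

def is_match(a, b):
    """Same first name AND same last name AND at least one shared contact field."""
    return (
        same(a, b, "first")
        and same(a, b, "last")
        and any(same(a, b, f) for f in CONTACT_FIELDS)
    )

# Hypothetical records, invented for this example.
crm = {"first": "Ben", "last": "Taub", "email": "ben@old.example", "phone": "555-0100"}
web = {"first": "ben", "last": "TAUB", "email": "ben@new.example", "phone": "555-0100"}
other = {"first": "Ben", "last": "Taub", "email": "other@example.com", "phone": ""}

print(is_match(crm, web))    # True: names agree and the phone number is shared
print(is_match(crm, other))  # False: names agree but no contact detail is shared
```

Notice that blank values never count as a match; otherwise two records with missing phone numbers would "share" a phone number. Real data would also need more normalization than this, such as stripping punctuation from phone numbers and standardizing mailing addresses.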
So, what is email address a good key for?
Ok, I’ve already run on too long but, since I’m going, is email address a good unique key for anything? I can think of one thing – if you’re analyzing data about that email address. For example, if you wanted to study how many emails the average email address (but not the average person!) receives in a day, email address makes the perfect key!
OK, now I’m done.
If you’re building a data lake and need to tie together your data sources, or have any other data matching need, let’s talk! I think we can make your life easier. You can email me directly at Benjamin.Taub@Dataspace.com (not a good unique key for me, BTW).
Thanks for reading!