Data Lake: Why record matching is critical to success
Today’s data lakes are like yesterday’s operational data stores (ODSs): Everyone has one and everyone means something different when they use the term. In my experience, a data lake is a place where analysts and data scientists can store and analyze data. A lot of that data comes from internal systems, like CRM and sales systems. Frequently, a lot more data in the lake comes from outside sources, like purchased demographics and census data.
However, not all data sources provide you with a reliable key you can use to match records across them. While it would be great if every vendor and data source could give you records with your customer ID number on them, this rarely happens. So, how do you know if the Bob Smith in your purchased list of auto owners by county is the same Bob Smith who's in your sales system? In fact, you may have multiple Bob Smiths in your sales system, and some of them may actually be duplicate records.
The problem isn’t limited to data you get from outside sources. I’ve seen situations where a client has multiple internal systems, each containing customer data, and none sharing a common key.
So, to tell that Bob Smith (customer number 13B72) is also Robert Smith (Porsche owner at 1362 Elm Street), you need effective matching. Luckily, this is exactly the kind of problem Golden Record solves!
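To make the Bob Smith problem concrete, here is a minimal Python sketch of one common approach: normalize names and addresses, then score how similar two records are. This is only an illustration, not a description of how Golden Record works internally. The nickname table, the street abbreviations, the field names, the address on the sales record, the weights, and the 0.85 threshold are all assumptions I made up for the example.

```python
from difflib import SequenceMatcher

# Hypothetical lookup tables; real ones would be far larger.
NICKNAMES = {"bob": "robert", "rob": "robert", "bill": "william"}
STREET_ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}

# Assumed cutoff; in practice you would tune this against known matches.
MATCH_THRESHOLD = 0.85


def normalize_name(name):
    """Lowercase a name and expand known nicknames to a canonical form."""
    words = name.lower().replace(".", "").split()
    return " ".join(NICKNAMES.get(word, word) for word in words)


def normalize_address(address):
    """Lowercase an address, drop punctuation, and expand street abbreviations."""
    words = address.lower().replace(".", "").replace(",", "").split()
    return " ".join(STREET_ABBREVIATIONS.get(word, word) for word in words)


def similarity(a, b):
    """Return a 0.0 to 1.0 similarity ratio between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def match_score(rec_a, rec_b, name_weight=0.5, address_weight=0.5):
    """Weighted average of name similarity and address similarity."""
    name_sim = similarity(normalize_name(rec_a["name"]),
                          normalize_name(rec_b["name"]))
    address_sim = similarity(normalize_address(rec_a["address"]),
                             normalize_address(rec_b["address"]))
    return name_weight * name_sim + address_weight * address_sim


# Illustrative records; the sales-system address is invented for this example.
sales_record = {"customer_id": "13B72", "name": "Bob Smith",
                "address": "1362 Elm St."}
purchased_record = {"name": "Robert Smith", "address": "1362 Elm street"}
other_record = {"name": "Bob Smith", "address": "48 Oak Avenue"}

for candidate in (purchased_record, other_record):
    score = match_score(sales_record, candidate)
    verdict = "match" if score >= MATCH_THRESHOLD else "no match"
    print(candidate["name"], candidate["address"], round(score, 2), verdict)
```

After normalization, the purchased record and the sales record agree exactly on both name and address, so that pair scores 1.0. The second Bob Smith shares the name but lives somewhere else, so he falls below the threshold. A sketch like this ignores plenty of real-world trouble (typos in nicknames, moved customers, households with a Robert Sr. and a Robert Jr.), which is why dedicated matching tools exist.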
If you’re building a data lake and need to tie together your data sources, let’s talk! I think we can help you solve your problem. You can email me directly at Benjamin.Taub@Dataspace.com.
Thanks for reading!