Book contents
5 - Basic Data Cleansing
Published online by Cambridge University Press: 09 November 2021
Summary
Cleansing personal data
We have focused a lot on classification and normalisation, but what about other data cleansing such as supplier or customer addresses? I have seen first hand how many duplicates can occur if you’re working with multiple systems.
I had a client with nine different sources of data, which when merged totalled 2.8 million rows of customer information. When I cleansed this, it reduced to 1.3 million rows of data. I certainly didn't do this in Excel, and there are a number of tools available that can help, but if you’re on a budget or need to get something done relatively quickly, I can show you how to achieve this using Excel.
It would be hard to cover all the different formats that addresses can come in. Quite often you will see a mixture of formats, such as everything in one cell or spread across many cells or with the information in the wrong column. I’ll be focusing on names, addresses and e-mails for this exercise, but you will most likely have telephone numbers as well.
I’ll share as an example what I’m going to call my sample customer list (Figure 5.1 on the next page). It's got a name and address and an e-mail column and there is a mixture of upper and lower case data, as well as sentence case. This is not unusual and I see this in a lot of files. Again, it's down to creating some standards and making sure everyone applies them.
As all the formulas used for this process are in previous chapters, there will be less illustrations in this chapter, but feel free to refer back to Chapters 1, 2 and 4 if you’d like some visual support.
Cleansing names in Excel
There are a number of different ways to approach this and it very much depends on what you need for your business. In my sample file, the names are all in one column. Is that how you want them to stay? Or do you want to split them out into first name and last name? This is most likely something that is decided elsewhere in the business, but it doesn't mean you can't use their standards and formatting and be consistent.
- Type
- Chapter
- Information
- Between the SpreadsheetsClassifying and Fixing Dirty Data, pp. 99 - 110Publisher: FacetPrint publication year: 2021