Print this page

Removing Duplicates from CSV file prior to importing

Knowledge Article Number 000182414
Description
Prior to importing data into Salesforce, you might want to do a clean up of your csv. files and make sure that there isn't any duplicate information in it. 

Note: Every scenario is different so before following the instructions on this article, we recommend that you take a quick look to make sure that it applies to your specific case. For example, these steps assume that most rows of your csv. file, will have some sort of unique identifier (i.e email, phone number, social security, etc).

Scenario 1 - Email address is used as unique identifier

A. Steps to identify duplicate email addresses:

1. Add another column next to the Email column on your csv file and name it something like "Duplicate Email". (Lets assume the original Email column is column C and the "Duplicate Email" column is D)

2. Sort the Original Email column alphabetically from A-Z.

3. On the "Duplicate Email" column, create the following formula  =IF(C2=C3,"Duplicate Email", "Unique")

4. Hover your mouse on the lower right corner of the selected cell, until your mouse becomes a thin cross. Then, drag the values down to the last row of the column. 

What does the formula do?

The formula will display "Duplicate Email" if the email address below the current email address (in column C), is the same as the current email address. For example:

 
Original EmailDuplicate Email
report@mytest.comDuplicate Email
report@mytest.comUnique
folder@mytest.comUnique
home@mytest.comUnique

Notice that the first row has a duplicate value below it, therefore the formula displays "Duplicate Email".
 Now what you need to do is hard code the values of this column, so that the they will not change when you sort it. To hard code this column.

1. Click on column D (to select all the values on this column).

2. Right click on a highlighted area and select Cut.

3. Insert a new column next to "Duplicate Email" ,right click the first row and left click on Paste Special.

4. Select Values and click the OK button.

5. Delete the "Duplicate Email" column (where the formula is written)  as it is no longer needed. 

Once you have done this, sort the order of the new column alphabetically from A-Z and erase all the those rows that have the value "Duplicate Email".

Alternate method: Conditional formatting and sorting (Excel 2007)

1. Select the column that contains unwanted duplicate values.

2. From the Home tab, select Conditional Formatting > Highlight Cells Rules > Duplicate Values... > OK. All duplicate values will be highlighted in red.

3. Select the entire worksheet.

4. Still on the Home tab, select Sort & Filter > Custom Sort...

5. Select the column with duplicates for Sort By, Cell Color for Sort On > OK. Unique values will sort to the top of the sheet, leaving duplicates below.

7. Review the duplicate values in red and delete any unwanted rows.

Scenario 2 - No email address specified, other fields (phone number, name, etc) are used as unique identifiers. 

If you don't have a unique identifier, you may try to use the first and last name to find possible duplicates.

The following scenario, assumes if two or more contacts, share the same First Name AND Last Name, they are the same person, therefore these would be duplicate contacts.

It also assumed that the First Name and Last Name are in different columns.  

A. You first need to concatenate the First Name and Last Name. 

1. Add a new column called "Full Name" next to the First Name and Last Name columns. 

2. On the "Full Name" column, add the following formula (Assuming that First Name is column A and Last Name is column B)

Formula: =Concatenate(A2," ",B2)

What does the formula do?

The formula concatenates the First Name and Last Name in a single column.
 
First NameLast NameFull Name
JohnSmithJohn Smith
JohnDoeJohn Doe
AlexYuAlex Yu
AlexYuAlex Yu

B. You need to hard code the values of the "Full Name" column, so that they will not change when you sort it. To hard code this column.

1. Click column C (the "Full Name" column) to select all the values on this column.

2. Right click on a highlighted area and select Cut.

3. Insert a new column next to "Full Name" ,right click the first row and left click on Paste Special.

4. Select Values and click the OK button.

5. Delete the original "Full Name" column (where the formula is written)  as it is no longer needed. 

C. Steps to identify duplicate names

1. Add another column next to the "Full Name" column and name it something like "Duplicate Name".

2. Sort the "Full Name" column alphabetically from A-Z.

3. On the "Duplicate Name" column, create the following formula  =IF(C2=C3,"Duplicate Name", "Unique")

4. Hover your mouse on the lower right corner of the selected cell, until your mouse becomes a thin cross. Then, drag the values down to the last row of the column. 

D.  You need to hard code the values of the "Duplicate Name" column, so that the they will not change when you sort it. To hard code this column.

1. Click column D (the "Duplicate Name" column) to select all the values on this column.

2. Right click on a highlighted area and select Cut.

3. Insert a new column next to "Duplicate Name" ,right click the first row and left click on Paste Special.

4. Select Values and click the OK button.

5. Delete the original "Duplicate Name" column (where the formula is written)  as it is no longer needed. 

Once you have done this, sort the order of the new column alphabetically from A-Z and erase all the those rows that have the value "Duplicate Name".

Note: The alternate method using conditional formatting described above can be used in place steps C and D.




promote demote