Using the KingswaySoft Duplicate Detector
KingswaySoft recently released their version 1.0 of the Productivity Pack which has some awesome features – and I will focus on one in particular in this blog post. The Duplicate Detector. I’ve already used this tool on a few customers, and it makes finding potential duplicates in data a lot easier.
I’ve used it in two different scenarioes:
- Scanning for existing duplicates in a CRM system
- Scanning for potential duplicates from import data sheet, where I compared the data to the CRM system.
In the first scenario, the setup is rather straight forward. You need a source, the duplicate detector component and a result destination (I’ve prefer using an Excel sheet for this).
The duplicate detection is setup in the fasion, where you select the column(s) you wish to match on. You can then select the matching type and the precision (1 to 100%). Most of the match types are self-explanatory, but I would suggest you try the different ones, and see which one works for your scenario. If you are look for account name duplicates, then I would ofcause suggest using the Company Name Match.
The most important part of the duplicate detection is however the results, and how those are presented. The way KingswaySofts duplicate detector does it, is with a Group Code. If I have two accounts called AA and A.a, those would be grouped with the same GroupCode – fx 154566121. This is also the matching criteria, which you can sort the duplicates on.
GroupCode | Status | AccountId | Name |
100536828 | Active | {DCE400A1-8335-E611-80E1-5065F38ADA41} | A.a |
100536828 | Active | {F98976AB-3594-E611-80F1-5065F38A5A21} | AA |
1005414176 | Active | {0A5ED241-8335-E611-80E1-5065F38ADA41} | BOSS GmbH |
1005414176 | Active | {83BF2D9C-6C97-E611-80EE-5065F38ADAE1} | GmbH BOSS |
1041043548 | Active | {77F64167-8235-E611-80E1-5065F38ADA41} | Bbas |
1041043548 | Active | {C700A7AF-7397-E611-80F0-5065F38B4681} | BBAS SERVICES |
What you do with the results, is up to you. It can be deleted, merged, or something completely different.
In the second scenario, the data are not yet in CRM, but we still want to compare the data to the data in CRM. I will need two sources, and a Union all to combine them. I mash all the data through the Duplicate detector, to get the results I need.
When using this method, though, I will get alot more rows than I need, since the CRM source checks the whole CRM account database – but I only want the ones connected to my import data.
To get those results, I have anoter step in my job, which filters the duplicates:
SELECT A1.Name, A1.AccountId, A1.GroupCode FROM [Duplicates_Full$] as A1 WHERE A1.GroupCode IN (SELECT A2.GroupCode FROM [Duplicates_Full$] as A2 WHERE A2.AccountId IS NULL) AND (SELECT COUNT (*) FROM [Duplicates_Full$] A3 WHERE A3.GroupCode = A1.GroupCode GROUP BY A3.GroupCode) > 1
With that query in the Excel source, I can filter all those duplicates which have at least one row without an AccountId (are from the import data).
You can download the KingswaySoft duplicate detector and all the other wonderful toolkits in their productivitypack on their website here.