Saturday, January 10, 2009

Finding Duplicates in SQL

First off, credit for this goes to Pete Freitag here: http://www.petefreitag.com/item/169.cfm

Anyways, if I haven't mentioned it before, I currently am a jack-of-all-trades at my place of employment (Digital Gateway). It's an ERP software company that mainly focuses on the copier industry. I currently work in support, but additionally handle training duties as well as some implementation and conversion duties. Officially when I graduate I will move full-time into implementation and conversions. I'm the highest you can get in our support department without going to a manager or a development team member. That means I work a lot with MSSQL, because that's the backend our software sits on.

Earlier this week, I had a ticket where the customer needed some custom SQL scripts run, as they had incorrectly labeled over 2,000 items, inserting a ',' instead of a '-' in their item numbers.

Obviously this is easy, right? Simply run a SQL replace statement:

UPDATE Items
SET ItemNumber = replace(ItemNumber,',','-')
WHERE ItemNumber LIKE '%,%'

Problem was that out of the 70,000 items in their database, they already had items that they had entered with hyphens, so running the script wouldn't work, because ItemNumber can't be duplicated. So, after fighting for a while, I ended up doing the following:

--DROP TABLE #ItemNumbers

CREATE TABLE #ItemNumbers(
ItemNumber VARCHAR(100) NOT NULL,
NewItemNumber VARCHAR(100) NOT NULL)

INSERT INTO #ItemNumbers(ItemNumber, NewItemNumber)
SELECT ItemNumber, REPLACE(ItemNumber,',','-')
FROM Items

Cool thing about this is that I now had a new temp table with only the old item number and the new item number, and if the item number didn't have a comma in it, then the ItemNumber and NewItemNumber were the same. So all that was left was finding the duplicates in the NewItemNumber row. That's where Pete came in. Modifying his query, I came up with:

SELECT ItemNumber,
COUNT(NewItemNumber) AS NewItemCount
FROM #ItemNumbers
GROUP BY ItemNumbers
HAVING ( COUNT(NewItemNumbers) > 1 )

and voila! I had it. Now all I needed to do was modify these 10 items and then run my original script.

No comments: