dbTalk Databases Forums  

Full Text Seach Performance

microsoft.public.sqlserver.fulltext microsoft.public.sqlserver.fulltext


Discuss Full Text Seach Performance in the microsoft.public.sqlserver.fulltext forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Ahsan
 
Posts: n/a

Default Full Text Seach Performance - 08-23-2004 , 04:53 AM






Operating System: Windows 2000 Server
Server Brand: DELL (Intel Penium 111, 1133 MHz)
Sql Server Standadrd Edition
Ram: 2GB
Processor: 1
Fulltext Index Size: 52MB
Number Of Rows In Table: 1.2 Million Aprox
SQL SERVER MEMORY SETTINGS: FIXED, 1945 MB
MINIMUM QUERY MEMORY SETTING: 1048576 KB
Average Records Processing Per Minute : 7.2
Average Records Processing Per Hour : 72
Target NUmber Of Rows: 1.1 Million

stored procedure i am using is used to remove duplications from the contacts
list, I altered the few parts of stored procedure and checked every statement
using profiler,the only one taking long time to execute is the FULL TEXT
SEARCHES one,


SELECT Contact_ID From Contact22 con
WHERE
con.Gender = @vOuterGender
AND (Con.Reported = 0 or con.reported is null)
AND Contains(con.zip,@vOuterZip)
AND Freetext(Address, @vOuterAddress)
AND Freetext(Con.Full_Name,@vOuterName)


as i mentioned that I have 1.1 million rows table on which fulltext is
enabled on three cloumns

Address,zip and Fullname (total size after full population is 52MB)

so what all my stored procedure does is take every record from the contact
table and matches it against others presnet in the table(1.1 million rows)
using fulltext searches, now these full text searches are taking time!!!


Currrenlty with all the settings I mentioned above my virtual memory is set
to 1.5 times then the actual RAM, my database and full text indexes are on
the same drive(thats the only choice) and it is execting at the speed of

20000 rows in 24 hours which means it required almost 55 days to finish,
unacceptable.....
can any one tell me that how to improve the performance in a senario like
mine for full text seaches? or is it the standard speed every one gets from
full text searches ?


Reply With Quote
  #2  
Old   
Hilary Cotter
 
Posts: n/a

Default Re: Full Text Seach Performance - 08-23-2004 , 08:01 AM






I'm not sure if full text search is the best tool for what you are doing.

The type of fuzzy search that SQL FTS uses while doing FreeText searches
will do linguistic stemming and expand the search to singular and plural
forms of the search phrase. So it seems to me you are doing a freetext
search to match

Mr. John Q. Public with J.Q.Public, J. Q. Public, John Public, etc. It
simply won't work for address and name.


It also seems that you are executing the query once for each entry you have
in your table. After you return matches, do you prune/consolidate these
matches from the table?

I also don't understand why you are doing a Contains query on the zip code?
Surely this should be an equality match or perhaps a match with left(zip,5)
=left(@vOuterZip, 5), or perhaps a case statement so you could avoid the
left(zip,5) and merely search on zip.

Lastly I strongly urge you to have a look at commercial products which do
exactly what you are looking for and have logic built into them to handle
all the possible cases of street name combination, or name variants, as well
as compensating for spelling errors, like the product in the below link.

http://www.name-searching.com/Correct_Address.html

--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html


"Ahsan" <Ahsan (AT) discussions (DOT) microsoft.com> wrote

Quote:
Operating System: Windows 2000 Server
Server Brand: DELL (Intel Penium 111, 1133 MHz)
Sql Server Standadrd Edition
Ram: 2GB
Processor: 1
Fulltext Index Size: 52MB
Number Of Rows In Table: 1.2 Million Aprox
SQL SERVER MEMORY SETTINGS: FIXED, 1945 MB
MINIMUM QUERY MEMORY SETTING: 1048576 KB
Average Records Processing Per Minute : 7.2
Average Records Processing Per Hour : 72
Target NUmber Of Rows: 1.1 Million

stored procedure i am using is used to remove duplications from the
contacts
list, I altered the few parts of stored procedure and checked every
statement
using profiler,the only one taking long time to execute is the FULL TEXT
SEARCHES one,


SELECT Contact_ID From Contact22 con
WHERE
con.Gender = @vOuterGender
AND (Con.Reported = 0 or con.reported is null)
AND Contains(con.zip,@vOuterZip)
AND Freetext(Address, @vOuterAddress)
AND Freetext(Con.Full_Name,@vOuterName)


as i mentioned that I have 1.1 million rows table on which fulltext is
enabled on three cloumns

Address,zip and Fullname (total size after full population is 52MB)

so what all my stored procedure does is take every record from the contact
table and matches it against others presnet in the table(1.1 million rows)
using fulltext searches, now these full text searches are taking time!!!


Currrenlty with all the settings I mentioned above my virtual memory is
set
to 1.5 times then the actual RAM, my database and full text indexes are on
the same drive(thats the only choice) and it is execting at the speed of

20000 rows in 24 hours which means it required almost 55 days to finish,
unacceptable.....
can any one tell me that how to improve the performance in a senario like
mine for full text seaches? or is it the standard speed every one gets
from
full text searches ?




Reply With Quote
  #3  
Old   
Hilary Cotter
 
Posts: n/a

Default Re: Full Text Seach Performance - 08-25-2004 , 10:02 AM



this is absolutely the wrong way to do this.

It will never match rd with road, or street with st or steert with street,
ave with avenue, John Smith, with J Smith, etc.

The strategy for handling this sort of a problem is to first clean up your
data, by changing all occurences of St. or st, to Street, similar with Road,
Avenue, Highway, etc. You will have to do something to put the names into a
common format.

After you have done this, you should start to group data by zip code, and
then examine the groupings to scrub further and remove duplicates. Then
group by city, examine the groupings and scrub further and remove
duplicates. Then group by street and scrub further and remove duplicates.

Then you can do the same procedure on last name to try to find people with
the same names, but who have moved.

Again I believe SQL FTS is the wrong tool for this, and there are commercial
programs out there which can do much better jobs faster then you could.

--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html


"Ahsan" <Ahsan (AT) discussions (DOT) microsoft.com> wrote

Quote:
Thanks for the reply Hilary,
Yes I am using this procedure to catch possible duplicates, as our web
site
allows a customer to enter its details more then one time, this leads to
duplicate records, and they can have transactions related against
different
customers ID's, My goal is to give marketing department a clean mailing
list
for their mail shots and merge related records of customers (as a customer
can be involved in differ rent processes under different ID's) so they
can
query using different criteria's, for this purpose I have created Full
Text
Indexes on 3 columns of the table in which My Contact information is
present,

Address
Zip
FullName

Now this table contains multiple entries for one customer ( as it is
allowed
by design and according to business requirements)

Example:

CUTOMER ID NAME ADDRESS ZIP
1 xyz 88 Corp Road SE4 1FG
2 xyz 88 Corp RD SE4 1 FG

Note: Road and RD, SE4 1NF and SE4 1 NF

so My try is to catch these matched and other records through full text
search, query through their related records in other tables Like their
orders, brochure requests and their participations in different events and
create two final tables

FinalContact (Master)
DenormalizedContactDetails (Detail)

With FinalContact populated with unique customers(no duplications) and
DenormalizedContactDetails with the activity related detail along with
their
duplicates ID's, I have tried this stored procedure with sample records
and
it works really fine and I am getting my desired output, the only problem
is
that full text search under cursor loop is taking to much time.
As you can see in the store procedure that I consolidates the matches as
well.
As I mentioned in records example that data quality for Zip is not good so
I
have to use Contain for it
We are already dealing with a company to remove duplication from our CRM,
but I am just working on providing mailing list so I have to deal with it
as
quick as possible

STORE PROCEDURE:

IF EXISTS (SELECT name FROM sysobjects
WHERE name = 'stp_PopulateDedupedContactsAlteredWithDetail' AND
type = 'P')
DROP PROCEDURE stp_PopulateDedupedContactsAlteredWithDetail
GO


CREATE PROCEDURE stp_PopulateDedupedContactsAlteredWithDetail
AS
SET QUOTED_IDENTIFIER OFF
SET NOCOUNT ON



DECLARE @vOuterContactID binary(8)
DECLARE @vOuterAddress varchar(50)
DECLARE @vOuterZip varchar(14)
DECLARE @vOuterName varchar(30)
DECLARE @vOuterGender varchar(6)
DECLARE @vOuterZipTemp varchar(14)

DECLARE @vContactID binary(8)
DECLARE @vPriorityCode int
DECLARE @vTitle varchar(9)
DECLARE @vFirstName varchar(30)
DECLARE @vLastName varchar(30)
DECLARE @vCity varchar(40)
DECLARE @vState varchar(30)
DECLARE @vCountry varchar(35)
DECLARE @vEmail varchar(50)
DECLARE @vPhone varchar(25)
DECLARE @vCell varchar(25)
DECLARE @vDoNotMail tinyint
DECLARE @vDoNotMobile tinyint
DECLARE @vDoNotAll tinyint
DECLARE @vDoNotEmail tinyint
DECLARE @vDPAConsent tinyint
DECLARE @vDPAConsentTP tinyint
DECLARE @vDPAQasked tinyint

Create Table #tTemp (
DedupId int IDENTITY (1, 1) NOT NULL,
Address varchar(50),
Zip varchar(12),
FullName varchar(30),
NTimes int,
Gender varchar(10),
Reported tinyint NULL
)
Insert into #tTemp
select con.Address,
con.ZIP,
con.full_name As "Full Name",
COUNT(con1.contact_Id) as "Number of Times In Contact Table",
con.Gender
from contact22 con
FUll Outer JOIN Contact22 con1 ON
con.zip = con1.zip
and
con.address = con1.address
and
con.full_name = con1.full_name
WHERE
con.contact_id = con1.contact_Id
AND con.reported = 0
group by con.Address,con.zip,con.full_name,Con.Gender
order by Con.Address,COUNT(con1.contact_Id) desc

Update #tTemp SET Reported = 0

CREATE NONCLUSTERED INDEX Temp_dedup_Index
ON #tTemp (DedupId)
WITH FILLFACTOR = 100



CREATE NONCLUSTERED INDEX Temp_Address_Index
ON #tTemp (Address)
WITH FILLFACTOR = 100

CREATE NONCLUSTERED INDEX Temp_Zip_Index
ON #tTemp (Zip)
WITH FILLFACTOR = 100


CREATE NONCLUSTERED INDEX Temp_FullName_Index
ON #tTemp (FullName)
WITH FILLFACTOR = 100

CREATE NONCLUSTERED INDEX Temp_Reported_Index
ON #tTemp (Reported)
WITH FILLFACTOR = 100



DECLARE @vLoop tinyint
SET @vLoop = 0




WHILE @vLoop =0
BEGIN

IF EXISTS (select count(contact_id) from contact22 where Reported= 0)
BEGIN

DECLARE cr_OuterLoop CURSOR FOR
SELECT TOP 1 Address,ZIP,FullName,Gender FROM #tTemp where Reported = 0
OPEN cr_OuterLoop
FETCH NEXT FROM cr_OuterLoop INTO
@vOuterAddress,@vOuterZip,@vOuterName,@vOuterGende r

WHILE (@@FETCH_STATUS <> -1)
BEGIN
SET @vOuterZipTemp = @vOuterZip
SET @vOuterZip = '"' + @vOuterZip + '"'
IF (@@FETCH_STATUS <> -2)

-- SET @vTemp =(select count(DedupId) from #tTemp where Reported reported
= 0)
-- Update DedupProcessesExecutionTime SET TimesExecuted = TimesExecuted +
1
--, Reported = @vTemp

DELETE FROM TempDdpSup

INSERT INTO TempDdpSup
SELECT Contact_ID From Contact22 con
WHERE
con.Gender = @vOuterGender
AND (Con.Reported = 0 or con.reported is null)
AND Contains(con.zip,@vOuterZip)
AND Freetext(Address, @vOuterAddress)
AND Freetext(Con.Full_Name,@vOuterName)

SELECT TOP 1 @vContactID = Con.Contact_Id,@vPriorityCode =
Con.Priority_Code,@vTitle = Con.Title,@vFirstName = Con.First_Name,
@vLastName = Con.Last_Name,@vCity = Con.City,@vState =
Con.State_,@vCountry=Con.Country,
@vEmail = Con.Email,@vPhone = Con.Phone,@vCell = Con.Cell,@vDoNotEmail
= Con.Do_Not_Use_Email,
@vDoNotMobile = Con.Do_Not_Use_Mobile, @vDoNotAll = Con.Do_Not_Use_All,
@vDoNotMail = Con.Do_Not_Mail,@vDPAConsent = Con.DPA_Consent,
@vDPAConsentTP = Con.DPA_Consent_TP,@vDPAQasked = Con.DPA_QAsked
FROM
Contact22 Con
WHERE
Con.Contact_id In (SELECT contact_id FROm TempDdpSup)
ORDER BY
Con.Rn_Create_Date DESC


INSERT INTO FinalContact
SELECT TOP 1

@vContactID,@vPriorityCode,@vTitle,@vFirstName,@vL astName,@vOuterName,@vOute
rGender,@vOuterAddress,@vOuterZipTemp,@vCity,
Quote:

@vState,@vCountry,@vEmail,@vPhone,@vCell,@vDoNotEm ail,@vDoNotMobile,@vDoNotA
ll,@vDoNotMail,@vDPAConsent,
Quote:
@vDPAConsentTP,@vDPAQasked,
NULL,NULL,0,
MAX(Ord.Order_Date) AS RecentPurchaseDate,
CASE Ord.Status WHEN 'Payment Taken' THEN Count(Ord.Order__Id) ELSE 0
END AS NumberOfOrders,
-- MAX(Bro.Sent_Date) AS RecentBrochureDate,
MAX(Book.Start_Date) AS RecentparticipantDate,
MAX(Bro.Request_Date) AS RecentBrochureRequestDate,
0
FROM TempDdpSup Con
LEFT OUTER JOIN Order_ Ord ON Con.Contact_Id = Ord.Bill_To_Contact_Id
LEFT OUTER JOIN Brochure Bro ON Con.Contact_Id = Bro.Contact_Id
LEFT OUTER JOIN Booking Book ON Con.Contact_Id = Book.Contact_Id
GROUP BY
Ord.Status
ORDER BY
MAX(Ord.Order_Date) DESC,
-- MAX(Bro.Sent_Date) DESC,
MAX(Book.Start_Date),
MAX(Bro.Request_Date)
--update #tTemp so next time looped row will not be selected
UPDATE #tTemp SET Reported = 1 WHERE
zip = @vOuterZipTemp
AND Address=@vOuterAddress
AND FullName =@vOuterName
AND Gender = @vOuterGender
--Update contact2 so next time these rows will not be selected
UPDATE Contact22 SET Reported = 1 WHERE
Contact_Id IN (SELECT contact_id FROm TempDdpSup)



--CODE FOR CONTACT DEATIL STARTS HERE
INSERT INTO DenormalizedContactDetails
SELECT @vContactID
,TempCon.contact_Id,Ord.Order_Date,Cha.Channel_Ide ntifier,
NULL,NULL,Ord.Sub_Total
FROM TempDdpSup TempCon
INNER JOIN Order_ Ord ON TempCon.Contact_Id = Ord.Bill_To_Contact_Id
INNER JOIN RLD_Order_Channels Cha ON Ord.Order_Channel_Id =
Cha.RLD_Order_Channels_Id
--insert related Brochure
INSERT INTO DenormalizedContactDetails
SELECT @vContactID,TempCon.contact_Id,NULL,NULL,NULL,
Bro.Request_Date,0
FROM TempDdpSup TempCon
INNER JOIN Brochure Bro ON TempCon.Contact_Id = Bro.Contact_Id
--insert related bookings (participent)
INSERT INTO DenormalizedContactDetails
SELECT @vContactID,TempCon.contact_Id,NULL,NULL,
Book.Start_Date,NULL,0
FROM TempDdpSup TempCon
INNER JOIN Booking Book ON TempCon.Contact_Id = Book.Contact_Id
/* --insert related recipient dates
INSERT INTO DenormalizedContactDetails
SELECT @vContactID,TempCon.contact_Id,NULL,NULL,
NULL,NULL,OrdD.Rn_Create_Date,0
FROM TempDdpSup TempCon
INNER JOIN _Order_Detail OrdD ON TempCon.Contact_Id = OrdD.Recipient
*/
--CODE FOR CONTACT DEATIL ENDS HERE



FETCH NEXT FROM cr_OuterLoop INTO
@vOuterAddress,@vOuterZip,@vOuterName,@vOuterGende r
END
CLOSE cr_OuterLoop
DEALLOCATE cr_OuterLoop
END
ELSE
BEGIN
SET @vLoop = 1
END
END



GO



I open for all the suggestions
Ahsan
"Hilary Cotter" wrote:

I'm not sure if full text search is the best tool for what you are
doing.

The type of fuzzy search that SQL FTS uses while doing FreeText searches
will do linguistic stemming and expand the search to singular and plural
forms of the search phrase. So it seems to me you are doing a freetext
search to match

Mr. John Q. Public with J.Q.Public, J. Q. Public, John Public, etc. It
simply won't work for address and name.


It also seems that you are executing the query once for each entry you
have
in your table. After you return matches, do you prune/consolidate these
matches from the table?

I also don't understand why you are doing a Contains query on the zip
code?
Surely this should be an equality match or perhaps a match with
left(zip,5)
=left(@vOuterZip, 5), or perhaps a case statement so you could avoid the
left(zip,5) and merely search on zip.

Lastly I strongly urge you to have a look at commercial products which
do
exactly what you are looking for and have logic built into them to
handle
all the possible cases of street name combination, or name variants, as
well
as compensating for spelling errors, like the product in the below link.

http://www.name-searching.com/Correct_Address.html

--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html


"Ahsan" <Ahsan (AT) discussions (DOT) microsoft.com> wrote in message
news:3DB4A82E-023B-43D5-BE38-042B41846B71 (AT) microsoft (DOT) com...
Operating System: Windows 2000 Server
Server Brand: DELL (Intel Penium 111, 1133 MHz)
Sql Server Standadrd Edition
Ram: 2GB
Processor: 1
Fulltext Index Size: 52MB
Number Of Rows In Table: 1.2 Million Aprox
SQL SERVER MEMORY SETTINGS: FIXED, 1945 MB
MINIMUM QUERY MEMORY SETTING: 1048576 KB
Average Records Processing Per Minute : 7.2
Average Records Processing Per Hour : 72
Target NUmber Of Rows: 1.1 Million

stored procedure i am using is used to remove duplications from the
contacts
list, I altered the few parts of stored procedure and checked every
statement
using profiler,the only one taking long time to execute is the FULL
TEXT
SEARCHES one,


SELECT Contact_ID From Contact22 con
WHERE
con.Gender = @vOuterGender
AND (Con.Reported = 0 or con.reported is null)
AND Contains(con.zip,@vOuterZip)
AND Freetext(Address, @vOuterAddress)
AND Freetext(Con.Full_Name,@vOuterName)


as i mentioned that I have 1.1 million rows table on which fulltext is
enabled on three cloumns

Address,zip and Fullname (total size after full population is 52MB)

so what all my stored procedure does is take every record from the
contact
table and matches it against others presnet in the table(1.1 million
rows)
using fulltext searches, now these full text searches are taking
time!!!


Currrenlty with all the settings I mentioned above my virtual memory
is
set
to 1.5 times then the actual RAM, my database and full text indexes
are on
the same drive(thats the only choice) and it is execting at the speed
of

20000 rows in 24 hours which means it required almost 55 days to
finish,
unacceptable.....
can any one tell me that how to improve the performance in a senario
like
mine for full text seaches? or is it the standard speed every one gets
from
full text searches ?








Reply With Quote
  #4  
Old   
John Kane
 
Posts: n/a

Default Re: Full Text Seach Performance - 08-25-2004 , 09:21 PM



Ahsan,
While I agree with Hilary that SQL FTS is not the best solution for this
task of data cleansing, you can use 3rd party software to clean-up and match
your SQL based textual data, for example:

Intelligent Search Technology - Searching and Matching Software
http://www.name-searching.com/

"NameSearch® - Search and matching software for enterprise systems."
"CorrectAddress® - Is (US CASS Certified) software that validates, corrects,
and enhances address information"
"MerlinMerge® Is a sophisticated list management tool that performs
duplicate record detection, merge/purge operations and can produce house
hold link determination."

While I've not (yet) tested this software, you can call their 800 number and
get a free 60-day trial to test it out...

Regards,
John




"Hilary Cotter" <hilaryk (AT) att (DOT) net> wrote

Quote:
this is absolutely the wrong way to do this.

It will never match rd with road, or street with st or steert with street,
ave with avenue, John Smith, with J Smith, etc.

The strategy for handling this sort of a problem is to first clean up your
data, by changing all occurences of St. or st, to Street, similar with
Road,
Avenue, Highway, etc. You will have to do something to put the names into
a
common format.

After you have done this, you should start to group data by zip code, and
then examine the groupings to scrub further and remove duplicates. Then
group by city, examine the groupings and scrub further and remove
duplicates. Then group by street and scrub further and remove duplicates.

Then you can do the same procedure on last name to try to find people with
the same names, but who have moved.

Again I believe SQL FTS is the wrong tool for this, and there are
commercial
programs out there which can do much better jobs faster then you could.

--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html


"Ahsan" <Ahsan (AT) discussions (DOT) microsoft.com> wrote in message
news:CB88EA56-710F-43E8-AFC1-BEDD35CD4A50 (AT) microsoft (DOT) com...
Thanks for the reply Hilary,
Yes I am using this procedure to catch possible duplicates, as our web
site
allows a customer to enter its details more then one time, this leads to
duplicate records, and they can have transactions related against
different
customers ID's, My goal is to give marketing department a clean mailing
list
for their mail shots and merge related records of customers (as a
customer
can be involved in differ rent processes under different ID's) so they
can
query using different criteria's, for this purpose I have created Full
Text
Indexes on 3 columns of the table in which My Contact information is
present,

Address
Zip
FullName

Now this table contains multiple entries for one customer ( as it is
allowed
by design and according to business requirements)

Example:

CUTOMER ID NAME ADDRESS ZIP
1 xyz 88 Corp Road SE4 1FG
2 xyz 88 Corp RD SE4 1 FG

Note: Road and RD, SE4 1NF and SE4 1 NF

so My try is to catch these matched and other records through full text
search, query through their related records in other tables Like their
orders, brochure requests and their participations in different events
and
create two final tables

FinalContact (Master)
DenormalizedContactDetails (Detail)

With FinalContact populated with unique customers(no duplications) and
DenormalizedContactDetails with the activity related detail along with
their
duplicates ID's, I have tried this stored procedure with sample records
and
it works really fine and I am getting my desired output, the only
problem
is
that full text search under cursor loop is taking to much time.
As you can see in the store procedure that I consolidates the matches as
well.
As I mentioned in records example that data quality for Zip is not good
so
I
have to use Contain for it
We are already dealing with a company to remove duplication from our
CRM,
but I am just working on providing mailing list so I have to deal with
it
as
quick as possible

STORE PROCEDURE:

IF EXISTS (SELECT name FROM sysobjects
WHERE name = 'stp_PopulateDedupedContactsAlteredWithDetail' AND
type = 'P')
DROP PROCEDURE stp_PopulateDedupedContactsAlteredWithDetail
GO


CREATE PROCEDURE stp_PopulateDedupedContactsAlteredWithDetail
AS
SET QUOTED_IDENTIFIER OFF
SET NOCOUNT ON



DECLARE @vOuterContactID binary(8)
DECLARE @vOuterAddress varchar(50)
DECLARE @vOuterZip varchar(14)
DECLARE @vOuterName varchar(30)
DECLARE @vOuterGender varchar(6)
DECLARE @vOuterZipTemp varchar(14)

DECLARE @vContactID binary(8)
DECLARE @vPriorityCode int
DECLARE @vTitle varchar(9)
DECLARE @vFirstName varchar(30)
DECLARE @vLastName varchar(30)
DECLARE @vCity varchar(40)
DECLARE @vState varchar(30)
DECLARE @vCountry varchar(35)
DECLARE @vEmail varchar(50)
DECLARE @vPhone varchar(25)
DECLARE @vCell varchar(25)
DECLARE @vDoNotMail tinyint
DECLARE @vDoNotMobile tinyint
DECLARE @vDoNotAll tinyint
DECLARE @vDoNotEmail tinyint
DECLARE @vDPAConsent tinyint
DECLARE @vDPAConsentTP tinyint
DECLARE @vDPAQasked tinyint

Create Table #tTemp (
DedupId int IDENTITY (1, 1) NOT NULL,
Address varchar(50),
Zip varchar(12),
FullName varchar(30),
NTimes int,
Gender varchar(10),
Reported tinyint NULL
)
Insert into #tTemp
select con.Address,
con.ZIP,
con.full_name As "Full Name",
COUNT(con1.contact_Id) as "Number of Times In Contact
Table",
con.Gender
from contact22 con
FUll Outer JOIN Contact22 con1 ON
con.zip = con1.zip
and
con.address = con1.address
and
con.full_name = con1.full_name
WHERE
con.contact_id = con1.contact_Id
AND con.reported = 0
group by con.Address,con.zip,con.full_name,Con.Gender
order by Con.Address,COUNT(con1.contact_Id) desc

Update #tTemp SET Reported = 0

CREATE NONCLUSTERED INDEX Temp_dedup_Index
ON #tTemp (DedupId)
WITH FILLFACTOR = 100



CREATE NONCLUSTERED INDEX Temp_Address_Index
ON #tTemp (Address)
WITH FILLFACTOR = 100

CREATE NONCLUSTERED INDEX Temp_Zip_Index
ON #tTemp (Zip)
WITH FILLFACTOR = 100


CREATE NONCLUSTERED INDEX Temp_FullName_Index
ON #tTemp (FullName)
WITH FILLFACTOR = 100

CREATE NONCLUSTERED INDEX Temp_Reported_Index
ON #tTemp (Reported)
WITH FILLFACTOR = 100



DECLARE @vLoop tinyint
SET @vLoop = 0




WHILE @vLoop =0
BEGIN

IF EXISTS (select count(contact_id) from contact22 where Reported= 0)
BEGIN

DECLARE cr_OuterLoop CURSOR FOR
SELECT TOP 1 Address,ZIP,FullName,Gender FROM #tTemp where Reported = 0
OPEN cr_OuterLoop
FETCH NEXT FROM cr_OuterLoop INTO
@vOuterAddress,@vOuterZip,@vOuterName,@vOuterGende r

WHILE (@@FETCH_STATUS <> -1)
BEGIN
SET @vOuterZipTemp = @vOuterZip
SET @vOuterZip = '"' + @vOuterZip + '"'
IF (@@FETCH_STATUS <> -2)

-- SET @vTemp =(select count(DedupId) from #tTemp where Reported
reported
= 0)
-- Update DedupProcessesExecutionTime SET TimesExecuted = TimesExecuted
+
1
--, Reported = @vTemp

DELETE FROM TempDdpSup

INSERT INTO TempDdpSup
SELECT Contact_ID From Contact22 con
WHERE
con.Gender = @vOuterGender
AND (Con.Reported = 0 or con.reported is null)
AND Contains(con.zip,@vOuterZip)
AND Freetext(Address, @vOuterAddress)
AND Freetext(Con.Full_Name,@vOuterName)

SELECT TOP 1 @vContactID = Con.Contact_Id,@vPriorityCode =
Con.Priority_Code,@vTitle = Con.Title,@vFirstName = Con.First_Name,
@vLastName = Con.Last_Name,@vCity = Con.City,@vState =
Con.State_,@vCountry=Con.Country,
@vEmail = Con.Email,@vPhone = Con.Phone,@vCell = Con.Cell,@vDoNotEmail
= Con.Do_Not_Use_Email,
@vDoNotMobile = Con.Do_Not_Use_Mobile, @vDoNotAll = Con.Do_Not_Use_All,
@vDoNotMail = Con.Do_Not_Mail,@vDPAConsent = Con.DPA_Consent,
@vDPAConsentTP = Con.DPA_Consent_TP,@vDPAQasked = Con.DPA_QAsked
FROM
Contact22 Con
WHERE
Con.Contact_id In (SELECT contact_id FROm TempDdpSup)
ORDER BY
Con.Rn_Create_Date DESC


INSERT INTO FinalContact
SELECT TOP 1


@vContactID,@vPriorityCode,@vTitle,@vFirstName,@vL astName,@vOuterName,@vOute
rGender,@vOuterAddress,@vOuterZipTemp,@vCity,



@vState,@vCountry,@vEmail,@vPhone,@vCell,@vDoNotEm ail,@vDoNotMobile,@vDoNotA
ll,@vDoNotMail,@vDPAConsent,
@vDPAConsentTP,@vDPAQasked,
NULL,NULL,0,
MAX(Ord.Order_Date) AS RecentPurchaseDate,
CASE Ord.Status WHEN 'Payment Taken' THEN Count(Ord.Order__Id) ELSE 0
END AS NumberOfOrders,
-- MAX(Bro.Sent_Date) AS RecentBrochureDate,
MAX(Book.Start_Date) AS RecentparticipantDate,
MAX(Bro.Request_Date) AS RecentBrochureRequestDate,
0
FROM TempDdpSup Con
LEFT OUTER JOIN Order_ Ord ON Con.Contact_Id = Ord.Bill_To_Contact_Id
LEFT OUTER JOIN Brochure Bro ON Con.Contact_Id = Bro.Contact_Id
LEFT OUTER JOIN Booking Book ON Con.Contact_Id = Book.Contact_Id
GROUP BY
Ord.Status
ORDER BY
MAX(Ord.Order_Date) DESC,
-- MAX(Bro.Sent_Date) DESC,
MAX(Book.Start_Date),
MAX(Bro.Request_Date)
--update #tTemp so next time looped row will not be selected
UPDATE #tTemp SET Reported = 1 WHERE
zip = @vOuterZipTemp
AND Address=@vOuterAddress
AND FullName =@vOuterName
AND Gender = @vOuterGender
--Update contact2 so next time these rows will not be selected
UPDATE Contact22 SET Reported = 1 WHERE
Contact_Id IN (SELECT contact_id FROm TempDdpSup)



--CODE FOR CONTACT DEATIL STARTS HERE
INSERT INTO DenormalizedContactDetails
SELECT @vContactID
,TempCon.contact_Id,Ord.Order_Date,Cha.Channel_Ide ntifier,
NULL,NULL,Ord.Sub_Total
FROM TempDdpSup TempCon
INNER JOIN Order_ Ord ON TempCon.Contact_Id = Ord.Bill_To_Contact_Id
INNER JOIN RLD_Order_Channels Cha ON Ord.Order_Channel_Id =
Cha.RLD_Order_Channels_Id
--insert related Brochure
INSERT INTO DenormalizedContactDetails
SELECT @vContactID,TempCon.contact_Id,NULL,NULL,NULL,
Bro.Request_Date,0
FROM TempDdpSup TempCon
INNER JOIN Brochure Bro ON TempCon.Contact_Id = Bro.Contact_Id
--insert related bookings (participent)
INSERT INTO DenormalizedContactDetails
SELECT @vContactID,TempCon.contact_Id,NULL,NULL,
Book.Start_Date,NULL,0
FROM TempDdpSup TempCon
INNER JOIN Booking Book ON TempCon.Contact_Id = Book.Contact_Id
/* --insert related recipient dates
INSERT INTO DenormalizedContactDetails
SELECT @vContactID,TempCon.contact_Id,NULL,NULL,
NULL,NULL,OrdD.Rn_Create_Date,0
FROM TempDdpSup TempCon
INNER JOIN _Order_Detail OrdD ON TempCon.Contact_Id = OrdD.Recipient
*/
--CODE FOR CONTACT DEATIL ENDS HERE



FETCH NEXT FROM cr_OuterLoop INTO
@vOuterAddress,@vOuterZip,@vOuterName,@vOuterGende r
END
CLOSE cr_OuterLoop
DEALLOCATE cr_OuterLoop
END
ELSE
BEGIN
SET @vLoop = 1
END
END



GO



I open for all the suggestions
Ahsan
"Hilary Cotter" wrote:

I'm not sure if full text search is the best tool for what you are
doing.

The type of fuzzy search that SQL FTS uses while doing FreeText search
es
will do linguistic stemming and expand the search to singular and
plural
forms of the search phrase. So it seems to me you are doing a freetext
search to match

Mr. John Q. Public with J.Q.Public, J. Q. Public, John Public, etc. It
simply won't work for address and name.


It also seems that you are executing the query once for each entry you
have
in your table. After you return matches, do you prune/consolidate
these
matches from the table?

I also don't understand why you are doing a Contains query on the zip
code?
Surely this should be an equality match or perhaps a match with
left(zip,5)
=left(@vOuterZip, 5), or perhaps a case statement so you could avoid
the
left(zip,5) and merely search on zip.

Lastly I strongly urge you to have a look at commercial products which
do
exactly what you are looking for and have logic built into them to
handle
all the possible cases of street name combination, or name variants,
as
well
as compensating for spelling errors, like the product in the below
link.

http://www.name-searching.com/Correct_Address.html

--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html


"Ahsan" <Ahsan (AT) discussions (DOT) microsoft.com> wrote in message
news:3DB4A82E-023B-43D5-BE38-042B41846B71 (AT) microsoft (DOT) com...
Operating System: Windows 2000 Server
Server Brand: DELL (Intel Penium 111, 1133 MHz)
Sql Server Standadrd Edition
Ram: 2GB
Processor: 1
Fulltext Index Size: 52MB
Number Of Rows In Table: 1.2 Million Aprox
SQL SERVER MEMORY SETTINGS: FIXED, 1945 MB
MINIMUM QUERY MEMORY SETTING: 1048576 KB
Average Records Processing Per Minute : 7.2
Average Records Processing Per Hour : 72
Target NUmber Of Rows: 1.1 Million

stored procedure i am using is used to remove duplications from the
contacts
list, I altered the few parts of stored procedure and checked every
statement
using profiler,the only one taking long time to execute is the FULL
TEXT
SEARCHES one,


SELECT Contact_ID From Contact22 con
WHERE
con.Gender = @vOuterGender
AND (Con.Reported = 0 or con.reported is null)
AND Contains(con.zip,@vOuterZip)
AND Freetext(Address, @vOuterAddress)
AND Freetext(Con.Full_Name,@vOuterName)


as i mentioned that I have 1.1 million rows table on which fulltext
is
enabled on three cloumns

Address,zip and Fullname (total size after full population is 52MB)

so what all my stored procedure does is take every record from the
contact
table and matches it against others presnet in the table(1.1 million
rows)
using fulltext searches, now these full text searches are taking
time!!!


Currrenlty with all the settings I mentioned above my virtual memory
is
set
to 1.5 times then the actual RAM, my database and full text indexes
are on
the same drive(thats the only choice) and it is execting at the
speed
of

20000 rows in 24 hours which means it required almost 55 days to
finish,
unacceptable.....
can any one tell me that how to improve the performance in a senario
like
mine for full text seaches? or is it the standard speed every one
gets
from
full text searches ?










Reply With Quote
  #5  
Old   
Hilary Cotter
 
Posts: n/a

Default Re: Full Text Seach Performance - 08-26-2004 , 06:51 AM



strange, that's the same link I provided in my first response to Ahsan

--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html


"John Kane" <jt-kane (AT) comcast (DOT) net> wrote

Quote:
Ahsan,
While I agree with Hilary that SQL FTS is not the best solution for this
task of data cleansing, you can use 3rd party software to clean-up and
match
your SQL based textual data, for example:

Intelligent Search Technology - Searching and Matching Software
http://www.name-searching.com/

"NameSearch® - Search and matching software for enterprise systems."
"CorrectAddress® - Is (US CASS Certified) software that validates,
corrects,
and enhances address information"
"MerlinMerge® Is a sophisticated list management tool that performs
duplicate record detection, merge/purge operations and can produce house
hold link determination."

While I've not (yet) tested this software, you can call their 800 number
and
get a free 60-day trial to test it out...

Regards,
John




"Hilary Cotter" <hilaryk (AT) att (DOT) net> wrote in message
news:ODRfjTriEHA.2848 (AT) TK2MSFTNGP10 (DOT) phx.gbl...
this is absolutely the wrong way to do this.

It will never match rd with road, or street with st or steert with
street,
ave with avenue, John Smith, with J Smith, etc.

The strategy for handling this sort of a problem is to first clean up
your
data, by changing all occurences of St. or st, to Street, similar with
Road,
Avenue, Highway, etc. You will have to do something to put the names
into
a
common format.

After you have done this, you should start to group data by zip code,
and
then examine the groupings to scrub further and remove duplicates. Then
group by city, examine the groupings and scrub further and remove
duplicates. Then group by street and scrub further and remove
duplicates.

Then you can do the same procedure on last name to try to find people
with
the same names, but who have moved.

Again I believe SQL FTS is the wrong tool for this, and there are
commercial
programs out there which can do much better jobs faster then you could.

--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html


"Ahsan" <Ahsan (AT) discussions (DOT) microsoft.com> wrote in message
news:CB88EA56-710F-43E8-AFC1-BEDD35CD4A50 (AT) microsoft (DOT) com...
Thanks for the reply Hilary,
Yes I am using this procedure to catch possible duplicates, as our web
site
allows a customer to enter its details more then one time, this leads
to
duplicate records, and they can have transactions related against
different
customers ID's, My goal is to give marketing department a clean
mailing
list
for their mail shots and merge related records of customers (as a
customer
can be involved in differ rent processes under different ID's) so
they
can
query using different criteria's, for this purpose I have created Full
Text
Indexes on 3 columns of the table in which My Contact information is
present,

Address
Zip
FullName

Now this table contains multiple entries for one customer ( as it is
allowed
by design and according to business requirements)

Example:

CUTOMER ID NAME ADDRESS ZIP
1 xyz 88 Corp Road SE4 1FG
2 xyz 88 Corp RD SE4 1 FG

Note: Road and RD, SE4 1NF and SE4 1 NF

so My try is to catch these matched and other records through full
text
search, query through their related records in other tables Like their
orders, brochure requests and their participations in different events
and
create two final tables

FinalContact (Master)
DenormalizedContactDetails (Detail)

With FinalContact populated with unique customers(no duplications) and
DenormalizedContactDetails with the activity related detail along with
their
duplicates ID's, I have tried this stored procedure with sample
records
and
it works really fine and I am getting my desired output, the only
problem
is
that full text search under cursor loop is taking to much time.
As you can see in the store procedure that I consolidates the matches
as
well.
As I mentioned in records example that data quality for Zip is not
good
so
I
have to use Contain for it
We are already dealing with a company to remove duplication from our
CRM,
but I am just working on providing mailing list so I have to deal with
it
as
quick as possible

STORE PROCEDURE:

IF EXISTS (SELECT name FROM sysobjects
WHERE name = 'stp_PopulateDedupedContactsAlteredWithDetail'
AND
type = 'P')
DROP PROCEDURE stp_PopulateDedupedContactsAlteredWithDetail
GO


CREATE PROCEDURE stp_PopulateDedupedContactsAlteredWithDetail
AS
SET QUOTED_IDENTIFIER OFF
SET NOCOUNT ON



DECLARE @vOuterContactID binary(8)
DECLARE @vOuterAddress varchar(50)
DECLARE @vOuterZip varchar(14)
DECLARE @vOuterName varchar(30)
DECLARE @vOuterGender varchar(6)
DECLARE @vOuterZipTemp varchar(14)

DECLARE @vContactID binary(8)
DECLARE @vPriorityCode int
DECLARE @vTitle varchar(9)
DECLARE @vFirstName varchar(30)
DECLARE @vLastName varchar(30)
DECLARE @vCity varchar(40)
DECLARE @vState varchar(30)
DECLARE @vCountry varchar(35)
DECLARE @vEmail varchar(50)
DECLARE @vPhone varchar(25)
DECLARE @vCell varchar(25)
DECLARE @vDoNotMail tinyint
DECLARE @vDoNotMobile tinyint
DECLARE @vDoNotAll tinyint
DECLARE @vDoNotEmail tinyint
DECLARE @vDPAConsent tinyint
DECLARE @vDPAConsentTP tinyint
DECLARE @vDPAQasked tinyint

Create Table #tTemp (
DedupId int IDENTITY (1, 1) NOT NULL,
Address varchar(50),
Zip varchar(12),
FullName varchar(30),
NTimes int,
Gender varchar(10),
Reported tinyint NULL
)
Insert into #tTemp
select con.Address,
con.ZIP,
con.full_name As "Full Name",
COUNT(con1.contact_Id) as "Number of Times In Contact
Table",
con.Gender
from contact22 con
FUll Outer JOIN Contact22 con1 ON
con.zip = con1.zip
and
con.address = con1.address
and
con.full_name = con1.full_name
WHERE
con.contact_id = con1.contact_Id
AND con.reported = 0
group by con.Address,con.zip,con.full_name,Con.Gender
order by Con.Address,COUNT(con1.contact_Id) desc

Update #tTemp SET Reported = 0

CREATE NONCLUSTERED INDEX Temp_dedup_Index
ON #tTemp (DedupId)
WITH FILLFACTOR = 100



CREATE NONCLUSTERED INDEX Temp_Address_Index
ON #tTemp (Address)
WITH FILLFACTOR = 100

CREATE NONCLUSTERED INDEX Temp_Zip_Index
ON #tTemp (Zip)
WITH FILLFACTOR = 100


CREATE NONCLUSTERED INDEX Temp_FullName_Index
ON #tTemp (FullName)
WITH FILLFACTOR = 100

CREATE NONCLUSTERED INDEX Temp_Reported_Index
ON #tTemp (Reported)
WITH FILLFACTOR = 100



DECLARE @vLoop tinyint
SET @vLoop = 0




WHILE @vLoop =0
BEGIN

IF EXISTS (select count(contact_id) from contact22 where Reported= 0)
BEGIN

DECLARE cr_OuterLoop CURSOR FOR
SELECT TOP 1 Address,ZIP,FullName,Gender FROM #tTemp where Reported =
0
OPEN cr_OuterLoop
FETCH NEXT FROM cr_OuterLoop INTO
@vOuterAddress,@vOuterZip,@vOuterName,@vOuterGende r

WHILE (@@FETCH_STATUS <> -1)
BEGIN
SET @vOuterZipTemp = @vOuterZip
SET @vOuterZip = '"' + @vOuterZip + '"'
IF (@@FETCH_STATUS <> -2)

-- SET @vTemp =(select count(DedupId) from #tTemp where Reported
reported
= 0)
-- Update DedupProcessesExecutionTime SET TimesExecuted =
TimesExecuted
+
1
--, Reported = @vTemp

DELETE FROM TempDdpSup

INSERT INTO TempDdpSup
SELECT Contact_ID From Contact22 con
WHERE
con.Gender = @vOuterGender
AND (Con.Reported = 0 or con.reported is null)
AND Contains(con.zip,@vOuterZip)
AND Freetext(Address, @vOuterAddress)
AND Freetext(Con.Full_Name,@vOuterName)

SELECT TOP 1 @vContactID = Con.Contact_Id,@vPriorityCode =
Con.Priority_Code,@vTitle = Con.Title,@vFirstName = Con.First_Name,
@vLastName = Con.Last_Name,@vCity = Con.City,@vState =
Con.State_,@vCountry=Con.Country,
@vEmail = Con.Email,@vPhone = Con.Phone,@vCell = Con.Cell,@vDoNotEmail
= Con.Do_Not_Use_Email,
@vDoNotMobile = Con.Do_Not_Use_Mobile, @vDoNotAll =
Con.Do_Not_Use_All,
@vDoNotMail = Con.Do_Not_Mail,@vDPAConsent = Con.DPA_Consent,
@vDPAConsentTP = Con.DPA_Consent_TP,@vDPAQasked = Con.DPA_QAsked
FROM
Contact22 Con
WHERE
Con.Contact_id In (SELECT contact_id FROm TempDdpSup)
ORDER BY
Con.Rn_Create_Date DESC


INSERT INTO FinalContact
SELECT TOP 1



@vContactID,@vPriorityCode,@vTitle,@vFirstName,@vL astName,@vOuterName,@vOute
rGender,@vOuterAddress,@vOuterZipTemp,@vCity,




@vState,@vCountry,@vEmail,@vPhone,@vCell,@vDoNotEm ail,@vDoNotMobile,@vDoNotA
ll,@vDoNotMail,@vDPAConsent,
@vDPAConsentTP,@vDPAQasked,
NULL,NULL,0,
MAX(Ord.Order_Date) AS RecentPurchaseDate,
CASE Ord.Status WHEN 'Payment Taken' THEN Count(Ord.Order__Id) ELSE 0
END AS NumberOfOrders,
-- MAX(Bro.Sent_Date) AS RecentBrochureDate,
MAX(Book.Start_Date) AS RecentparticipantDate,
MAX(Bro.Request_Date) AS RecentBrochureRequestDate,
0
FROM TempDdpSup Con
LEFT OUTER JOIN Order_ Ord ON Con.Contact_Id = Ord.Bill_To_Contact_Id
LEFT OUTER JOIN Brochure Bro ON Con.Contact_Id = Bro.Contact_Id
LEFT OUTER JOIN Booking Book ON Con.Contact_Id = Book.Contact_Id
GROUP BY
Ord.Status
ORDER BY
MAX(Ord.Order_Date) DESC,
-- MAX(Bro.Sent_Date) DESC,
MAX(Book.Start_Date),
MAX(Bro.Request_Date)
--update #tTemp so next time looped row will not be selected
UPDATE #tTemp SET Reported = 1 WHERE
zip = @vOuterZipTemp
AND Address=@vOuterAddress
AND FullName =@vOuterName
AND Gender = @vOuterGender
--Update contact2 so next time these rows will not be selected
UPDATE Contact22 SET Reported = 1 WHERE
Contact_Id IN (SELECT contact_id FROm TempDdpSup)



--CODE FOR CONTACT DEATIL STARTS HERE
INSERT INTO DenormalizedContactDetails
SELECT @vContactID
,TempCon.contact_Id,Ord.Order_Date,Cha.Channel_Ide ntifier,
NULL,NULL,Ord.Sub_Total
FROM TempDdpSup TempCon
INNER JOIN Order_ Ord ON TempCon.Contact_Id = Ord.Bill_To_Contact_Id
INNER JOIN RLD_Order_Channels Cha ON Ord.Order_Channel_Id =
Cha.RLD_Order_Channels_Id
--insert related Brochure
INSERT INTO DenormalizedContactDetails
SELECT @vContactID,TempCon.contact_Id,NULL,NULL,NULL,
Bro.Request_Date,0
FROM TempDdpSup TempCon
INNER JOIN Brochure Bro ON TempCon.Contact_Id = Bro.Contact_Id
--insert related bookings (participent)
INSERT INTO DenormalizedContactDetails
SELECT @vContactID,TempCon.contact_Id,NULL,NULL,
Book.Start_Date,NULL,0
FROM TempDdpSup TempCon
INNER JOIN Booking Book ON TempCon.Contact_Id = Book.Contact_Id
/* --insert related recipient dates
INSERT INTO DenormalizedContactDetails
SELECT @vContactID,TempCon.contact_Id,NULL,NULL,
NULL,NULL,OrdD.Rn_Create_Date,0
FROM TempDdpSup TempCon
INNER JOIN _Order_Detail OrdD ON TempCon.Contact_Id = OrdD.Recipient
*/
--CODE FOR CONTACT DEATIL ENDS HERE



FETCH NEXT FROM cr_OuterLoop INTO
@vOuterAddress,@vOuterZip,@vOuterName,@vOuterGende r
END
CLOSE cr_OuterLoop
DEALLOCATE cr_OuterLoop
END
ELSE
BEGIN
SET @vLoop = 1
END
END



GO



I open for all the suggestions
Ahsan
"Hilary Cotter" wrote:

I'm not sure if full text search is the best tool for what you are
doing.

The type of fuzzy search that SQL FTS uses while doing FreeText
search
es
will do linguistic stemming and expand the search to singular and
plural
forms of the search phrase. So it seems to me you are doing a
freetext
search to match

Mr. John Q. Public with J.Q.Public, J. Q. Public, John Public, etc.
It
simply won't work for address and name.


It also seems that you are executing the query once for each entry
you
have
in your table. After you return matches, do you prune/consolidate
these
matches from the table?

I also don't understand why you are doing a Contains query on the
zip
code?
Surely this should be an equality match or perhaps a match with
left(zip,5)
=left(@vOuterZip, 5), or perhaps a case statement so you could avoid
the
left(zip,5) and merely search on zip.

Lastly I strongly urge you to have a look at commercial products
which
do
exactly what you are looking for and have logic built into them to
handle
all the possible cases of street name combination, or name variants,
as
well
as compensating for spelling errors, like the product in the below
link.

http://www.name-searching.com/Correct_Address.html

--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html


"Ahsan" <Ahsan (AT) discussions (DOT) microsoft.com> wrote in message
news:3DB4A82E-023B-43D5-BE38-042B41846B71 (AT) microsoft (DOT) com...
Operating System: Windows 2000 Server
Server Brand: DELL (Intel Penium 111, 1133 MHz)
Sql Server Standadrd Edition
Ram: 2GB
Processor: 1
Fulltext Index Size: 52MB
Number Of Rows In Table: 1.2 Million Aprox
SQL SERVER MEMORY SETTINGS: FIXED, 1945 MB
MINIMUM QUERY MEMORY SETTING: 1048576 KB
Average Records Processing Per Minute : 7.2
Average Records Processing Per Hour : 72
Target NUmber Of Rows: 1.1 Million

stored procedure i am using is used to remove duplications from
the
contacts
list, I altered the few parts of stored procedure and checked
every
statement
using profiler,the only one taking long time to execute is the
FULL
TEXT
SEARCHES one,


SELECT Contact_ID From Contact22 con
WHERE
con.Gender = @vOuterGender
AND (Con.Reported = 0 or con.reported is null)
AND Contains(con.zip,@vOuterZip)
AND Freetext(Address, @vOuterAddress)
AND Freetext(Con.Full_Name,@vOuterName)


as i mentioned that I have 1.1 million rows table on which
fulltext
is
enabled on three cloumns

Address,zip and Fullname (total size after full population is
52MB)

so what all my stored procedure does is take every record from the
contact
table and matches it against others presnet in the table(1.1
million
rows)
using fulltext searches, now these full text searches are taking
time!!!


Currrenlty with all the settings I mentioned above my virtual
memory
is
set
to 1.5 times then the actual RAM, my database and full text
indexes
are on
the same drive(thats the only choice) and it is execting at the
speed
of

20000 rows in 24 hours which means it required almost 55 days to
finish,
unacceptable.....
can any one tell me that how to improve the performance in a
senario
like
mine for full text seaches? or is it the standard speed every one
gets
from
full text searches ?












Reply With Quote
  #6  
Old   
John Kane
 
Posts: n/a

Default Re: Full Text Seach Performance - 08-26-2004 , 09:08 AM



Well, then, I guess the saying "great minds think alike" is true! ;-)


"Hilary Cotter" <hilaryk (AT) att (DOT) net> wrote

Quote:
strange, that's the same link I provided in my first response to Ahsan

--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html


"John Kane" <jt-kane (AT) comcast (DOT) net> wrote in message
news:OhiyyNxiEHA.140 (AT) TK2MSFTNGP12 (DOT) phx.gbl...
Ahsan,
While I agree with Hilary that SQL FTS is not the best solution for this
task of data cleansing, you can use 3rd party software to clean-up and
match
your SQL based textual data, for example:

Intelligent Search Technology - Searching and Matching Software
http://www.name-searching.com/

"NameSearch® - Search and matching software for enterprise systems."
"CorrectAddress® - Is (US CASS Certified) software that validates,
corrects,
and enhances address information"
"MerlinMerge® Is a sophisticated list management tool that performs
duplicate record detection, merge/purge operations and can produce house
hold link determination."

While I've not (yet) tested this software, you can call their 800 number
and
get a free 60-day trial to test it out...

Regards,
John




"Hilary Cotter" <hilaryk (AT) att (DOT) net> wrote in message
news:ODRfjTriEHA.2848 (AT) TK2MSFTNGP10 (DOT) phx.gbl...
this is absolutely the wrong way to do this.

It will never match rd with road, or street with st or steert with
street,
ave with avenue, John Smith, with J Smith, etc.

The strategy for handling this sort of a problem is to first clean up
your
data, by changing all occurences of St. or st, to Street, similar with
Road,
Avenue, Highway, etc. You will have to do something to put the names
into
a
common format.

After you have done this, you should start to group data by zip code,
and
then examine the groupings to scrub further and remove duplicates.
Then
group by city, examine the groupings and scrub further and remove
duplicates. Then group by street and scrub further and remove
duplicates.

Then you can do the same procedure on last name to try to find people
with
the same names, but who have moved.

Again I believe SQL FTS is the wrong tool for this, and there are
commercial
programs out there which can do much better jobs faster then you
could.

--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html


"Ahsan" <Ahsan (AT) discussions (DOT) microsoft.com> wrote in message
news:CB88EA56-710F-43E8-AFC1-BEDD35CD4A50 (AT) microsoft (DOT) com...
Thanks for the reply Hilary,
Yes I am using this procedure to catch possible duplicates, as our
web
site
allows a customer to enter its details more then one time, this
leads
to
duplicate records, and they can have transactions related against
different
customers ID's, My goal is to give marketing department a clean
mailing
list
for their mail shots and merge related records of customers (as a
customer
can be involved in differ rent processes under different ID's) so
they
can
query using different criteria's, for this purpose I have created
Full
Text
Indexes on 3 columns of the table in which My Contact information is
present,

Address
Zip
FullName

Now this table contains multiple entries for one customer ( as it is
allowed
by design and according to business requirements)

Example:

CUTOMER ID NAME ADDRESS ZIP
1 xyz 88 Corp Road SE4 1FG
2 xyz 88 Corp RD SE4 1 FG

Note: Road and RD, SE4 1NF and SE4 1 NF

so My try is to catch these matched and other records through full
text
search, query through their related records in other tables Like
their
orders, brochure requests and their participations in different
events
and
create two final tables

FinalContact (Master)
DenormalizedContactDetails (Detail)

With FinalContact populated with unique customers(no duplications)
and
DenormalizedContactDetails with the activity related detail along
with
their
duplicates ID's, I have tried this stored procedure with sample
records
and
it works really fine and I am getting my desired output, the only
problem
is
that full text search under cursor loop is taking to much time.
As you can see in the store procedure that I consolidates the
matches
as
well.
As I mentioned in records example that data quality for Zip is not
good
so
I
have to use Contain for it
We are already dealing with a company to remove duplication from our
CRM,
but I am just working on providing mailing list so I have to deal
with
it
as
quick as possible

STORE PROCEDURE:

IF EXISTS (SELECT name FROM sysobjects
WHERE name = 'stp_PopulateDedupedContactsAlteredWithDetail'
AND
type = 'P')
DROP PROCEDURE stp_PopulateDedupedContactsAlteredWithDetail
GO


CREATE PROCEDURE stp_PopulateDedupedContactsAlteredWithDetail
AS
SET QUOTED_IDENTIFIER OFF
SET NOCOUNT ON



DECLARE @vOuterContactID binary(8)
DECLARE @vOuterAddress varchar(50)
DECLARE @vOuterZip varchar(14)
DECLARE @vOuterName varchar(30)
DECLARE @vOuterGender varchar(6)
DECLARE @vOuterZipTemp varchar(14)

DECLARE @vContactID binary(8)
DECLARE @vPriorityCode int
DECLARE @vTitle varchar(9)
DECLARE @vFirstName varchar(30)
DECLARE @vLastName varchar(30)
DECLARE @vCity varchar(40)
DECLARE @vState varchar(30)
DECLARE @vCountry varchar(35)
DECLARE @vEmail varchar(50)
DECLARE @vPhone varchar(25)
DECLARE @vCell varchar(25)
DECLARE @vDoNotMail tinyint
DECLARE @vDoNotMobile tinyint
DECLARE @vDoNotAll tinyint
DECLARE @vDoNotEmail tinyint
DECLARE @vDPAConsent tinyint
DECLARE @vDPAConsentTP tinyint
DECLARE @vDPAQasked tinyint

Create Table #tTemp (
DedupId int IDENTITY (1, 1) NOT NULL,
Address varchar(50),
Zip varchar(12),
FullName varchar(30),
NTimes int,
Gender varchar(10),
Reported tinyint NULL
)
Insert into #tTemp
select con.Address,
con.ZIP,
con.full_name As "Full Name",
COUNT(con1.contact_Id) as "Number of Times In Contact
Table",
con.Gender
from contact22 con
FUll Outer JOIN Contact22 con1 ON
con.zip = con1.zip
and
con.address = con1.address
and
con.full_name = con1.full_name
WHERE
con.contact_id = con1.contact_Id
AND con.reported = 0
group by con.Address,con.zip,con.full_name,Con.Gender
order by Con.Address,COUNT(con1.contact_Id) desc

Update #tTemp SET Reported = 0

CREATE NONCLUSTERED INDEX Temp_dedup_Index
ON #tTemp (DedupId)
WITH FILLFACTOR = 100



CREATE NONCLUSTERED INDEX Temp_Address_Index
ON #tTemp (Address)
WITH FILLFACTOR = 100

CREATE NONCLUSTERED INDEX Temp_Zip_Index
ON #tTemp (Zip)
WITH FILLFACTOR = 100


CREATE NONCLUSTERED INDEX Temp_FullName_Index
ON #tTemp (FullName)
WITH FILLFACTOR = 100

CREATE NONCLUSTERED INDEX Temp_Reported_Index
ON #tTemp (Reported)
WITH FILLFACTOR = 100



DECLARE @vLoop tinyint
SET @vLoop = 0




WHILE @vLoop =0
BEGIN

IF EXISTS (select count(contact_id) from contact22 where Reported=
0)
BEGIN

DECLARE cr_OuterLoop CURSOR FOR
SELECT TOP 1 Address,ZIP,FullName,Gender FROM #tTemp where Reported
=
0
OPEN cr_OuterLoop
FETCH NEXT FROM cr_OuterLoop INTO
@vOuterAddress,@vOuterZip,@vOuterName,@vOuterGende r

WHILE (@@FETCH_STATUS <> -1)
BEGIN
SET @vOuterZipTemp = @vOuterZip
SET @vOuterZip = '"' + @vOuterZip + '"'
IF (@@FETCH_STATUS <> -2)

-- SET @vTemp =(select count(DedupId) from #tTemp where Reported
reported
= 0)
-- Update DedupProcessesExecutionTime SET TimesExecuted =
TimesExecuted
+
1
--, Reported = @vTemp

DELETE FROM TempDdpSup

INSERT INTO TempDdpSup
SELECT Contact_ID From Contact22 con
WHERE
con.Gender = @vOuterGender
AND (Con.Reported = 0 or con.reported is null)
AND Contains(con.zip,@vOuterZip)
AND Freetext(Address, @vOuterAddress)
AND Freetext(Con.Full_Name,@vOuterName)

SELECT TOP 1 @vContactID = Con.Contact_Id,@vPriorityCode =
Con.Priority_Code,@vTitle = Con.Title,@vFirstName = Con.First_Name,
@vLastName = Con.Last_Name,@vCity = Con.City,@vState =
Con.State_,@vCountry=Con.Country,
@vEmail = Con.Email,@vPhone = Con.Phone,@vCell =
Con.Cell,@vDoNotEmail
= Con.Do_Not_Use_Email,
@vDoNotMobile = Con.Do_Not_Use_Mobile, @vDoNotAll =
Con.Do_Not_Use_All,
@vDoNotMail = Con.Do_Not_Mail,@vDPAConsent = Con.DPA_Consent,
@vDPAConsentTP = Con.DPA_Consent_TP,@vDPAQasked = Con.DPA_QAsked
FROM
Contact22 Con
WHERE
Con.Contact_id In (SELECT contact_id FROm TempDdpSup)
ORDER BY
Con.Rn_Create_Date DESC


INSERT INTO FinalContact
SELECT TOP 1




@vContactID,@vPriorityCode,@vTitle,@vFirstName,@vL astName,@vOuterName,@vOute
rGender,@vOuterAddress,@vOuterZipTemp,@vCity,





@vState,@vCountry,@vEmail,@vPhone,@vCell,@vDoNotEm ail,@vDoNotMobile,@vDoNotA
ll,@vDoNotMail,@vDPAConsent,
@vDPAConsentTP,@vDPAQasked,
NULL,NULL,0,
MAX(Ord.Order_Date) AS RecentPurchaseDate,
CASE Ord.Status WHEN 'Payment Taken' THEN Count(Ord.Order__Id) ELSE
0
END AS NumberOfOrders,
-- MAX(Bro.Sent_Date) AS RecentBrochureDate,
MAX(Book.Start_Date) AS RecentparticipantDate,
MAX(Bro.Request_Date) AS RecentBrochureRequestDate,
0
FROM TempDdpSup Con
LEFT OUTER JOIN Order_ Ord ON Con.Contact_Id =
Ord.Bill_To_Contact_Id
LEFT OUTER JOIN Brochure Bro ON Con.Contact_Id = Bro.Contact_Id
LEFT OUTER JOIN Booking Book ON Con.Contact_Id = Book.Contact_Id
GROUP BY
Ord.Status
ORDER BY
MAX(Ord.Order_Date) DESC,
-- MAX(Bro.Sent_Date) DESC,
MAX(Book.Start_Date),
MAX(Bro.Request_Date)
--update #tTemp so next time looped row will not be selected
UPDATE #tTemp SET Reported = 1 WHERE
zip = @vOuterZipTemp
AND Address=@vOuterAddress
AND FullName =@vOuterName
AND Gender = @vOuterGender
--Update contact2 so next time these rows will not be selected
UPDATE Contact22 SET Reported = 1 WHERE
Contact_Id IN (SELECT contact_id FROm TempDdpSup)



--CODE FOR CONTACT DEATIL STARTS HERE
INSERT INTO DenormalizedContactDetails
SELECT @vContactID
,TempCon.contact_Id,Ord.Order_Date,Cha.Channel_Ide ntifier,
NULL,NULL,Ord.Sub_Total
FROM TempDdpSup TempCon
INNER JOIN Order_ Ord ON TempCon.Contact_Id = Ord.Bill_To_Contact_Id
INNER JOIN RLD_Order_Channels Cha ON Ord.Order_Channel_Id =
Cha.RLD_Order_Channels_Id
--insert related Brochure
INSERT INTO DenormalizedContactDetails
SELECT @vContactID,TempCon.contact_Id,NULL,NULL,NULL,
Bro.Request_Date,0
FROM TempDdpSup TempCon
INNER JOIN Brochure Bro ON TempCon.Contact_Id = Bro.Contact_Id
--insert related bookings (participent)
INSERT INTO DenormalizedContactDetails
SELECT @vContactID,TempCon.contact_Id,NULL,NULL,
Book.Start_Date,NULL,0
FROM TempDdpSup TempCon
INNER JOIN Booking Book ON TempCon.Contact_Id = Book.Contact_Id
/* --insert related recipient dates
INSERT INTO DenormalizedContactDetails
SELECT @vContactID,TempCon.contact_Id,NULL,NULL,
NULL,NULL,OrdD.Rn_Create_Date,0
FROM TempDdpSup TempCon
INNER JOIN _Order_Detail OrdD ON TempCon.Contact_Id = OrdD.Recipient
*/
--CODE FOR CONTACT DEATIL ENDS HERE



FETCH NEXT FROM cr_OuterLoop INTO
@vOuterAddress,@vOuterZip,@vOuterName,@vOuterGende r
END
CLOSE cr_OuterLoop
DEALLOCATE cr_OuterLoop
END
ELSE
BEGIN
SET @vLoop = 1
END
END



GO



I open for all the suggestions
Ahsan
"Hilary Cotter" wrote:

I'm not sure if full text search is the best tool for what you are
doing.

The type of fuzzy search that SQL FTS uses while doing FreeText
search
es
will do linguistic stemming and expand the search to singular and
plural
forms of the search phrase. So it seems to me you are doing a
freetext
search to match

Mr. John Q. Public with J.Q.Public, J. Q. Public, John Public,
etc.
It
simply won't work for address and name.


It also seems that you are executing the query once for each entry
you
have
in your table. After you return matches, do you prune/consolidate
these
matches from the table?

I also don't understand why you are doing a Contains query on the
zip
code?
Surely this should be an equality match or perhaps a match with
left(zip,5)
=left(@vOuterZip, 5), or perhaps a case statement so you could
avoid
the
left(zip,5) and merely search on zip.

Lastly I strongly urge you to have a look at commercial products
which
do
exactly what you are looking for and have logic built into them to
handle
all the possible cases of street name combination, or name
variants,
as
well
as compensating for spelling errors, like the product in the below
link.

http://www.name-searching.com/Correct_Address.html

--
Hilary Cotter
Looking for a book on SQL Server replication?
http://www.nwsu.com/0974973602.html


"Ahsan" <Ahsan (AT) discussions (DOT) microsoft.com> wrote in message
news:3DB4A82E-023B-43D5-BE38-042B41846B71 (AT) microsoft (DOT) com...
Operating System: Windows 2000 Server
Server Brand: DELL (Intel Penium 111, 1133 MHz)
Sql Server Standadrd Edition
Ram: 2GB
Processor: 1
Fulltext Index Size: 52MB
Number Of Rows In Table: 1.2 Million Aprox
SQL SERVER MEMORY SETTINGS: FIXED, 1945 MB
MINIMUM QUERY MEMORY SETTING: 1048576 KB
Average Records Processing Per Minute : 7.2
Average Records Processing Per Hour : 72
Target NUmber Of Rows: 1.1 Million

stored procedure i am using is used to remove duplications from
the
contacts
list, I altered the few parts of stored procedure and checked
every
statement
using profiler,the only one taking long time to execute is the
FULL
TEXT
SEARCHES one,


SELECT Contact_ID From Contact22 con
WHERE
con.Gender = @vOuterGender
AND (Con.Reported = 0 or con.reported is null)
AND Contains(con.zip,@vOuterZip)
AND Freetext(Address, @vOuterAddress)
AND Freetext(Con.Full_Name,@vOuterName)


as i mentioned that I have 1.1 million rows table on which
fulltext
is
enabled on three cloumns

Address,zip and Fullname (total size after full population is
52MB)

so what all my stored procedure does is take every record from
the
contact
table and matches it against others presnet in the table(1.1
million
rows)
using fulltext searches, now these full text searches are taking
time!!!


Currrenlty with all the settings I mentioned above my virtual
memory
is
set
to 1.5 times then the actual RAM, my database and full text
indexes
are on
the same drive(thats the only choice) and it is execting at the
speed
of

20000 rows in 24 hours which means it required almost 55 days to
finish,
unacceptable.....
can any one tell me that how to improve the performance in a
senario
like
mine for full text seaches? or is it the standard speed every
one
gets
from
full text searches ?














Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2013, Jelsoft Enterprises Ltd.