Need help with searching PDF files stored in SQL

microsoft.public.sqlserver.fulltext

#1
Louie
07-05-2004, 12:42 AM

I will explain step by step what I have done; I hope this makes it
easier for the gurus to solve my problem.

Requirement: Be able to search PDF files stored in SQL.
SQL Server: 2000 (SP3)
Windows: 2000 Server

1. Installed Acrobat iFilter 5.0
2. Created a table to store PDFs.
create table PDFFiles
(
FileID int,
PDF image,
DocType char(4),
constraint pk_pdffiles primary key
(
FileID
)
)

3. Set up the table for fulltext search.
exec sp_fulltext_table 'pdffiles', 'create', 'pdf', 'pk_pdffiles'
exec sp_fulltext_column 'PDFFiles', 'pdf', 'add', default, 'DocType'

4. Insert PDFs into the table.
Done by a custom app I have written. To verify that the PDF was
inserted correctly, I used the app to grab the PDF out and I could
open the file in Acrobat Reader successfully.

The table looks something like this:
FileID  PDF                                          DocType
1       0x255044462D312E330D0A25E2E3CFD30D0A3134...  .pdf

5. Populate the index
exec sp_fulltext_table 'pdffiles', 'start_full'

The population process only took a few seconds. The following is what
event viewer's application log said:

The end of crawl for project <SQLServer$TEST SQL0001200006> has been
detected. The Gatherer successfully processed 2 documents totaling 0K.
It failed to filter 0 documents. 0 URLs could not be reached or were
denied access.

By double clicking on PDF catalog in Enterprise Manager, I got the
following info:

Status: Idle
Item Count: 2 (note I only inserted one PDF file)
Catalog Size: 1 MB
Unique Key Count: 737

6. Do a query

select * from pdffiles
where contains(pdf, 'possible')

The query returned nothing. I tried several other keywords, but all of
them failed.

What have I done wrong or missed?
Thank you.

ps. the pdf file in binary form is attached below

0x255044462D312E330D0A25E2E3CFD30D0A3134352030206F626A0D0A3C3C0D0A2F4C696E656172697A656420310D0A2F4C203537393038300D0A2F48205B203133333420343134205D0D0A2F4F203134370D0A2F452037383438370D0A2F4E2033390D0A2F54203537363035320D0A3E3E0D0A656E646F626A0D0A20202020
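
For readers who want to sanity-check a stored value like the one above without pulling it back out through Acrobat, here is a minimal Python sketch (not part of the original post). It decodes a 0x... literal and reads the %PDF-x.y header. The function names decode_sql_hex and pdf_version are made up for illustration, and the sample is only the prefix shown in the table row earlier in the post. A valid header only suggests the bytes were stored intact; it says nothing about whether an IFilter can extract text from the file.

```python
import re


def decode_sql_hex(literal: str) -> bytes:
    """Turn a '0x...' literal into raw bytes, ignoring spaces and line breaks."""
    digits = re.sub(r"\s+", "", literal)
    if digits.lower().startswith("0x"):
        digits = digits[2:]
    return bytes.fromhex(digits)


def pdf_version(data: bytes):
    """Return the version from a leading '%PDF-x.y' header, or None if absent."""
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    return match.group(1).decode("ascii") if match else None


# Prefix of the stored value as shown in the post (hypothetical usage).
sample = "0x255044462D312E330D0A25E2E3CFD30D0A3134"
header = decode_sql_hex(sample)
print(pdf_version(header))
```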

#2
John Kane
07-06-2004, 12:00 AM

Louie,
I didn't get the PDF file that you attached. Could you either reply with the
PDF or email it to me directly?

Thanks,
John



"Louie" <anonymous (AT) devdex (DOT) com> wrote

Quote:
John, thank you for your reply.

I have tried textcopy.exe, but it doesn't seem to make a difference. What I
am curious about is the type column. The definition from MSDN doesn't say
whether it should be defined as char(4) '.pdf' or char(3) 'pdf'.

[@type_colname =] 'type_column_name'
Is the name of a column in qualified_table_name that holds the document
type of column_name. This column must be char, nchar, varchar, or
nvarchar. It is only used when the data type of column_name is an image.
type_column_name is sysname, with no default.

I can confirm that the DocType column is char(4) with '.pdf' in there.

Any ideas? John, are you able to index and search the pdf file I've
attached? How long should it take to index one pdf file?

*** Sent via Devdex http://www.devdex.com ***
Don't just participate in USENET...get rewarded for it!




#3
Bob Horkay
07-09-2004, 07:20 AM

I have found this necessary for full-text indexing of PDFs:

Modify the registry to make full-text indexing single-threaded, because the
PDF filter does not support multi-threading. The key is:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Search\1.0\Gathering Manager\
Change the value of RobotThreadsNumber to 1.
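
Before changing the value, it may help to see what it is currently set to. The Python sketch below only reads the value, using the standard winreg module. It assumes the key path and value name exactly as given above, and the function name read_robot_threads is hypothetical. It returns None on non-Windows machines or when the key or value is missing.

```python
import sys

# Key path and value name as given in the post (under HKEY_LOCAL_MACHINE).
KEY_PATH = r"SOFTWARE\Microsoft\Search\1.0\Gathering Manager"
VALUE_NAME = "RobotThreadsNumber"


def read_robot_threads():
    """Return the current RobotThreadsNumber value, or None if unavailable."""
    if sys.platform != "win32":
        return None  # the registry only exists on Windows
    import winreg

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
            return value
    except OSError:
        # Key or value not present, or access denied.
        return None


print(read_robot_threads())
```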

There is a kb article somewhere on it, but I've forgotten the
number...

Bob Horkay

#4
John Kane
07-09-2004, 09:00 AM

Bob,
While that is sometimes the issue with full-text indexing of PDF files using
Adobe's PDF IFilter, it is not the case for Louie's specific PDF file and
his specific problem in this thread. He emailed me the PDF file and I used
Filtdump to analyze it. Because of how the PDF was created, its content is
"garbage" to the PDF IFilter.

Filtdump is a utility in the Platform SDK that dumps and analyzes the
content of a file as seen through its IFilter, in this case Adobe's PDF
IFilter. I ran it against the PDF file (test.pdf), and part of the output
is below:

filtdump -b d:\test.pdf
-- output:

Microsoft Word - GAOG Prospectus Rays 5 Mar working copy.doc
! !"#$%&$ ''()*&+, (-, (
(...'+'+'/"#$%!&''()*'!'!+,-".'&!$''/01''!2!3+!'!! ))(.'#!41'')/5#&"6-!7#!
""8..&"(/.'41&!!,00$9#&''$'&#05#'.''#, 0: ;
<1<7.=("66('<#=5"66('<..!;!=/"66('<.&;!!>=)""66(??????????????????????????? ?
?????????????????????
!!=566666=@606???????????????????????????????????? ???????????? A;;#; ;$
!)6666!@606!)6666!)6666! .....

<snip>
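
One rough way to flag output like this automatically is to measure how many tokens look like ordinary words. The Python sketch below is an illustrative heuristic only, not something filtdump or the IFilter provides. The function name wordlike_ratio and the token pattern are assumptions, and a sensible cut-off would have to be tuned on real documents.

```python
import re

# A token counts as "word-like" if it is letters (plus ' or -) with at most
# one trailing punctuation mark. This pattern is an assumption, not a standard.
WORD = re.compile(r"[A-Za-z][A-Za-z'\-]*[.,;:]?")


def wordlike_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that look like words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if WORD.fullmatch(token))
    return hits / len(tokens)


# Two lines taken from the filtdump output above.
title_line = "Microsoft Word - GAOG Prospectus Rays 5 Mar working copy.doc"
garbage_line = """! !"#$%&$ ''()*&+, (-, ("""

print(wordlike_ratio(title_line), wordlike_ratio(garbage_line))
```

A low ratio across most of a document's filtered text would suggest that the text the IFilter extracted is not searchable, even though the file opens normally in a reader.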

While I was able to open this PDF file with Adobe's Acrobat Reader, it
looks to me as though the file was not actually created with Adobe's PDF
Creator and was instead possibly created via MS Word or some other
third-party tool, or was converted improperly from an MS Word doc file.

FYI, the issue you describe is documented in KB article "Q323040 BUG: SQL Server
Full-Text Population by Using a Single-Threaded Filter DLL or a PDF Filter
DLL May Not Succeed" at
http://support.microsoft.com/default...;en-us;Q323040

Regards,
John

#5
Louie
07-11-2004, 07:03 PM

John,

I have tried another PDF (from Acrobat itself) and it worked. I think we
have finally located the source of the problem.

According to your explanation:
"... was converted improperly from a MS Word doc file."

So, is it true that if an MS Word document (or any other file) were properly
converted to PDF using third-party software, it would work?

The reason I am asking is that in my development environment, all PDFs
are provided by various sources; we don't generate the PDFs ourselves.
That means we need to handle PDFs created by software other than Acrobat.

I am going to do some tests on other PDFs as well, and I will let you
know the outcome.

Thanks again,
Louie



#6
John Kane
07-11-2004, 10:22 PM

You're welcome, Louie.
I cannot say whether the PDF file was improperly or properly converted from
MS Word to the PDF format (the header info reads "Microsoft Word - GAOG
Prospectus Rays 5 Mar working copy.doc"), but for some reason the Adobe PDF
IFilter was not able to recognize this as a proper PDF file. You might want
to talk to Adobe and ask them about this situation.

Either way, one thing you can do is open other problem PDF files with
Notepad or some other utility (filtdump.exe) and look for the *correct*
strings or output from filtdump. Yes, please do let me and others on this
newsgroup know what your research turns up!
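
As one way to act on this suggestion, the Python sketch below scans a PDF's raw bytes for the /Producer and /Creator entries of the document information dictionary, which often name the tool that generated the file. It is a rough illustration under stated assumptions: the function name find_pdf_tools is hypothetical, the sample bytes are invented, and the pattern only handles simple literal strings (no nested parentheses, hex strings, compressed object streams, or encrypted files).

```python
import re

# Matches "/Producer (...)" or "/Creator (...)" with a simple literal string.
INFO_ENTRY = re.compile(
    rb"/(Producer|Creator)\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL
)


def find_pdf_tools(data: bytes) -> dict:
    """Return the first /Producer and /Creator values found in the raw bytes."""
    found = {}
    for match in INFO_ENTRY.finditer(data):
        key = match.group(1).decode("ascii")
        # Keep only the first occurrence of each key.
        found.setdefault(key, match.group(2).decode("latin-1"))
    return found


# Invented sample bytes, for illustration only.
sample = b"<< /Creator (Hypothetical Word Export) /Producer (Example PDF Library 1.0) >>"
print(find_pdf_tools(sample))
```

Comparing these values between PDFs that index correctly and PDFs that do not may show whether the failures cluster around one generating tool.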

Regards,
John

#7
Louie
07-26-2004, 12:31 AM

Eventually, we decided to extract text from the PDF files and store the
text instead. Since we need to retrieve the PDF files after searching is
done, there is no point storing the actual files twice.


#8
John Kane
07-26-2004, 09:32 AM

Louie,
Thank you for the feedback on what your research turned up and your
solution! Since you're storing the text of the pdf files (and other file
types too) in SQL Server, can I assume you will store only a pointer to the
actual pdf files on disk for retrieval of the files when required?
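
The layout being asked about here (extracted text in the database, plus a pointer to the file on disk) can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module and a plain LIKE match as a stand-in for SQL Server and its full-text CONTAINS predicate, and the table name, column names, and sample row are all hypothetical.

```python
import sqlite3

# In-memory database standing in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pdf_text ("
    " file_id INTEGER PRIMARY KEY,"
    " file_path TEXT NOT NULL,"  # pointer to the PDF on disk
    " body TEXT NOT NULL)"       # text extracted from the PDF
)
conn.execute(
    "INSERT INTO pdf_text (file_id, file_path, body) VALUES (?, ?, ?)",
    (1, r"d:\test.pdf", "It is possible to search the extracted text."),
)


def find_paths(connection, keyword):
    """Return paths of files whose extracted text contains the keyword.

    LIKE wildcards (% and _) inside the keyword are not escaped here.
    """
    rows = connection.execute(
        "SELECT file_path FROM pdf_text WHERE body LIKE ? ORDER BY file_id",
        (f"%{keyword}%",),
    )
    return [path for (path,) in rows]


print(find_paths(conn, "possible"))
```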

Thanks again,
John
