dbTalk Databases Forums  

Parsing Unstructured Text

Discuss Parsing Unstructured Text in the comp.databases.filemaker forum.



#1
squeed2000@yahoo.com
Parsing Unstructured Text - 02-26-2007, 12:51 PM

OK, I've got a real brain twister. I have unstructured text in the
form of multiple press releases. I need to parse out anywhere within
the press release where there is a contact name and title.

So for example, here are sections of 3 press releases.

Example #1.

bla bla bla, announced today the appointment of Joe Smith to the
position of Vice President

Example #2.

bla bla bla, today announced that Jane Smith has been appointed to the
position of CFO

Example #3.

bla bla bla, announces the appointment of Crag Smith as Director

There are hundreds of possible wordings like the ones shown above. So
what I am thinking is that I need to define rules and use the Middle
and Position functions of FileMaker to parse out what is in between
the rules.

So to hold the rules I need a rules database that will house 2 rules
for each type of press release. I'll need an "in front of what I want
parsed" rule and a "behind what I want parsed" rule.

Using press release example #1, I would have a database that has 3
fields: Field #1 holds the press release text, Field #2 is a
PatternCount for Rule #1 Front, and Field #3 is a PatternCount for
Rule #1 Back.

Field #1 = Text of press release
Field #2 = PatternCount(Field #1,"announced today the appointment of")
Field #3 = PatternCount(Field #1,"to the position of")

If both calculations = 1 then this "rule" would apply and I would use
Middle and Position calculations to locate those patterns and parse out
what I need. If they are not both 1, then I would go to the next rule
in the database and see if that rule applies.
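
The rule lookup described above can be sketched roughly as follows. This is a Python illustration of the logic, not FileMaker calculation syntax. Only the first pair of markers comes from the post; the other two pairs, the names `RULES` and `extract_contact`, and the assumption that the title runs to the end of the text are hypothetical.

```python
# Hypothetical rule table: (front marker, back marker) pairs.
# Only the first pair is from the post; the others are guesses
# based on examples #2 and #3.
RULES = [
    ("announced today the appointment of", "to the position of"),
    ("today announced that", "has been appointed to the position of"),
    ("announces the appointment of", " as "),
]


def extract_contact(text):
    """Return (name, title) from the first rule whose markers each occur once."""
    # Collapse line wraps and repeated spaces so a marker split
    # across two lines still matches.
    text = " ".join(text.split())
    for front, back in RULES:
        # Rough analogue of checking PatternCount = 1 for both markers.
        if text.count(front) != 1 or text.count(back) != 1:
            continue
        # Rough analogue of Position: where the name starts and ends.
        name_start = text.find(front) + len(front)
        back_start = text.find(back, name_start)
        if back_start == -1:
            continue  # the back marker sits before the front marker
        # Rough analogue of Middle: the text between the two markers.
        name = text[name_start:back_start].strip()
        # Assumption: the title is everything after the back marker.
        title = text[back_start + len(back):].strip()
        return name, title
    return None  # no rule applied
```

A short back marker such as " as " will usually occur more than once in a full press release, so that rule would simply not apply there; in practice the text probably needs to be cut down to the relevant sentence first.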

Am I thinking about this the right way, or is there an easier method?

-S


#2
Helpful Harry
Re: Parsing Unstructured Text - 02-26-2007, 05:39 PM






It looks about right, but in reality few press releases will be worded
EXACTLY the same. Unless you're getting hundreds of press releases a
day, it's probably easier to simply read them and manually copy/paste
the names and titles across to FileMaker records.


Helpful Harry
Hopefully helping harassed humans happily handle handiwork hardships ;o)


#3
Carpeflora
Re: Parsing Unstructured Text - 02-26-2007, 09:25 PM



If there is a huge number of press releases (hundreds...?), you could
first run a test to pull out any multi-letter words (so as not to
catch "I") that start with a capital letter. It will give you some
garbage (maybe a lot), but it might help. Maybe only match where at
least two capitalized words sit next to each other. I can't say I know
the exact syntax for you.
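
One way to express the "two capitalized words next to each other" test is a regular expression. The sketch below is Python rather than FileMaker, and the names `CAPITALIZED_RUN` and `capitalized_runs` are made up for illustration.

```python
import re

# Two or more adjacent words, each a capital letter followed by at
# least one lowercase letter, so a lone "I" never matches.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def capitalized_runs(text):
    """Return every run of two or more adjacent capitalized words."""
    return CAPITALIZED_RUN.findall(text)
```

On example #1 this picks up both "Joe Smith" and "Vice President". As expected it also returns garbage such as company names, and it skips all-caps titles like "CFO", so the results still need a human look.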

Lara



#4
Amazing Iceman
Re: Parsing Unstructured Text - 02-28-2007, 12:18 AM



First, is this unstructured text in a raw text file, or is it in HTML,
Word, etc.?
If it's not a raw text file, then you may be able to search for an
end-of-paragraph character, carriage return, or similar text formatting
codes.

Always make sure to disable "Word Wrap" on the Text Editor you use to view
the file.

If it's a raw file then look for a common denominator that separates each
press release.

For example, if the press releases are separated by a blank line, then
that's what you have to look for in your rules.
If there's no common denominator, then create one. It may be time-consuming,
but there's no other choice. You could just insert a string like "#^$&%*"
between press releases (you could use cut-and-paste to do it very quickly),
and set up your rules accordingly. Then let FMP do the rest automatically.
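
As a rough illustration of the separator idea, here is a Python sketch (not FileMaker) that splits one text dump into individual press releases. The delimiter string is the one suggested above; the function name `split_releases` is hypothetical.

```python
DELIMITER = "#^$&%*"  # the separator string suggested in this post


def split_releases(raw_text, delimiter=DELIMITER):
    """Split a text dump into press releases, dropping empty pieces."""
    pieces = (piece.strip() for piece in raw_text.split(delimiter))
    return [piece for piece in pieces if piece]
```

Each returned piece could then be imported as one record and run through the parsing rules.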

At least it's a lot faster than having to cut and paste each press release
into FMP.

Good Luck,

-Amazing Iceman
#5
Grip
Re: Parsing Unstructured Text - 02-28-2007, 09:15 AM



Tis true. There are some things the human computer is better able to
process than computers.

You could also build a custom function that searches the text for the
names in your database rather than for every permutation of "position
of" / "appointment of" etc.
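
The name-lookup idea might look something like this. It is a Python sketch of the logic rather than an actual FileMaker custom function, and `find_known_names` and its parameters are hypothetical.

```python
def find_known_names(text, known_names):
    """Return the known names that appear in the text.

    Matching ignores case and line wraps. It is a plain substring
    test, so "Joe Smith" would also match inside "Joe Smithson".
    """
    normalized = " ".join(text.split()).lower()
    return [
        name
        for name in known_names
        if " ".join(name.split()).lower() in normalized
    ]
```

This only finds people who are already in the database, so it suits tracking known contacts better than discovering new ones.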

G


