#2
OK, I've got a real brain twister. I have unstructured text in the form of multiple press releases, and I need to parse out every place within a press release where there is a contact name and title. For example, here are sections of 3 press releases:

Example #1. bla bla bla, announced today the appointment of Joe Smith to the position of Vice President
Example #2. bla bla bla, today announced that Jane Smith has been appointed to the position of CFO
Example #3. bla bla bla, announces the appointment of Crag Smith as Director

There are hundreds of possible combinations like the ones shown above. So what I am thinking is that I need to define rules and use the Middle and Position functions of FileMaker to parse out what is in between the rules.

To determine the rules, I need a rules database that will house 2 rules for each type of press release: an "in front of what I want parsed" rule and a "behind what I want parsed" rule.

Using press release example #1, I would have a database with 3 fields. Field #1 holds the press release text, Field #2 is a pattern count for Rule #1 Front, and Field #3 is a pattern count for Rule #1 Back.

Field #1 = Text of press release
Field #2 = PatternCount(Field #1, "announced today the appointment of")
Field #3 = PatternCount(Field #1, "to the position of")

If both calculations = 1, then this "rule" applies and I would use Middle and Position calculations to locate those patterns and parse what I need. If they don't both = 1, I would go to the next rule in the database and see if that rule applies.

Am I thinking about this the right way, or is there an easier method?
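For readers who want to try the front/back rule idea before building it in FileMaker, here is a minimal Python sketch of the same logic. It is an illustration only, not FileMaker syntax: the `Rule` type, the `RULES` list and the `extract` function are hypothetical names, the three rules are simply the wordings from the examples above, and the title is assumed to run to the end of the snippet, which a real press release would not guarantee.

```python
from typing import List, NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    front: str  # text that appears just before the name
    back: str   # text that appears between the name and the title


# Hypothetical rules table: one row per wording seen in the three examples.
RULES: List[Rule] = [
    Rule("announced today the appointment of", "to the position of"),
    Rule("today announced that", "has been appointed to the position of"),
    Rule("announces the appointment of", " as "),
]


def extract(text: str) -> Optional[Tuple[str, str]]:
    """Return (name, title) using the first rule whose markers each occur once."""
    for rule in RULES:
        # Same test as checking that both PatternCount results equal 1.
        if text.count(rule.front) != 1 or text.count(rule.back) != 1:
            continue
        start = text.find(rule.front) + len(rule.front)
        end = text.find(rule.back, start)
        if end == -1:
            continue  # the back marker sits before the front marker
        name = text[start:end].strip()
        # Assumption: the title is everything after the back marker.
        title = text[end + len(rule.back):].strip()
        return name, title
    return None  # no rule applied; a new rule would be needed
```

Trying rules in order and stopping at the first match mirrors the "go to the next rule" step described above. Short markers such as " as " are likely to occur more than once in a full release, so a rule like the third one would probably need a longer back marker in practice.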
#3
In article <1172515896.921719.10... (AT) p10g2000cwp (DOT) googlegroups.com>, squeed2... (AT) yahoo (DOT) com wrote:
> Am I thinking about this the right way or is there an easier method ?

It looks about right, but in reality few press releases will be worded EXACTLY the same.

Unless you're getting hundreds of press releases a day, you'll probably find it easier to simply read them and manually copy/paste the names and titles across to FileMaker records.

Helpful Harry
Hopefully helping harassed humans happily handle handiwork hardships ;o)
#4
On Feb 26, 6:39 pm, Helpful Harry <helpful_ha... (AT) nom (DOT) de.plume.com> wrote:
> Unless you're getting hundreds of press releases a day, you'll probably find it easier to simply read them and manually copy/paste the names and titles across to FileMaker records.

If there is a huge number of press releases (hundreds...?), you could first run a test to pull out any words of more than one letter (so as not to catch "I") that start with a capital letter. It will give you some garbage (maybe a lot), but it might help. Maybe only match when there are at least two capitalised words next to each other. Can't say I know the exact syntax for you.

Lara
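Since the exact syntax was left open, here is one rough way the capitalised-word test could look, written in Python rather than FileMaker. It is a sketch under assumptions: `CAP_WORD`, `CAP_RUN` and `capitalised_runs` are made-up names, and a "word" is taken to be one capital letter followed by one or more lowercase letters, so all-caps titles such as "CFO" are missed along with "I".

```python
import re
from typing import List

# One capital letter followed by at least one lowercase letter.
# This skips the single-letter "I" and also all-caps words like "CFO".
CAP_WORD = r"[A-Z][a-z]+"

# Two or more such words in a row, separated by single spaces.
CAP_RUN = re.compile(rf"\b{CAP_WORD}(?: {CAP_WORD})+\b")


def capitalised_runs(text: str) -> List[str]:
    """Return every run of two or more adjacent capitalised words."""
    return CAP_RUN.findall(text)
```

On example #1 from the original post this picks up both "Joe Smith" and "Vice President", so names and titles come out mixed together, along with company names and other garbage in a real release. It is a way to shortlist candidates for a human to check, not a finished parser.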