![]() | |
#1
| |||
| |||
|
|
Hello, I'm struggling with text searches, both in functionality and speed. I have a table with an ID and a value, the value being textual. The problem is that some words can have different meanings (Casino: the movie, gambling hall, supermarket chain,...). To distinguish between these meanings, they used to clarify them in brackets. This obviously isn't the best way to do something like that, but the data is there and I have to work with it. At least, that's what I should do, but I don't know how to go about it. Some of the entries could be 1 dog 2 Rock festival (14/07) 3 bowl 4 blue (color) 5 magazine 6 Casino (gambling) 7 Casino (supermarket) 8 food 9 blue (feeling) 10 bark 11 Casino (movie) ... What I would like to know: - how many terms are there with multiple meanings? - which terms have how many different meanings? - what are those different meanings? - ... I don't know how to go about it, since I can only be sure by looking for the part of the value up until the opening bracket and compare that to every other entry. I don't know how to do this in one SQL statement, I think I'd have to loop over the total table, which would create a huge load. Does anybody know of a way to do such thing? |
#2
| |||
| |||
|
|
What dbms are you using? You may not be able to do what you want with SQL unless the dbms you use has sufficient string manipulation operations or allows one to create user-defined functions. I'm using MySQL at the moment, but something else might be used in the future (which needn't be a problem). By syntactically encoding information in a substring instead of representing it as a separate value, someone has taken simple queries and turned them into parsing exercises. There are real consequences to violating the information principle. I suggest you correct the logical schema first, and then your queries will become very easy. The problem is that I have an enormous amount of data like this (at least when it should be edited by hand), so I was really hoping for a more automated way of doing this (with still some performance in mind). |
#3
| |||
| |||
|
|
"Tim" <tim (AT) NOSPAM (DOT) invalid> wrote in message news:tnbbgvcg57rvattpj1fafj24cb38mprlbr (AT) 4ax (DOT) com... What dbms are you using? You may not be able to do what you want with SQL unless the dbms you use has sufficient string manipulation operations or allows one to create user-defined functions. I'm using MySQL at the moment, but something else might be used in the future (which needn't be a problem). By syntactically encoding information in a substring instead of representing it as a separate value, someone has taken simple queries and turned them into parsing exercises. There are real consequences to violating the information principle. I suggest you correct the logical schema first, and then your queries will become very easy. The problem is that I have an enormous amount of data like this (at least when it should be edited by hand), so I was really hoping for a more automated way of doing this (with still some performance in mind). From a quick perusal of the mysql documentation at http://www.mysql.com/documentation/m...rence.html#Str ing_functions, it appears you should find the following string functions useful: LOCATE, POSITION, INSTR, LEFT, RIGHT, SUBSTRING, MID, SUBSTRING, TRIM. You may also find IF and CASE/WHEN very useful. Consider automating the task with something like: UPDATE table SET baseword = CASE WHEN LOCATE( word, '(' ) < 1 THEN word ELSE RTRIM( LEFT( word, LOCATE( word, '(' ) - 1 ) ) END , qualifier = CASE WHEN LOCATE( word, '(' ) < 1 THEN NULL ELSE CASE WHEN LOCATE( word, ')', LOCATE( word, '(' ) ) < 1 THEN TRIM( SUBSTRING( word, LOCATE( word, '(' ) + 1 ) ) ELSE TRIM( SUBSTRING( word , LOCATE( word, '(' ) + 1 , LOCATE( word, ')', LOCATE( word, '(' ) ) - LOCATE( word, '(' ) - 1 ) ) END END ; |
#4
| |||
| |||
|
|
Bob Badour wrote: "Tim" <tim (AT) NOSPAM (DOT) invalid> wrote in message news:tnbbgvcg57rvattpj1fafj24cb38mprlbr (AT) 4ax (DOT) com... What dbms are you using? You may not be able to do what you want with SQL unless the dbms you use has sufficient string manipulation operations or allows one to create user-defined functions. I'm using MySQL at the moment, but something else might be used in the future (which needn't be a problem). By syntactically encoding information in a substring instead of representing it as a separate value, someone has taken simple queries and turned them into parsing exercises. There are real consequences to violating the information principle. I suggest you correct the logical schema first, and then your queries will become very easy. The problem is that I have an enormous amount of data like this (at least when it should be edited by hand), so I was really hoping for a more automated way of doing this (with still some performance in mind). From a quick perusal of the mysql documentation at http://www.mysql.com/documentation/m...rence.html#Str ing_functions, it appears you should find the following string functions useful: LOCATE, POSITION, INSTR, LEFT, RIGHT, SUBSTRING, MID, SUBSTRING, TRIM. You may also find IF and CASE/WHEN very useful. Consider automating the task with something like: UPDATE table SET baseword = CASE WHEN LOCATE( word, '(' ) < 1 THEN word ELSE RTRIM( LEFT( word, LOCATE( word, '(' ) - 1 ) ) END , qualifier = CASE WHEN LOCATE( word, '(' ) < 1 THEN NULL ELSE CASE WHEN LOCATE( word, ')', LOCATE( word, '(' ) ) < 1 THEN TRIM( SUBSTRING( word, LOCATE( word, '(' ) + 1 ) ) ELSE TRIM( SUBSTRING( word , LOCATE( word, '(' ) + 1 , LOCATE( word, ')', LOCATE( word, '(' ) ) - LOCATE( word, '(' ) - 1 ) ) END END ; But I think, Bob, that you first suggestion is best. He'll really be better off parsing the fields once, converting them to appropriately designed tables. Especially if there really is "an enormous amount" of data, whatever that means to him. (I once thought a Megabyte was enormous.) |
#5
| |||
| |||
|
|
"Tim" <tim (AT) NOSPAM (DOT) invalid> wrote in message news:u30bgvklgbt4tv8u309vu558t25kp66mcn (AT) 4ax (DOT) com... Hello, I'm struggling with text searches, both in functionality and speed. I have a table with an ID and a value, the value being textual. The problem is that some words can have different meanings (Casino: the movie, gambling hall, supermarket chain,...). To distinguish between these meanings, they used to clarify them in brackets. This obviously isn't the best way to do something like that, but the data is there and I have to work with it. At least, that's what I should do, but I don't know how to go about it. Some of the entries could be 1 dog 2 Rock festival (14/07) 3 bowl 4 blue (color) 5 magazine 6 Casino (gambling) 7 Casino (supermarket) 8 food 9 blue (feeling) 10 bark 11 Casino (movie) ... |
|
What I would like to know: - how many terms are there with multiple meanings? COUNT LEXICON WITH A2 - which terms have how many different meanings? LIST LEXICON CNT2 |
|
- what are those different meanings? see listing above - ... I don't know how to go about it, since I can only be sure by looking for the part of the value up until the opening bracket and compare that to every other entry. I don't know how to do this in one SQL statement, I think I'd have to loop over the total table, which would create a huge load. Does anybody know of a way to do such thing? |
#6
| |||
| |||
|
|
"Tim" <tim (AT) NOSPAM (DOT) invalid> wrote in message news:u30bgvklgbt4tv8u309vu558t25kp66mcn (AT) 4ax (DOT) com... Hello, I'm struggling with text searches, both in functionality and speed. I have a table with an ID and a value, the value being textual. The problem is that some words can have different meanings (Casino: the movie, gambling hall, supermarket chain,...). To distinguish between these meanings, they used to clarify them in brackets. This obviously isn't the best way to do something like that, but the data is there and I have to work with it. At least, that's what I should do, but I don't know how to go about it. Some of the entries could be 1 dog 2 Rock festival (14/07) 3 bowl 4 blue (color) 5 magazine 6 Casino (gambling) 7 Casino (supermarket) 8 food 9 blue (feeling) 10 bark 11 Casino (movie) ... If I was faced with this task on a Pick system I would do the following... Let's call the incoming table 'Text' and assume it's in a *nix or dos folder or directory called 'dirfolder'. CREATE-FILE LEXICON 1 101 CREATE-INDEX LEXICON A1 ED BP GETTEXT .I 001 OPEN '{dos:\ or *nix:/}dirfolder' TO OSFILE ELSE DEBUG 002 READ TEXTSTR FROM OSFILE,'Text' ELSE DEBUG 003 OPEN 'LEXICON' TO LEXICON ELSE DEBUG 004 ROOT 'LEXICON','A1' TO PHRASEINDEX ELSE DEBUG 005 NEXTITEMID=1 006 FOR X=DCOUNT(TEXTSTR,@AM) TO 1 STEP -1 007 LN=TEXSTR<X> ;* Get a line of text, something like "2 Rock festival (14/07)" 008 ID=FIELD(LN,' ',1) ;* Extract the ID ('2') 009 WORDS=LN[COL1()+1,99] ;* Get the text without the ID (assume max len = 99) 010 OPENBRS=COUNT(WORDS,'(') ;* How many open brackets are there? 011 IF OPENBRS AND WORDS[LEN(WORDS),1]=')' THEN ;* If there is a 'qualifier'... 012 OPENBR=INDEX(WORDS,'(',OPENBRS) ;* Get the position of the last open bracket 013 KEYWORDS=WORDS[1,OPENBR-2] ;* In the example this would be the phrase 'Rock festival' 014 QUALIFIER=LN[OPENBR+1,99] ;* In the example this would be '14/07)' 015 QUALIFIER=QUALIFIER[1,LEN(QUALIFIER)-1] ;* '14/07' 016 END ELSE 017 KEYWORDS=WORDS 018 QUALIFIER='' 019 END 020 KEY('R',PHRASEINDEX,KEYWORDS,ITEMID) THEN ;* Get the key for the item containing the phrase ('Rock festival') 021 READU REC FROM LEXICON,ITEMID ELSE DEBUG ;* Read the item 022 LOCATE QUALIFIER IN REC<2>,1 BY 'AL' SETTING POS THEN ;* Look for '14/07' in the list of qualifiers in attribute 2 023 REC<3,POS,-1>=ID ;* Add the ID ('2') to the list of IDs for the combination of phrase ('Rock festival') & qualifier ('14/07') 024 END ELSE 025 INS QUALIFIER BEFORE REC<2,POS> ;* Add the 'qualifier' ('14/07') to the list of qualifiers for the phrase 026 INS ID BEFORE REC<3,POS> ;* Record the ID ('2') 027 END 028 END ELSE 029 ITEMID=NEXTITEMID 030 NEXTITEMID+=1 031 REC=KEYWORDS 032 REC<2>=QUALIFIER 033 REC<3>=ID 034 END 035 WRITE REC ON LEXICON,ITEMID 036 NEXT X 037 STOP After running the above program (untested as it is), we would have a file like this: 1 001 dog 002 003 1 2 001 Rock festival 002 14/07 003 2 3 001 bowl 002 003 3 4 001 blue 002 colour]feeling 003 4]9 5 001 magazine 002 003 5 6 001 Casino 002 gambling]supermarket]movie 003 6]7]11 7 001 food 002 003 8 8 001 bark 002 003 10 So that if we were to simply list the file we would see something like this: File name: LEXICON ItemID...Phrase (A1).....Qualifier (A2)...Text ID (A3) 1........dog..............................1 2........Rock festival...14/07............2 3........bowl.............................3 4........blue............color............4 .........................feeling..........9 5........magazine.........................5 6........Casino..........gambling.........6 .........................supermarket......7 .........................movie............11 7........food.............................8 8........bark.............................10 ... What I would like to know: - how many terms are there with multiple meanings? COUNT LEXICON WITH A2 - which terms have how many different meanings? LIST LEXICON CNT2 where CNT2 is a dictionary item with "F;2;(S;'1';'0');S" in attribute 8 - what are those different meanings? see listing above - ... I don't know how to go about it, since I can only be sure by looking for the part of the value up until the opening bracket and compare that to every other entry. I don't know how to do this in one SQL statement, I think I'd have to loop over the total table, which would create a huge load. Does anybody know of a way to do such thing? If you'd like to give Pick a try I can supply links to free downloads for evaluation purposes. Regards Mike. |
#7
| |||
| |||
|
|
Mike Preece wrote: "Tim" <tim (AT) NOSPAM (DOT) invalid> wrote in message news:u30bgvklgbt4tv8u309vu558t25kp66mcn (AT) 4ax (DOT) com... Hello, I'm struggling with text searches, both in functionality and speed. I have a table with an ID and a value, the value being textual. The problem is that some words can have different meanings (Casino: the movie, gambling hall, supermarket chain,...). To distinguish between these meanings, they used to clarify them in brackets. This obviously isn't the best way to do something like that, but the data is there and I have to work with it. At least, that's what I should do, but I don't know how to go about it. |
|
If you prefer sticking with a traditional RDBMS.... The quickest retrieval on text data is based on the use of fully inverted data, that is, indexes on each word that appears in the text being indexed. (With an exception being made for the most common words in the language. For example words such as "the", "a", "is" are not indexed.) |
#8
| |||
| |||
|
|
"Jerry Gitomer" <jgitomer (AT) erols (DOT) com> wrote in message news:be6qng$mau$1 (AT) bob (DOT) news.rcn.net... Mike Preece wrote: "Tim" <tim (AT) NOSPAM (DOT) invalid> wrote in message news:u30bgvklgbt4tv8u309vu558t25kp66mcn (AT) 4ax (DOT) com... Hello, I'm struggling with text searches, both in functionality and speed. I have a table with an ID and a value, the value being textual. The problem is that some words can have different meanings (Casino: the movie, gambling hall, supermarket chain,...). To distinguish between these meanings, they used to clarify them in brackets. This obviously isn't the best way to do something like that, but the data is there and I have to work with it. At least, that's what I should do, but I don't know how to go about it. [ridiculously complex pick example snipped] |
| If you prefer sticking with a traditional RDBMS.... The quickest retrieval on text data is based on the use of fully inverted data, that is, indexes on each word that appears in the text being indexed. (With an exception being made for the most common words in the language. For example words such as "the", "a", "is" are not indexed.) Did it ever occur to you that Tim's problem relates to how to express what he wants? Until he can express what he wants, the availability of indexing strategies are useless to him. |
#9
| |||
| |||
|
|
Bob Badour wrote: "Jerry Gitomer" <jgitomer (AT) erols (DOT) com> wrote in message news:be6qng$mau$1 (AT) bob (DOT) news.rcn.net... Mike Preece wrote: "Tim" <tim (AT) NOSPAM (DOT) invalid> wrote in message news:u30bgvklgbt4tv8u309vu558t25kp66mcn (AT) 4ax (DOT) com... Hello, I'm struggling with text searches, both in functionality and speed. I have a table with an ID and a value, the value being textual. The problem is that some words can have different meanings (Casino: the movie, gambling hall, supermarket chain,...). To distinguish between these meanings, they used to clarify them in brackets. This obviously isn't the best way to do something like that, but the data is there and I have to work with it. At least, that's what I should do, but I don't know how to go about it. [ridiculously complex pick example snipped] If you prefer sticking with a traditional RDBMS.... The quickest retrieval on text data is based on the use of fully inverted data, that is, indexes on each word that appears in the text being indexed. (With an exception being made for the most common words in the language. For example words such as "the", "a", "is" are not indexed.) Did it ever occur to you that Tim's problem relates to how to express what he wants? Until he can express what he wants, the availability of indexing strategies are useless to him. Either you failed to read the paragraph in my response where I suggested the use of a commercially available RDBMS with a text search option that includes conditional search operations that will easily satisfy Tim's stated requirments or I failed to make myself clear. |
#10
| |||
| |||
|
|
"Bob Badour" <bbadour (AT) golden (DOT) net> wrote "Jerry Gitomer" <jgitomer (AT) erols (DOT) com> wrote in message news:be6qng$mau$1 (AT) bob (DOT) news.rcn.net... Mike Preece wrote: "Tim" <tim (AT) NOSPAM (DOT) invalid> wrote in message news:u30bgvklgbt4tv8u309vu558t25kp66mcn (AT) 4ax (DOT) com... Hello, I'm struggling with text searches, both in functionality and speed. I have a table with an ID and a value, the value being textual. The problem is that some words can have different meanings (Casino: the movie, gambling hall, supermarket chain,...). To distinguish between these meanings, they used to clarify them in brackets. This obviously isn't the best way to do something like that, but the data is there and I have to work with it. At least, that's what I should do, but I don't know how to go about it. [ridiculously complex pick example snipped] Or - to put it another way - full parsing and data transformation code supplied. |
|
If you prefer sticking with a traditional RDBMS.... The quickest retrieval on text data is based on the use of fully inverted data, that is, indexes on each word that appears in the text being indexed. (With an exception being made for the most common words in the language. For example words such as "the", "a", "is" are not indexed.) Did it ever occur to you that Tim's problem relates to how to express what he wants? Until he can express what he wants, the availability of indexing strategies are useless to him. So supply a solution Bob. Your inability to .... proves .... complex ... weak... blah ... blah... insult... arrogant statement... drivel... etc. |
![]() |
| Thread Tools | |
| Display Modes | |
| |