Splitting text into words and punctuation

comp.databases.ibm-db2


#1
Paul Vernon
Splitting text into words and punctuation - 08-19-2003, 04:03 PM

What would be a neat (i.e. reasonably efficient) way of splitting paragraphs of text?

E.g. splitting

VALUES ('I would, if possible, like to see a sentence (such as this one) be split.
Hopefully into words, punctuation and tokens.')

Into

SEQ 2
----------- -------------------
1 I
2 would
3 ,
4 if
5 possible
6 ,
7 like
8 to
9 see
10 a
11 sentence
12 (
13 such
14 as
15 this
16 one
17 )
18 be
19 split
20 .
21 Hopefully
22 into
23 words
24 ,
25 punctuation
26 and
27 tokens
28 .




Regards
Paul Vernon
Business Intelligence, IBM Global Services




#2
Rhino
Re: Splitting text into words and punctuation - 08-19-2003, 05:31 PM

"Paul Vernon" <paul.vernon (AT) ukk (DOT) ibmm.comm> wrote

Quote:
What would be a neat (i.e. reasonably efficient) way of splitting
paragraphs of text?

[snip]

If you don't mind a small amount of programming, Java has a StringTokenizer
class that would do the job very nicely in only a few lines of code. JDBC
allows your Java program to access DB2 data quite simply.

Another possibility that would have even less programming in it would be to
write a Java stored procedure based on the StringTokenizer class. You could
create and test the stored procedure using the Stored Procedure Builder in
DB2; its input would be the value that needs to be parsed and its output
would be the words/tokens/punctuation.
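
As a rough sketch of the StringTokenizer idea: the three-argument constructor with returnDelims set to true hands the delimiters back as one-character tokens, so punctuation can be kept while whitespace is dropped. The class name TextSplitter, the method name split and the delimiter set are assumptions made for illustration; reading the text from DB2 through JDBC is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class, named here only for illustration.
final class TextSplitter {
    // Punctuation kept as separate tokens: the same four characters used in this thread.
    private static final String PUNCTUATION = ".,()";
    // Whitespace separates tokens but is never returned (an assumed set).
    private static final String WHITESPACE = " \t\r\n";

    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        // returnDelims = true: each delimiter comes back as a one-character token.
        StringTokenizer st = new StringTokenizer(text, WHITESPACE + PUNCTUATION, true);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            // Skip whitespace delimiters; keep words and punctuation.
            if (WHITESPACE.contains(token)) {
                continue;
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = split(
            "I would, if possible, like to see a sentence (such as this one) be split.\n"
            + "Hopefully into words, punctuation and tokens.");
        // Print a sequence number next to each token, as in the requested output.
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println((i + 1) + " " + tokens.get(i));
        }
    }
}
```

Inside a Java stored procedure the same loop could write each token with its sequence number to a result table instead of printing it.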

Rhino




#3
Tokunaga T.
Re: Splitting text into words and punctuation - 08-20-2003, 05:59 AM

"Rhino" <rhino1 (AT) NOSPAM (DOT) sympatico.ca> wrote in message
Quote:
"Paul Vernon" <paul.vernon (AT) ukk (DOT) ibmm.comm> wrote in message
news:bhu3jp$13ga$1 (AT) gazette (DOT) almaden.ibm.com...
What would be a neat (i.e. reasonably efficient) way of splitting
paragraphs of text?

E.g. splitting

........

If you don't mind a small amount of programming, Java has a StringTokenizer
class that would do the job very nicely in only a few lines of code. JDBC
allows your Java program to access DB2 data quite simply.

Or, you can write a SQL Query, like this:

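WITH
Source (text) AS (
  VALUES ('I would, if possible, like to see a sentence (such as this one) be split.
Hopefully into words, punctuation and tokens.')
)
,
Splitting (Seq, token, rest) AS (
  -- Seed row: no token yet, the whole text is still to be split
  SELECT 0, VARCHAR('', 50), LTRIM(text)
  FROM Source
  UNION ALL
  -- Each step peels one token off the front of rest
  SELECT pre.Seq + 1
       , VARCHAR(SUBSTR(pre.rest, 1, next_pos - 1), 50)
       , LTRIM(SUBSTR(pre.rest || ' ', next_pos))
  FROM (SELECT pre.Seq
             , CASE
                 -- A leading punctuation character is a one-character token
                 WHEN SUBSTR(pre.rest, 1, 1) IN ('.', ',', '(', ')') THEN 2
                 -- Otherwise the token ends at the first blank or punctuation;
                 -- the appended blank ensures a final word with no trailing
                 -- punctuation still finds an end position
                 ELSE POSSTR(TRANSLATE(pre.rest || ' ', ' ', '.,()'), ' ')
               END AS next_pos
             , pre.rest
        FROM Splitting pre
        -- Seq < 1000 caps the recursion depth
        WHERE pre.Seq < 1000
          AND pre.rest <> ''
       ) AS pre
)

SELECT Seq, token
FROM Splitting
WHERE Seq > 0
ORDER BY Seq
;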
---------------------------------------------------

SEQ TOKEN
----------- --------------------------------------------------
1 I
2 would
3 ,
4 if
5 possible
6 ,
7 like
8 to
9 see
10 a
11 sentence
12 (
13 such
14 as
15 this
16 one
17 )
18 be
19 split
20 .
21 Hopefully
22 into
23 words
24 ,
25 punctuation
26 and
27 tokens
28 .

28 record(s) selected.

I don't know how efficient it is.
In any case, it should not be difficult to turn this example into a SQL table UDF.


#4
Paul Vernon
Re: Splitting text into words and punctuation - 08-27-2003, 06:09 AM

"Tokunaga T." <tonkuma (AT) jp (DOT) ibm.com> wrote

[snip]
Quote:
Or, you can write a SQL Query, like this:

[snip]

I don't know how efficient it is.
In any case, it should not be difficult to turn this example into a SQL table UDF.

Thanks for your help, Tokunaga. I had got something similar, but it was not quite
working.

However, the query above hits an efficiency problem: it does not scale past sentences of
a few hundred bytes.
For the moment I've gone with this compound SQL in a trigger, and forgone being able
to practically use a VIEW to do the split.

Regards
Paul Vernon
Business Intelligence, IBM Global Services

BEGIN ATOMIC

  -- n: next sequence number, i: start of the pending word, j: scan position
  DECLARE n INTEGER DEFAULT 1;
  DECLARE i INTEGER DEFAULT 1;
  DECLARE j INTEGER DEFAULT 1;
  DECLARE l INTEGER;
  DECLARE c CHAR(1);

  -- Scan one position past the trimmed text so the last word is flushed
  SET l = LENGTH(RTRIM(Plain_Text)) + 1;

  main_loop:
  WHILE j <= l
  DO

    SET c = SUBSTR(Plain_Text, j, 1);

    IF c IN (' ', '.', ',', '(', ')')
    THEN
      -- A blank or punctuation character ends the pending word, if any
      IF j > i
      THEN
        INSERT INTO PAGE_EXPLODE (SEQ, WORD)
          VALUES (n, SUBSTR(Plain_Text, i, j - i));
        SET n = n + 1;
      END IF;
      -- Punctuation is stored as a token of its own; blanks are dropped
      IF c <> ' '
      THEN
        INSERT INTO PAGE_EXPLODE (SEQ, WORD)
          VALUES (n, c);
        SET n = n + 1;
      END IF;
      SET i = j + 1;
    END IF;

    SET j = j + 1;

  END WHILE main_loop;

END



