dbTalk Databases Forums  

Finding a Range of Unicode Characters

comp.databases.filemaker

#1
Cecil Bankston
Finding a Range of Unicode Characters - 11-28-2005, 10:19 PM






When I imported data into FM from another database program, some of the
accented characters were changed into other ASCII or Unicode symbols.
Is there any way I can do a find in a field to locate all records with
any characters or symbols above the ASCII or Unicode range of standard
English letters and punctuation marks without having to do separate
finds for each individual symbol?
--
Cecil N. Bankston
Baton Rouge, LA
USA

#2
Bill Marriott
Re: Finding a Range of Unicode Characters - 11-29-2005, 02:26 AM






Did you select the appropriate option in the "Character set:" drop-down menu
during import? Did you try any of the other options to see if they helped?
Are there other export options from the source database? Re-importing the
data is likely the easiest way to correct this, if it is possible.

Edit-->Find/Replace will require one operation per substitution.

Do you have a complete list of the incorrect/correct symbols? If so, you
could write a calculation:

Substitute ( yourField ;
[ "a" ; "A" ] ;
[ "b" ; "B" ] ;
[ "c" ; "C" ] )

Where the lowercase letter represents the incorrect symbol and the uppercase
letter represents the correct one. Your version would have a lot more than
three pairs, however. You would have to do this for each and every field
that could contain the incorrect characters.

(You implement this as a calculated field mirroring the imported field; or
you apply this as an auto-enter rule during import; or as a Replace with
calculated result using the Replace Field contents command.)
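
For anyone doing the clean-up on exported text rather than inside FileMaker, the same pair-wise substitution can be sketched in Python. This is only an illustration, not FileMaker syntax: the name fix_text is made up, and the pairs are the same placeholder letters as above, not real symbol mappings.

```python
# Placeholder wrong -> right pairs (hypothetical, same letters as the calc above).
# A real table would list each incorrect symbol and its intended character.
REPLACEMENTS = {"a": "A", "b": "B", "c": "C"}

# str.maketrans builds a one-character-to-one-character translation table.
TABLE = str.maketrans(REPLACEMENTS)


def fix_text(value):
    """Return value with every character listed in REPLACEMENTS swapped."""
    return value.translate(TABLE)


print(fix_text("abcabc xyz"))
```

Note that translate() maps every character in a single pass, so a replacement character is never itself replaced by a later pair.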

Unfortunately, if the data is incorrect in the source file I don't know of
another way to correct this.

Bill



#3
Cecil Bankston
Re: Finding a Range of Unicode Characters - 11-29-2005, 02:53 PM



Thanks for the reply.

Bill Marriott wrote:

Quote:
Did you select the appropriate option in the "Character set:" drop-down menu
during import? Did you try any of the other options to see if they helped?
Are there other export options from the source database? Re-importing the
data is likely the easiest way to correct this, if it is possible.

The data was imported years ago from a Superbase file on an Amiga
computer, so re-importing is not an option.

Quote:
Edit-->Find/Replace will require one operation per substitution.
Do you have a complete list of the incorrect/correct symbols?

Unfortunately, no.

Quote:
If so, you could write a calculation:
Substitute ( yourField ;
[ "a" ; "A" ] ;
[ "b" ; "B" ] ;
[ "c" ; "C" ] )
Where the lowercase letter represents the incorrect symbol and the uppercase
letter represents the correct one. Your version would have a lot more than
three pairs, however. You would have to do this for each and every field
that could contain the incorrect characters.

The database is large, so I recognize the incorrect characters only when
I happen to view a particular record that contains them. At that point
I can generally tell or find out what the correct character is supposed
to be and correct it manually. The problem is finding the records that
need correcting without having to browse through thousands of records.
That is why I hoped there was a means of finding any characters or
symbols above the ASCII or Unicode range of standard English letters and
punctuation marks. That found set would be much smaller than the entire
file and would include all the erroneous characters.

Quote:
(You implement this as a calculated field mirroring the imported field; or
you apply this as an auto-enter rule during import; or as a Replace with
calculated result using the Replace Field contents command.)
Unfortunately, if the data is incorrect in the source file I don't know of
another way to correct this.

The source data was correct. I expect the translation from AmigaDOS to
Windows was the problem.

--
Cecil N. Bankston
Baton Rouge, LA
USA


#4
42
Re: Finding a Range of Unicode Characters - 11-29-2005, 03:20 PM




It should be possible to attack the problem by brute force defining an
allowed range, and then exclude that...

e.g.

Define a calculation:


Substitute ( yourField ;
[ "a" ; "" ] ;
[ "b" ; "" ] ;
[ "c" ; "" ] ;
[ "d" ; "" ] ;
....
[ "z" ; "" ] ;

[ "A" ; "" ] ;
[ "B" ; "" ] ;
....
[ "Z" ; "" ] ;
[ "0" ; "" ] ;
....
[ "9" ; "" ] ;

then
comma, period, question mark, ampersand, percent, colon, semicolon,
dollar sign, asterisk, parenthesis (l+r), brackets (l+r), braces (l+r),
angle brackets (l+r), plus, minus, equals, exclamation, at, caret,
tilde, "space", "tab", "carriage return", slash, backslash, pound,
underscore, quote, apostrophe, left-apostrophe, pipe...

All told I'd wager there are only around 80 characters.
[Alphabet (26x2), digits (10), symbols (~20)]

(Although, you could probably get away with omitting most of those
symbols, and just adding the ones you need to the calc expression if you
need them)

This calc ideally should be blank for all records once you've cleaned
everything up. So...do a find on that field for "*" (finds any record
where the calc is not empty; a lone "=" would find only the empty ones)

And see what turns up. Keep "fixing" characters, until nothing turns up
in that find anymore.

Then you are done.
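
If the field can be exported to a text file, the same allow-list idea can be tried outside FileMaker. The Python below is only a rough sketch, not FileMaker syntax: the name leftover_characters and the sample records are made up, and the allowed set is assumed to be printable ASCII plus tab, carriage return and line feed.

```python
import string

# Assumption: "allowed" means ASCII letters, digits, punctuation and space,
# plus tab, carriage return and line feed. Adjust the set as needed.
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + " \t\r\n")


def leftover_characters(text):
    """Return the characters of text that are not in ALLOWED, in order.

    Same idea as the Substitute calc: strip everything that is allowed,
    and whatever remains is suspect. An empty result means a clean value.
    """
    return "".join(ch for ch in text if ch not in ALLOWED)


# Hypothetical sample records, keyed by a made-up record id.
records = {
    1: "Baton Rouge, LA",
    2: "Caf\u00e9 du Monde",
    3: "K\u00f6ln & Z\u00fcrich",
}

# Keep only the records whose leftover string is non-empty.
suspects = {}
for record_id, value in records.items():
    extra = leftover_characters(value)
    if extra:
        suspects[record_id] = extra

print(suspects)
```

Only records 2 and 3 end up in suspects, which is the equivalent of the found set described above.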


#5
Cecil Bankston
Re: Finding a Range of Unicode Characters - 11-30-2005, 03:11 PM



The brute force calculation method specified below probably would work.
I'll give it a try.

I will rephrase the question to a more generalized one of how or if one
can use find requests to locate records containing a list or range (as
opposed to individual) of literal special characters (accented letters
or symbols) occurring anywhere in a field. I expect it probably can't
be done.

For example: *"é"* will find é (accented e in case the server doesn't
show the accented character I inserted between the quotes) anywhere in a
field. If I enter *"é"*...*"ö"* (umlaut o) to try to find all the
accented or special characters between é and ö, nothing is found. If I
enter é...ö all records with words beginning with any non-accented
character between e and o are found. I know I could use the brute force
method of separate find requests for each letter, if FM can handle that
many requests in one find. Of course that would require knowing all the
possible characters to include, which I don't.


--
Cecil N. Bankston
Baton Rouge, LA
USA


#6
42
Re: Finding a Range of Unicode Characters - 11-30-2005, 04:03 PM



In article <7Aojf.195$oz5.167@dukeread03>, cbankston (AT) spamfreecox (DOT) net
says...

Quote:
The brute force calculation method specified below probably would work.
I'll give it a try.
I will rephrase the question to a more generalized one of how or if one
can use find requests to locate records containing a list or range (as
opposed to individual) of literal special characters (accented letters
or symbols) occurring anywhere in a field. I expect it probably can't
be done.

List yes, range no.

Quote:
For example: *"é"* will find é (accented e in case the server doesn't
show the accented character I inserted between the quotes)

Actually just the 'e' by itself will work, the *"e"* construct is
unnecessary.

Quote:
anywhere in a
field. If I enter *"é"*...*"ö"* (umlaut o) to try to find all the
accented or special characters between é and ö, nothing is found. If I
enter é...ö all records with words beginning with any non-accented
character between e and o are found. I know I could use the brute force
method of separate find requests for each letter, if FM can handle that
many requests in one find.

It can, to any reasonable limit, at least. I've done dozens of requests,
but never had any need for hundreds or thousands, and haven't pushed it
that far.

Quote:
Of course that would require knowing all the
possible characters to include, which I don't.

Precisely.

And that's really an Achilles' heel in any case. I mean -- if you don't
know what all the characters are, how do you know that 'e' is the bottom
and 'o' is the top?

Regular expressions would make the brute force method much simpler
because you could easily specify it to be case insensitive and set the
range for alphabet [a-z], and digits [0-9], and then would only have
to manually specify the acceptable symbols. But regex are only available
via plugin.
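
As a rough idea of what the regex version looks like, here is a sketch in Python rather than in any FileMaker plug-in (plug-in function names and regex flavours differ, so treat the pattern as the portable part). The name find_non_ascii is made up, and the pattern assumes that anything outside printable ASCII, tab, CR and LF counts as suspect.

```python
import re

# Negated character class: matches any single character that is NOT
# tab, line feed, carriage return, or printable ASCII (space through tilde).
NON_ASCII = re.compile(r"[^\t\n\r\x20-\x7e]")


def find_non_ascii(text):
    """Return the list of suspect characters found in text, in order."""
    return NON_ASCII.findall(text)


print(find_non_ascii("plain text"))
print(find_non_ascii("\u00e9t\u00e9 in K\u00f6ln"))
```

One negated range replaces the whole list of individually named letters, digits and symbols in the brute force calc.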

-regards,
dave





#7
42
Re: Finding a Range of Unicode Characters - 11-30-2005, 04:09 PM



In article <MPG.1df7c1687fa08f00989def (AT) shawnews (DOT) vf.shawcable.net>,
nospam (AT) nospam (DOT) com says...

Quote:
For example: *"é"* will find é (accented e in case the server doesn't
show the accented character I inserted between the quotes)
Actually just the 'e' by itself will work, the *"e"* construct is
unnecessary.

Oops... no. You need the *'s but not the quotes. *e* works.


#8
Cecil Bankston
Re: Finding a Range of Unicode Characters - 11-30-2005, 05:16 PM



42 wrote:

Quote:
Actually just the 'e' by itself will work, the *"e"* construct is
unnecessary.
Oops... no. You need the *'s but not the quotes. *e* works.

Actually the quotes are required to find é (that's e with an accent).
Otherwise, all e's with and without accents are found. Unfortunately,
once the quotes are used, it seems the Find symbols such as > or ...
don't work in the find request as expected, even when the records are
sorted in Unicode order on the searched field. An example:
>"ö" (that's o with an umlaut) finds all records with words beginning
with characters >o (that's o with no accent), disregarding the order of
the characters in the Unicode character list.

--
Cecil N. Bankston
Baton Rouge, LA
USA


#9
Cecil Bankston
Re: Finding a Range of Unicode Characters - 11-30-2005, 09:01 PM




Quote:
Regular expressions would make the brute force method much simpler
because you could easily specify it to be case insensitive and set the
range for alphabet [a-z], and digits [0-9], and then would only have
to manually specify the acceptable symbols. But regex are only available
via plugin.

Is there a particular plugin for FM7 you had in mind?


--
Cecil N. Bankston
Baton Rouge, LA
USA


#10
Bill Marriott
Re: Finding a Range of Unicode Characters - 11-30-2005, 09:39 PM



Some fellow who makes a reg expression plugin posted his link on another
thread, reproduced for your convenience here (dunno if it works in Find
mode):

========================================
In addition to the solutions already proposed, you might also benefit from
looking into regular expressions. This is especially true if you anticipate
performing a variety of text transformations similar to the one you have
described.

An example of how you would collapse repeated characters using our regular
expression plug-in (yooMatch) is:

yooMatch_replace( "aaabberrrrtt"; "(.)\1+"; "\1"; "g" )

This function call would return "abert" as desired.
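
For readers without the plug-in, the same pattern can be tried in Python's standard re module. This is only a sketch of the regular expression itself; it says nothing about how yooMatch behaves beyond the call quoted above.

```python
import re

# (.) captures any single character; \1+ matches one or more repeats of it.
# Replacing the whole run with \1 keeps a single copy of the character.
# re.sub replaces every match by default, which plays the role of the "g"
# argument in the plug-in call above.
collapsed = re.sub(r"(.)\1+", r"\1", "aaabberrrrtt")
print(collapsed)  # abert
```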

If your needs are fairly simple and limited to collapsing repeated
characters, then clearly you are well-served by using built-in functionality
as already discussed. If your needs are somewhat broader, I believe you will
find that regular expressions constitute an extremely useful tool for which
there is no reasonable substitute in many situations.

Darren
yooPlugs - http://www.yooplugs.com
========================================


