#1
#2
We need to 'filter' real-time transactions that can contain names and addresses against a blacklist of names and addresses held in a database. Is there a good 'standard' way of doing this, given that there may be spelling or format differences between the original and the blacklist?
#3
nowhere (AT) home (DOT) com wrote:
> We need to 'filter' real-time transactions that can contain names and addresses against a blacklist of names and addresses held in a database. Is there a good 'standard' way of doing this, given that there may be spelling or format differences between the original and the blacklist?

Matching names (people's names, as well as street or city names) might be done with a heuristic like double metaphone
http://www.nist.gov/dads/HTML/doubleMetaphone.html
or soundex
http://www.nist.gov/dads/HTML/soundex.html
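For reference, here is a minimal sketch of the classic American Soundex encoding mentioned above, written in Python for brevity (the logic should port directly to C or Java). The function name `soundex` and the choice to silently drop non-ASCII letters are assumptions of this sketch, not something specified in the thread.

```python
import string

# Consonant groups that share a Soundex digit.
_CODES = {}
for _digit, _letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code, or "" if no letters are found."""
    # Assumption: anything outside a-z (digits, punctuation, accented letters) is ignored.
    letters = [c for c in name.lower() if c in string.ascii_lowercase]
    if not letters:
        return ""
    first = letters[0]
    out = [first.upper()]
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so equal codes across a vowel are both kept
    return "".join(out)[:4].ljust(4, "0")
```

With this sketch, `soundex("Robert")` and `soundex("Rupert")` both give `R163`, so two spellings can be compared by comparing their codes rather than the raw strings.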
#4
We need to 'filter' real-time transactions that can contain names and addresses against a blacklist of names and addresses held in a database. Is there a good 'standard' way of doing this, given that there may be spelling or format differences between the original and the blacklist?

The addresses would probably not be limited to a single country, so I doubt we could make many assumptions about the address formats.

Ideally we want to calculate a number which gives a 'closeness' to each name/address on the blacklist. If the maximum value calculated is above some threshold, we can assume that the person is blacklisted.

Also, if anyone knows of a reasonably priced library which could do this, we would be interested.

The code to implement this would probably be written in C or Java, if that makes a difference. As this would be a real-time filter, speed would be a major factor in deciding which solution to pick. As yet I have no accuracy requirements for this project.

If anyone has any useful suggestions I would be pleased to read them.
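The 'closeness' number described above could, for example, be a normalized edit distance. The sketch below assumes Levenshtein distance as the metric and an arbitrary threshold of 0.85; neither is prescribed anywhere in the thread, and the helper names (`normalize`, `closeness`, `best_match`, `is_blacklisted`) are hypothetical. It is written in Python for brevity.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return " ".join(cleaned.split())


def closeness(a: str, b: str) -> float:
    """Score in [0, 1]; 1.0 means identical after normalization."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_match(candidate: str, blacklist: list) -> tuple:
    """Return (score, entry) for the closest blacklist entry, or (0.0, None)."""
    scored = [(closeness(candidate, entry), entry) for entry in blacklist]
    return max(scored, key=lambda pair: pair[0], default=(0.0, None))


def is_blacklisted(candidate: str, blacklist: list, threshold: float = 0.85) -> bool:
    # Assumption: 0.85 is a placeholder; a real threshold would need tuning on real data.
    score, _ = best_match(candidate, blacklist)
    return score >= threshold
```

Note that this scans the whole blacklist for every transaction, which may be too slow for a real-time filter with a large list. One option would be to narrow the candidates first (for instance by a phonetic key) and compute the edit distance only on that shortlist.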
#5
nowhere (AT) home (DOT) com wrote:
> We need to 'filter' real-time transactions that can contain names and addresses against a blacklist of names and addresses held in a database. Is there a good 'standard' way of doing this, given that there may be spelling or format differences between the original and the blacklist?
>
> The addresses would probably not be limited to a single country, so I doubt we could make many assumptions about the address formats.
>
> Ideally we want to calculate a number which gives a 'closeness' to each name/address on the blacklist. If the maximum value calculated is above some threshold, we can assume that the person is blacklisted.
>
> Also, if anyone knows of a reasonably priced library which could do this, we would be interested.
>
> The code to implement this would probably be written in C or Java, if that makes a difference. As this would be a real-time filter, speed would be a major factor in deciding which solution to pick. As yet I have no accuracy requirements for this project.
>
> If anyone has any useful suggestions I would be pleased to read them.

Knuth describes a method called 'soundex' in Vol 3 p. 391. Googling on soundex might be worthwhile.
#7
In article <qa2rgvoo0m8r7fj4epus11khcqsk10gse3 (AT) 4ax (DOT) com>, nowhere (AT) home (DOT) com wrote:
> We need to 'filter' real-time transactions that can contain names and addresses against a blacklist of names and addresses held in a database. Is there a good 'standard' way of doing this, given that there may be spelling or format differences between the original and the blacklist?

I have code which is an extension of code I found on the net (so it's yours for free, but if you want help with it I'll have to charge, as I'm very busy).

The original didn't say anything about the algorithm but, based on reading since, I think it's Metaphone. I enhanced it considerably to cope with Latin names in plants (e.g. eucalypt found by yookalipd).

It was used very successfully for matching street names years ago here in Perth, for a bike hazard reporting system which had to identify reports of the same hazard.

> The code to implement this would probably be written in C or Java

The code is available in 4th Dimension (a proprietary 4GL language) or C++.
#8
osmium wrote:
> nowhere (AT) home (DOT) com wrote:
> > We need to 'filter' real-time transactions that can contain names and addresses against a blacklist of names and addresses held in a database. Is there a good 'standard' way of doing this, given that there may be spelling or format differences between the original and the blacklist?
> >
> > The addresses would probably not be limited to a single country, so I doubt we could make many assumptions about the address formats.
> >
> > Ideally we want to calculate a number which gives a 'closeness' to each name/address on the blacklist. If the maximum value calculated is above some threshold, we can assume that the person is blacklisted.
> >
> > Also, if anyone knows of a reasonably priced library which could do this, we would be interested.
> >
> > The code to implement this would probably be written in C or Java, if that makes a difference. As this would be a real-time filter, speed would be a major factor in deciding which solution to pick. As yet I have no accuracy requirements for this project.
> >
> > If anyone has any useful suggestions I would be pleased to read them.
>
> Knuth describes a method called 'soundex' in Vol 3 p. 391. Googling on soundex might be worthwhile.

It's a lot older than Knuth. It predates computers. I built it into ParseRat (http://www.parserat.com) as an option to assist in de-duplicating lists. The algorithm is simple, but it breaks down with sound-alike INITIAL letters, e.g. it won't match "phone" and "fone".

--
Ed Guy P.Eng,CDP,MIEE
Information Technology Consultant
Internet: ed (AT) guysoftware (DOT) com
http://www.guysoftware.com
"Check out HELLLP!, WinHelp author tool for WinWord 2.0 through 8.0, PlanBee Project Management Planning System and ParseRat, the File Parser, Converter and Reorganizer"
#9
"Ed Guy" <ed_guy (AT) shaw (DOT) ca> wrote:
> osmium wrote:
> > nowhere (AT) home (DOT) com wrote:
> > > We need to 'filter' real-time transactions that can contain names and addresses against a blacklist of names and addresses held in a database. Is there a good 'standard' way of doing this, given that there may be spelling or format differences between the original and the blacklist?
> > >
> > > The addresses would probably not be limited to a single country, so I doubt we could make many assumptions about the address formats.
> > >
> > > Ideally we want to calculate a number which gives a 'closeness' to each name/address on the blacklist. If the maximum value calculated is above some threshold, we can assume that the person is blacklisted.
> > >
> > > Also, if anyone knows of a reasonably priced library which could do this, we would be interested.
> > >
> > > The code to implement this would probably be written in C or Java, if that makes a difference. As this would be a real-time filter, speed would be a major factor in deciding which solution to pick. As yet I have no accuracy requirements for this project.
> > >
> > > If anyone has any useful suggestions I would be pleased to read them.
> >
> > Knuth describes a method called 'soundex' in Vol 3 p. 391. Googling on soundex might be worthwhile.
>
> It's a lot older than Knuth. It predates computers. I built it into ParseRat (http://www.parserat.com) as an option to assist in de-duplicating lists. The algorithm is simple, but it breaks down with sound-alike INITIAL letters, e.g. it won't match "phone" and "fone".
>
> --
> Ed Guy P.Eng,CDP,MIEE
> Information Technology Consultant
> Internet: ed (AT) guysoftware (DOT) com
> http://www.guysoftware.com
> "Check out HELLLP!, WinHelp author tool for WinWord 2.0 through 8.0, PlanBee Project Management Planning System and ParseRat, the File Parser, Converter and Reorganizer"

Try prefixing each entry in the cleaned list with a specific letter and then re-soundexing.
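One reading of the last suggestion is to prepend a fixed letter to every name before encoding, so that the real initial letter is coded as a digit like any other consonant. The sketch below illustrates that reading in Python. The prefix `"A"` is an assumption of this sketch (a vowel is chosen because it carries no Soundex digit and so cannot merge with the letter that follows), and `prefixed_soundex` is a hypothetical helper name.

```python
import string

_CODES = {}
for _digit, _letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(name: str) -> str:
    """Plain American Soundex: first letter kept verbatim, then up to three digits."""
    letters = [c for c in name.lower() if c in string.ascii_lowercase]
    if not letters:
        return ""
    out = [letters[0].upper()]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code
    return "".join(out)[:4].ljust(4, "0")


def prefixed_soundex(name: str, prefix: str = "A") -> str:
    """Soundex of prefix + name, so the name's own initial is encoded as a digit."""
    return soundex(prefix + name)


if __name__ == "__main__":
    for word in ("phone", "fone"):
        print(word, soundex(word), prefixed_soundex(word))
```

Plain Soundex gives `P500` for "phone" and `F500` for "fone", so they do not match, whereas both prefixed codes come out as `A150`. The trade-off is that one of the three digits is now spent on the initial letter, and names that start with different vowels collapse together, so the code discriminates less than before.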