dbTalk Databases Forums  

Distance and database

comp.databases.postgresql comp.databases.postgresql


Discuss Distance and database in the comp.databases.postgresql forum.



Reply
 
Thread Tools Display Modes
  #1  
Old   
Alain Reymond
 
Posts: n/a

Default Distance and database - 10-07-2008 , 11:39 AM






Hello,

I have a small problem between maths and databases.

I have a table in a database with 55 columns of reals and at least 200.000 lines (but up to 2.000.000 if I upload everything). So, I am in a 55 dimensions space with as many points as lines in my database.

Here is my problem : I try to find the closest neighbour (limited to 1.000 points) of a point taken at random in the database. The distance is a classical euclidian distance in a 55 dimensions space.

The brutal method is to browse the entire database, calculate the distance to the point for each row, order it and limit to the first 1000 rows. This is a single SQL instruction but it consumes a lot of resources.

I suppose that it is possible to reduce the amount of calculus by eliminating points for which we know that they are to far away. Let's take an analogy : if you want to evaluate the closest cities to Chamonix - the famous French ski station! -, you will eliminate the cities situated in the south hemisphere as you know before evaluating anything that they are to far.

Are there techniques to manage that kind of problem ?

Best regards


Alain


Reply With Quote
  #2  
Old   
Laurenz Albe
 
Posts: n/a

Default Re: Distance and database - 10-08-2008 , 01:58 AM






Alain Reymond <arwebmail (AT) skynet (DOT) be> wrote:
Quote:
I have a small problem between maths and databases.
I hope I don't answer anything too silly that will shame my
master degrees in both of these fields...

Quote:
I have a table in a database with 55 columns of reals and at least
200.000 lines (but up to 2.000.000 if I upload everything). So, I am
in a 55 dimensions space with as many points as lines in my database.

Here is my problem : I try to find the closest neighbour (limited to
1.000 points) of a point taken at random in the database. The distance
is a classical euclidian distance in a 55 dimensions space.

The brutal method is to browse the entire database, calculate the
distance to the point for each row, order it and limit to the first 1000
rows. This is a single SQL instruction but it consumes a lot of resources.

I suppose that it is possible to reduce the amount of calculus by
eliminating points for which we know that they are to far away. Let's
take an analogy : if you want to evaluate the closest cities to
Chamonix - the famous French ski station! -, you will eliminate the
cities situated in the south hemisphere as you know before evaluating
anything that they are to far.

Are there techniques to manage that kind of problem ?
First, there may be something in PostGIS that could help with things like
that. But I don't know anything about PostGIS, and I doubt that it can
handle high dimensional spaces. You may want to look at it though.

My gut feeling is that the "brutal" method (scanning through all rows,
calculate the distance, take the minimum) will actually be the best
method.

True, it would be easy to optimize the computational effort along the
lines you describe; e.g. you remember a "current minimal distance" and
discard all rows where the difference in one coordinate exceeds that
minimum.

The problem here is that unless you can somehow use an index to select
interesting rows, that won't get you anything. Accessing the table row on
the disk takes far longer than it takes the processor to calculate the
euclidian distance, so even if you reduce the amount of computation you
need, there will be no real gain in response time.

And I cannot think of an index that could be used for your problem.

The positive point of view is that your query will not run substantially
longer than a "SELECT * FROM tablename" would, which in the case of
2000000 rows might be OK with decent hardware. It depends on your
requirements of course.

Maybe somebody else can think of a smart index that could help here...

Yours,
Laurenz Albe


Reply With Quote
  #3  
Old   
Laurenz Albe
 
Posts: n/a

Default Re: Distance and database - 10-08-2008 , 01:58 AM



Alain Reymond <arwebmail (AT) skynet (DOT) be> wrote:
Quote:
I have a small problem between maths and databases.
I hope I don't answer anything too silly that will shame my
master degrees in both of these fields...

Quote:
I have a table in a database with 55 columns of reals and at least
200.000 lines (but up to 2.000.000 if I upload everything). So, I am
in a 55 dimensions space with as many points as lines in my database.

Here is my problem : I try to find the closest neighbour (limited to
1.000 points) of a point taken at random in the database. The distance
is a classical euclidian distance in a 55 dimensions space.

The brutal method is to browse the entire database, calculate the
distance to the point for each row, order it and limit to the first 1000
rows. This is a single SQL instruction but it consumes a lot of resources.

I suppose that it is possible to reduce the amount of calculus by
eliminating points for which we know that they are to far away. Let's
take an analogy : if you want to evaluate the closest cities to
Chamonix - the famous French ski station! -, you will eliminate the
cities situated in the south hemisphere as you know before evaluating
anything that they are to far.

Are there techniques to manage that kind of problem ?
First, there may be something in PostGIS that could help with things like
that. But I don't know anything about PostGIS, and I doubt that it can
handle high dimensional spaces. You may want to look at it though.

My gut feeling is that the "brutal" method (scanning through all rows,
calculate the distance, take the minimum) will actually be the best
method.

True, it would be easy to optimize the computational effort along the
lines you describe; e.g. you remember a "current minimal distance" and
discard all rows where the difference in one coordinate exceeds that
minimum.

The problem here is that unless you can somehow use an index to select
interesting rows, that won't get you anything. Accessing the table row on
the disk takes far longer than it takes the processor to calculate the
euclidian distance, so even if you reduce the amount of computation you
need, there will be no real gain in response time.

And I cannot think of an index that could be used for your problem.

The positive point of view is that your query will not run substantially
longer than a "SELECT * FROM tablename" would, which in the case of
2000000 rows might be OK with decent hardware. It depends on your
requirements of course.

Maybe somebody else can think of a smart index that could help here...

Yours,
Laurenz Albe


Reply With Quote
  #4  
Old   
Laurenz Albe
 
Posts: n/a

Default Re: Distance and database - 10-08-2008 , 01:58 AM



Alain Reymond <arwebmail (AT) skynet (DOT) be> wrote:
Quote:
I have a small problem between maths and databases.
I hope I don't answer anything too silly that will shame my
master degrees in both of these fields...

Quote:
I have a table in a database with 55 columns of reals and at least
200.000 lines (but up to 2.000.000 if I upload everything). So, I am
in a 55 dimensions space with as many points as lines in my database.

Here is my problem : I try to find the closest neighbour (limited to
1.000 points) of a point taken at random in the database. The distance
is a classical euclidian distance in a 55 dimensions space.

The brutal method is to browse the entire database, calculate the
distance to the point for each row, order it and limit to the first 1000
rows. This is a single SQL instruction but it consumes a lot of resources.

I suppose that it is possible to reduce the amount of calculus by
eliminating points for which we know that they are to far away. Let's
take an analogy : if you want to evaluate the closest cities to
Chamonix - the famous French ski station! -, you will eliminate the
cities situated in the south hemisphere as you know before evaluating
anything that they are to far.

Are there techniques to manage that kind of problem ?
First, there may be something in PostGIS that could help with things like
that. But I don't know anything about PostGIS, and I doubt that it can
handle high dimensional spaces. You may want to look at it though.

My gut feeling is that the "brutal" method (scanning through all rows,
calculate the distance, take the minimum) will actually be the best
method.

True, it would be easy to optimize the computational effort along the
lines you describe; e.g. you remember a "current minimal distance" and
discard all rows where the difference in one coordinate exceeds that
minimum.

The problem here is that unless you can somehow use an index to select
interesting rows, that won't get you anything. Accessing the table row on
the disk takes far longer than it takes the processor to calculate the
euclidian distance, so even if you reduce the amount of computation you
need, there will be no real gain in response time.

And I cannot think of an index that could be used for your problem.

The positive point of view is that your query will not run substantially
longer than a "SELECT * FROM tablename" would, which in the case of
2000000 rows might be OK with decent hardware. It depends on your
requirements of course.

Maybe somebody else can think of a smart index that could help here...

Yours,
Laurenz Albe


Reply With Quote
  #5  
Old   
Laurenz Albe
 
Posts: n/a

Default Re: Distance and database - 10-08-2008 , 01:58 AM



Alain Reymond <arwebmail (AT) skynet (DOT) be> wrote:
Quote:
I have a small problem between maths and databases.
I hope I don't answer anything too silly that will shame my
master degrees in both of these fields...

Quote:
I have a table in a database with 55 columns of reals and at least
200.000 lines (but up to 2.000.000 if I upload everything). So, I am
in a 55 dimensions space with as many points as lines in my database.

Here is my problem : I try to find the closest neighbour (limited to
1.000 points) of a point taken at random in the database. The distance
is a classical euclidian distance in a 55 dimensions space.

The brutal method is to browse the entire database, calculate the
distance to the point for each row, order it and limit to the first 1000
rows. This is a single SQL instruction but it consumes a lot of resources.

I suppose that it is possible to reduce the amount of calculus by
eliminating points for which we know that they are to far away. Let's
take an analogy : if you want to evaluate the closest cities to
Chamonix - the famous French ski station! -, you will eliminate the
cities situated in the south hemisphere as you know before evaluating
anything that they are to far.

Are there techniques to manage that kind of problem ?
First, there may be something in PostGIS that could help with things like
that. But I don't know anything about PostGIS, and I doubt that it can
handle high dimensional spaces. You may want to look at it though.

My gut feeling is that the "brutal" method (scanning through all rows,
calculate the distance, take the minimum) will actually be the best
method.

True, it would be easy to optimize the computational effort along the
lines you describe; e.g. you remember a "current minimal distance" and
discard all rows where the difference in one coordinate exceeds that
minimum.

The problem here is that unless you can somehow use an index to select
interesting rows, that won't get you anything. Accessing the table row on
the disk takes far longer than it takes the processor to calculate the
euclidian distance, so even if you reduce the amount of computation you
need, there will be no real gain in response time.

And I cannot think of an index that could be used for your problem.

The positive point of view is that your query will not run substantially
longer than a "SELECT * FROM tablename" would, which in the case of
2000000 rows might be OK with decent hardware. It depends on your
requirements of course.

Maybe somebody else can think of a smart index that could help here...

Yours,
Laurenz Albe


Reply With Quote
  #6  
Old   
Laurenz Albe
 
Posts: n/a

Default Re: Distance and database - 10-08-2008 , 01:58 AM



Alain Reymond <arwebmail (AT) skynet (DOT) be> wrote:
Quote:
I have a small problem between maths and databases.
I hope I don't answer anything too silly that will shame my
master degrees in both of these fields...

Quote:
I have a table in a database with 55 columns of reals and at least
200.000 lines (but up to 2.000.000 if I upload everything). So, I am
in a 55 dimensions space with as many points as lines in my database.

Here is my problem : I try to find the closest neighbour (limited to
1.000 points) of a point taken at random in the database. The distance
is a classical euclidian distance in a 55 dimensions space.

The brutal method is to browse the entire database, calculate the
distance to the point for each row, order it and limit to the first 1000
rows. This is a single SQL instruction but it consumes a lot of resources.

I suppose that it is possible to reduce the amount of calculus by
eliminating points for which we know that they are to far away. Let's
take an analogy : if you want to evaluate the closest cities to
Chamonix - the famous French ski station! -, you will eliminate the
cities situated in the south hemisphere as you know before evaluating
anything that they are to far.

Are there techniques to manage that kind of problem ?
First, there may be something in PostGIS that could help with things like
that. But I don't know anything about PostGIS, and I doubt that it can
handle high dimensional spaces. You may want to look at it though.

My gut feeling is that the "brutal" method (scanning through all rows,
calculate the distance, take the minimum) will actually be the best
method.

True, it would be easy to optimize the computational effort along the
lines you describe; e.g. you remember a "current minimal distance" and
discard all rows where the difference in one coordinate exceeds that
minimum.

The problem here is that unless you can somehow use an index to select
interesting rows, that won't get you anything. Accessing the table row on
the disk takes far longer than it takes the processor to calculate the
euclidian distance, so even if you reduce the amount of computation you
need, there will be no real gain in response time.

And I cannot think of an index that could be used for your problem.

The positive point of view is that your query will not run substantially
longer than a "SELECT * FROM tablename" would, which in the case of
2000000 rows might be OK with decent hardware. It depends on your
requirements of course.

Maybe somebody else can think of a smart index that could help here...

Yours,
Laurenz Albe


Reply With Quote
  #7  
Old   
Laurenz Albe
 
Posts: n/a

Default Re: Distance and database - 10-08-2008 , 01:58 AM



Alain Reymond <arwebmail (AT) skynet (DOT) be> wrote:
Quote:
I have a small problem between maths and databases.
I hope I don't answer anything too silly that will shame my
master degrees in both of these fields...

Quote:
I have a table in a database with 55 columns of reals and at least
200.000 lines (but up to 2.000.000 if I upload everything). So, I am
in a 55 dimensions space with as many points as lines in my database.

Here is my problem : I try to find the closest neighbour (limited to
1.000 points) of a point taken at random in the database. The distance
is a classical euclidian distance in a 55 dimensions space.

The brutal method is to browse the entire database, calculate the
distance to the point for each row, order it and limit to the first 1000
rows. This is a single SQL instruction but it consumes a lot of resources.

I suppose that it is possible to reduce the amount of calculus by
eliminating points for which we know that they are to far away. Let's
take an analogy : if you want to evaluate the closest cities to
Chamonix - the famous French ski station! -, you will eliminate the
cities situated in the south hemisphere as you know before evaluating
anything that they are to far.

Are there techniques to manage that kind of problem ?
First, there may be something in PostGIS that could help with things like
that. But I don't know anything about PostGIS, and I doubt that it can
handle high dimensional spaces. You may want to look at it though.

My gut feeling is that the "brutal" method (scanning through all rows,
calculate the distance, take the minimum) will actually be the best
method.

True, it would be easy to optimize the computational effort along the
lines you describe; e.g. you remember a "current minimal distance" and
discard all rows where the difference in one coordinate exceeds that
minimum.

The problem here is that unless you can somehow use an index to select
interesting rows, that won't get you anything. Accessing the table row on
the disk takes far longer than it takes the processor to calculate the
euclidian distance, so even if you reduce the amount of computation you
need, there will be no real gain in response time.

And I cannot think of an index that could be used for your problem.

The positive point of view is that your query will not run substantially
longer than a "SELECT * FROM tablename" would, which in the case of
2000000 rows might be OK with decent hardware. It depends on your
requirements of course.

Maybe somebody else can think of a smart index that could help here...

Yours,
Laurenz Albe


Reply With Quote
  #8  
Old   
Laurenz Albe
 
Posts: n/a

Default Re: Distance and database - 10-08-2008 , 01:58 AM



Alain Reymond <arwebmail (AT) skynet (DOT) be> wrote:
Quote:
I have a small problem between maths and databases.
I hope I don't answer anything too silly that will shame my
master degrees in both of these fields...

Quote:
I have a table in a database with 55 columns of reals and at least
200.000 lines (but up to 2.000.000 if I upload everything). So, I am
in a 55 dimensions space with as many points as lines in my database.

Here is my problem : I try to find the closest neighbour (limited to
1.000 points) of a point taken at random in the database. The distance
is a classical euclidian distance in a 55 dimensions space.

The brutal method is to browse the entire database, calculate the
distance to the point for each row, order it and limit to the first 1000
rows. This is a single SQL instruction but it consumes a lot of resources.

I suppose that it is possible to reduce the amount of calculus by
eliminating points for which we know that they are to far away. Let's
take an analogy : if you want to evaluate the closest cities to
Chamonix - the famous French ski station! -, you will eliminate the
cities situated in the south hemisphere as you know before evaluating
anything that they are to far.

Are there techniques to manage that kind of problem ?
First, there may be something in PostGIS that could help with things like
that. But I don't know anything about PostGIS, and I doubt that it can
handle high dimensional spaces. You may want to look at it though.

My gut feeling is that the "brutal" method (scanning through all rows,
calculate the distance, take the minimum) will actually be the best
method.

True, it would be easy to optimize the computational effort along the
lines you describe; e.g. you remember a "current minimal distance" and
discard all rows where the difference in one coordinate exceeds that
minimum.

The problem here is that unless you can somehow use an index to select
interesting rows, that won't get you anything. Accessing the table row on
the disk takes far longer than it takes the processor to calculate the
euclidian distance, so even if you reduce the amount of computation you
need, there will be no real gain in response time.

And I cannot think of an index that could be used for your problem.

The positive point of view is that your query will not run substantially
longer than a "SELECT * FROM tablename" would, which in the case of
2000000 rows might be OK with decent hardware. It depends on your
requirements of course.

Maybe somebody else can think of a smart index that could help here...

Yours,
Laurenz Albe


Reply With Quote
  #9  
Old   
Laurenz Albe
 
Posts: n/a

Default Re: Distance and database - 10-08-2008 , 01:58 AM



Alain Reymond <arwebmail (AT) skynet (DOT) be> wrote:
Quote:
I have a small problem between maths and databases.
I hope I don't answer anything too silly that will shame my
master degrees in both of these fields...

Quote:
I have a table in a database with 55 columns of reals and at least
200.000 lines (but up to 2.000.000 if I upload everything). So, I am
in a 55 dimensions space with as many points as lines in my database.

Here is my problem : I try to find the closest neighbour (limited to
1.000 points) of a point taken at random in the database. The distance
is a classical euclidian distance in a 55 dimensions space.

The brutal method is to browse the entire database, calculate the
distance to the point for each row, order it and limit to the first 1000
rows. This is a single SQL instruction but it consumes a lot of resources.

I suppose that it is possible to reduce the amount of calculus by
eliminating points for which we know that they are to far away. Let's
take an analogy : if you want to evaluate the closest cities to
Chamonix - the famous French ski station! -, you will eliminate the
cities situated in the south hemisphere as you know before evaluating
anything that they are to far.

Are there techniques to manage that kind of problem ?
First, there may be something in PostGIS that could help with things like
that. But I don't know anything about PostGIS, and I doubt that it can
handle high dimensional spaces. You may want to look at it though.

My gut feeling is that the "brutal" method (scanning through all rows,
calculate the distance, take the minimum) will actually be the best
method.

True, it would be easy to optimize the computational effort along the
lines you describe; e.g. you remember a "current minimal distance" and
discard all rows where the difference in one coordinate exceeds that
minimum.

The problem here is that unless you can somehow use an index to select
interesting rows, that won't get you anything. Accessing the table row on
the disk takes far longer than it takes the processor to calculate the
euclidian distance, so even if you reduce the amount of computation you
need, there will be no real gain in response time.

And I cannot think of an index that could be used for your problem.

The positive point of view is that your query will not run substantially
longer than a "SELECT * FROM tablename" would, which in the case of
2000000 rows might be OK with decent hardware. It depends on your
requirements of course.

Maybe somebody else can think of a smart index that could help here...

Yours,
Laurenz Albe


Reply With Quote
  #10  
Old   
Alain Reymond
 
Posts: n/a

Default Re: Distance and database - 10-08-2008 , 04:29 AM





Laurenz Albe a écrit :
Quote:
Alain Reymond <arwebmail (AT) skynet (DOT) be> wrote:

I have a small problem between maths and databases.


I hope I don't answer anything too silly that will shame my
master degrees in both of these fields...


I have a table in a database with 55 columns of reals and at least
200.000 lines (but up to 2.000.000 if I upload everything). So, I am
in a 55 dimensions space with as many points as lines in my database.

Here is my problem : I try to find the closest neighbour (limited to
1.000 points) of a point taken at random in the database. The distance
is a classical euclidian distance in a 55 dimensions space.

The brutal method is to browse the entire database, calculate the
distance to the point for each row, order it and limit to the first 1000
rows. This is a single SQL instruction but it consumes a lot of resources.

I suppose that it is possible to reduce the amount of calculus by
eliminating points for which we know that they are to far away. Let's
take an analogy : if you want to evaluate the closest cities to
Chamonix - the famous French ski station! -, you will eliminate the
cities situated in the south hemisphere as you know before evaluating
anything that they are to far.

Are there techniques to manage that kind of problem ?


First, there may be something in PostGIS that could help with things like
that. But I don't know anything about PostGIS, and I doubt that it can
handle high dimensional spaces. You may want to look at it though.

My gut feeling is that the "brutal" method (scanning through all rows,
calculate the distance, take the minimum) will actually be the best
method.

True, it would be easy to optimize the computational effort along the
lines you describe; e.g. you remember a "current minimal distance" and
discard all rows where the difference in one coordinate exceeds that
minimum.

The problem here is that unless you can somehow use an index to select
interesting rows, that won't get you anything. Accessing the table row on
the disk takes far longer than it takes the processor to calculate the
euclidian distance, so even if you reduce the amount of computation you
need, there will be no real gain in response time.

And I cannot think of an index that could be used for your problem.

The positive point of view is that your query will not run substantially
longer than a "SELECT * FROM tablename" would, which in the case of
2000000 rows might be OK with decent hardware. It depends on your
requirements of course.

Maybe somebody else can think of a smart index that could help here...

Yours,
Laurenz Albe

Vielen Dank Laurenz.

I have got some information and the problem can be related to kd-trees
and knn-trees.
The search must be much faster but I do not know if there is any
experience with postgres and sql in that field.

Mit freundlichen Gruessen,

Alain





Reply With Quote
Reply




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off



Powered by vBulletin Version 3.5.3
Copyright ©2000 - 2012, Jelsoft Enterprises Ltd.