![]() | |
![]() |
| | Thread Tools | Display Modes |
#1
| |||
| |||
|
#2
| |||
| |||
|
|
I have a small problem between maths and databases. |
|
I have a table in a database with 55 columns of reals and at least 200.000 lines (but up to 2.000.000 if I upload everything). So, I am in a 55 dimensions space with as many points as lines in my database. Here is my problem : I try to find the closest neighbour (limited to 1.000 points) of a point taken at random in the database. The distance is a classical euclidian distance in a 55 dimensions space. The brutal method is to browse the entire database, calculate the distance to the point for each row, order it and limit to the first 1000 rows. This is a single SQL instruction but it consumes a lot of resources. I suppose that it is possible to reduce the amount of calculus by eliminating points for which we know that they are to far away. Let's take an analogy : if you want to evaluate the closest cities to Chamonix - the famous French ski station! -, you will eliminate the cities situated in the south hemisphere as you know before evaluating anything that they are to far. Are there techniques to manage that kind of problem ? |
#3
| |||
| |||
|
|
I have a small problem between maths and databases. |
|
I have a table in a database with 55 columns of reals and at least 200.000 lines (but up to 2.000.000 if I upload everything). So, I am in a 55 dimensions space with as many points as lines in my database. Here is my problem : I try to find the closest neighbour (limited to 1.000 points) of a point taken at random in the database. The distance is a classical euclidian distance in a 55 dimensions space. The brutal method is to browse the entire database, calculate the distance to the point for each row, order it and limit to the first 1000 rows. This is a single SQL instruction but it consumes a lot of resources. I suppose that it is possible to reduce the amount of calculus by eliminating points for which we know that they are to far away. Let's take an analogy : if you want to evaluate the closest cities to Chamonix - the famous French ski station! -, you will eliminate the cities situated in the south hemisphere as you know before evaluating anything that they are to far. Are there techniques to manage that kind of problem ? |
#4
| |||
| |||
|
|
I have a small problem between maths and databases. |
|
I have a table in a database with 55 columns of reals and at least 200.000 lines (but up to 2.000.000 if I upload everything). So, I am in a 55 dimensions space with as many points as lines in my database. Here is my problem : I try to find the closest neighbour (limited to 1.000 points) of a point taken at random in the database. The distance is a classical euclidian distance in a 55 dimensions space. The brutal method is to browse the entire database, calculate the distance to the point for each row, order it and limit to the first 1000 rows. This is a single SQL instruction but it consumes a lot of resources. I suppose that it is possible to reduce the amount of calculus by eliminating points for which we know that they are to far away. Let's take an analogy : if you want to evaluate the closest cities to Chamonix - the famous French ski station! -, you will eliminate the cities situated in the south hemisphere as you know before evaluating anything that they are to far. Are there techniques to manage that kind of problem ? |
#5
| |||
| |||
|
|
I have a small problem between maths and databases. |
|
I have a table in a database with 55 columns of reals and at least 200.000 lines (but up to 2.000.000 if I upload everything). So, I am in a 55 dimensions space with as many points as lines in my database. Here is my problem : I try to find the closest neighbour (limited to 1.000 points) of a point taken at random in the database. The distance is a classical euclidian distance in a 55 dimensions space. The brutal method is to browse the entire database, calculate the distance to the point for each row, order it and limit to the first 1000 rows. This is a single SQL instruction but it consumes a lot of resources. I suppose that it is possible to reduce the amount of calculus by eliminating points for which we know that they are to far away. Let's take an analogy : if you want to evaluate the closest cities to Chamonix - the famous French ski station! -, you will eliminate the cities situated in the south hemisphere as you know before evaluating anything that they are to far. Are there techniques to manage that kind of problem ? |
#6
| |||
| |||
|
|
I have a small problem between maths and databases. |
|
I have a table in a database with 55 columns of reals and at least 200.000 lines (but up to 2.000.000 if I upload everything). So, I am in a 55 dimensions space with as many points as lines in my database. Here is my problem : I try to find the closest neighbour (limited to 1.000 points) of a point taken at random in the database. The distance is a classical euclidian distance in a 55 dimensions space. The brutal method is to browse the entire database, calculate the distance to the point for each row, order it and limit to the first 1000 rows. This is a single SQL instruction but it consumes a lot of resources. I suppose that it is possible to reduce the amount of calculus by eliminating points for which we know that they are to far away. Let's take an analogy : if you want to evaluate the closest cities to Chamonix - the famous French ski station! -, you will eliminate the cities situated in the south hemisphere as you know before evaluating anything that they are to far. Are there techniques to manage that kind of problem ? |
#7
| |||
| |||
|
|
I have a small problem between maths and databases. |
|
I have a table in a database with 55 columns of reals and at least 200.000 lines (but up to 2.000.000 if I upload everything). So, I am in a 55 dimensions space with as many points as lines in my database. Here is my problem : I try to find the closest neighbour (limited to 1.000 points) of a point taken at random in the database. The distance is a classical euclidian distance in a 55 dimensions space. The brutal method is to browse the entire database, calculate the distance to the point for each row, order it and limit to the first 1000 rows. This is a single SQL instruction but it consumes a lot of resources. I suppose that it is possible to reduce the amount of calculus by eliminating points for which we know that they are to far away. Let's take an analogy : if you want to evaluate the closest cities to Chamonix - the famous French ski station! -, you will eliminate the cities situated in the south hemisphere as you know before evaluating anything that they are to far. Are there techniques to manage that kind of problem ? |
#8
| |||
| |||
|
|
I have a small problem between maths and databases. |
|
I have a table in a database with 55 columns of reals and at least 200.000 lines (but up to 2.000.000 if I upload everything). So, I am in a 55 dimensions space with as many points as lines in my database. Here is my problem : I try to find the closest neighbour (limited to 1.000 points) of a point taken at random in the database. The distance is a classical euclidian distance in a 55 dimensions space. The brutal method is to browse the entire database, calculate the distance to the point for each row, order it and limit to the first 1000 rows. This is a single SQL instruction but it consumes a lot of resources. I suppose that it is possible to reduce the amount of calculus by eliminating points for which we know that they are to far away. Let's take an analogy : if you want to evaluate the closest cities to Chamonix - the famous French ski station! -, you will eliminate the cities situated in the south hemisphere as you know before evaluating anything that they are to far. Are there techniques to manage that kind of problem ? |
#9
| |||
| |||
|
|
I have a small problem between maths and databases. |
|
I have a table in a database with 55 columns of reals and at least 200.000 lines (but up to 2.000.000 if I upload everything). So, I am in a 55 dimensions space with as many points as lines in my database. Here is my problem : I try to find the closest neighbour (limited to 1.000 points) of a point taken at random in the database. The distance is a classical euclidian distance in a 55 dimensions space. The brutal method is to browse the entire database, calculate the distance to the point for each row, order it and limit to the first 1000 rows. This is a single SQL instruction but it consumes a lot of resources. I suppose that it is possible to reduce the amount of calculus by eliminating points for which we know that they are to far away. Let's take an analogy : if you want to evaluate the closest cities to Chamonix - the famous French ski station! -, you will eliminate the cities situated in the south hemisphere as you know before evaluating anything that they are to far. Are there techniques to manage that kind of problem ? |
#10
| |||
| |||
|
|
Alain Reymond <arwebmail (AT) skynet (DOT) be> wrote: I have a small problem between maths and databases. I hope I don't answer anything too silly that will shame my master degrees in both of these fields... I have a table in a database with 55 columns of reals and at least 200.000 lines (but up to 2.000.000 if I upload everything). So, I am in a 55 dimensions space with as many points as lines in my database. Here is my problem : I try to find the closest neighbour (limited to 1.000 points) of a point taken at random in the database. The distance is a classical euclidian distance in a 55 dimensions space. The brutal method is to browse the entire database, calculate the distance to the point for each row, order it and limit to the first 1000 rows. This is a single SQL instruction but it consumes a lot of resources. I suppose that it is possible to reduce the amount of calculus by eliminating points for which we know that they are to far away. Let's take an analogy : if you want to evaluate the closest cities to Chamonix - the famous French ski station! -, you will eliminate the cities situated in the south hemisphere as you know before evaluating anything that they are to far. Are there techniques to manage that kind of problem ? First, there may be something in PostGIS that could help with things like that. But I don't know anything about PostGIS, and I doubt that it can handle high dimensional spaces. You may want to look at it though. My gut feeling is that the "brutal" method (scanning through all rows, calculate the distance, take the minimum) will actually be the best method. True, it would be easy to optimize the computational effort along the lines you describe; e.g. you remember a "current minimal distance" and discard all rows where the difference in one coordinate exceeds that minimum. The problem here is that unless you can somehow use an index to select interesting rows, that won't get you anything. Accessing the table row on the disk takes far longer than it takes the processor to calculate the euclidian distance, so even if you reduce the amount of computation you need, there will be no real gain in response time. And I cannot think of an index that could be used for your problem. The positive point of view is that your query will not run substantially longer than a "SELECT * FROM tablename" would, which in the case of 2000000 rows might be OK with decent hardware. It depends on your requirements of course. Maybe somebody else can think of a smart index that could help here... Yours, Laurenz Albe Vielen Dank Laurenz. |
![]() |
| Thread Tools | |
| Display Modes | |
| |