Semester Project

The k-nearest neighbor problem is well known and studied in many fields. The problem is given a set of data items D and a query point Q, find the k elements in D that are closest to Q. The naive algorithm for this problem is to simply compute the distance between Q and every point in D, and remember the k smallest distances. In databases, we use indexes to improve the performance for many types of queries. Your project is to implement a multi-dimensional index to improve the performance of knn queries.

Unfortunately, the "curse of dimensionality" implies that as the number of dimensions increases, the performace of multi-dimensional indexes approaches that of a brute force algorithm. The same may be true for your index as well. In order to see if you can break the curse, your must compare your results against the brute force algorithm on a per dimension basis. In other words, one metric for your project will be (brute force time)/((your time)*(number of dimensions)). I will provide the brute force algorithm for you.

The paper Searching in high-dimensional spaces: Index structures for improving the performance of multimedia databases provides an overview of several multidimensional indexes and the curse of dimensionality. However, it is now several years old and newer indexes have been published. You may select an index from the paper, or an index from a newer paper. However, I must approve your selection. Well known and often studied indexes (like R*) will not be approved.

Databases access data without a priori knowledge of the structure. Your index will have to do the same. You may assume the existance of a file called schema.txt. Each line of this file contains 4 data items. The first is the name of the table (the data will be stored in a file with the same name as the table), the second is the name of an attribute, the third is the type of the attribute (for simplicity, the types will be either integer or varchar(x) where x is the fixed length of the string), and the fourth is a single character, 'T' if the attribute is a dimension attribute and 'F' if it is not. Distance between points is determined only by dimension attributes. For further simplicity, only integers will be used as dimension attributes. Update! As a simplification, you can assume the fill will contain one varchar identifier attribute followed by the domain attributes. As a result, schema.txt will have one line per file with columns tablename,identifier length, number of domain attributes. However, there may be multiple tables in the schema file.

Your program will accept command line options of the form: knn k q -- where q can be any number of integers, up to a max. Update! Command line will be of the form: knn table k q. The length of q represents the number of dimnesions to be used in the query. If your project does not have an index of length q, it is to create one dynamically. The results of the query should be the file offset of the k nearest neighbors with each distance from the query point q. The results must be sorted from least to greatest distance. For testing purposes, the tables will not be known in advance.

Your project must also perform a series of tests over different size dimensions with different size data files. The dimension tests will be the powers of 2 from 0 to 6 (7 tests total). The data files will be 1000, 10,000, and 100,000 rows long. I will provide the data sets and the schema for the test data. You are to report on the time required to perform all these queries versus the time of the brute force algorithm, using dimension free and dimensional statistics.

Here is a test schema.

Here is a new test schema, fitting the update.

Here is a test data file that goes with the schema. The 1000 row table and the 10,000 row table are now available.

Here is the brute force program. Please let me know if you find some bugs. Timing information has been added. Note that the distance returned is the square of the distance. This is a java jar file. To execute it, use the command

java -jar knn.jar tname k (data point)

The data file and schema.txt must be in the same directory as the jar file.