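Assume we are given $n$ data points $x_1, \dots, x_n$, where $x_k = (x_{1k}, \dots, x_{mk})$, and we want to compute an index for each data point. For our purposes, an index is a function of $x$, i.e. a rule $\mathrm{index}(x)$ that assigns a single number to each data point.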
A good index is one which spreads the data points as much as possible.
That is, if we make a histogram of the values of $\mathrm{index}(x)$ for every point $x$, a good index gives a histogram that is spread widely over the range of values, while a bad index gives one that clusters around its center.
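The histogram comparison can be tried out with a minimal Python sketch. The data set and the two candidate indices (`index_good`, `index_bad`) are hypothetical and chosen only for illustration; nothing in the text fixes a particular form for the index.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n = 1000 points with m = 2 coordinates each.
# The first coordinate varies much more than the second.
points = rng.normal(size=(1000, 2)) * np.array([5.0, 0.5])

def index_good(x):
    # Illustrative index: reads off the high-variance coordinate.
    return x[..., 0]

def index_bad(x):
    # Illustrative index: reads off the low-variance coordinate.
    return x[..., 1]

# Shared bins so that the two histograms are directly comparable.
bins = np.linspace(-15, 15, 13)
hist_good, _ = np.histogram(index_good(points), bins=bins)
hist_bad, _ = np.histogram(index_bad(points), bins=bins)

print("good index counts:", hist_good)
print("bad index counts: ", hist_bad)
print("variances:", index_good(points).var(), index_bad(points).var())
```

With these assumptions, the counts for `index_bad` pile up in the two central bins, while those for `index_good` are distributed over many bins, which is the difference between the two histograms described above.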
Let's define the optimal index as the index which maximizes the variance of $\mathrm{index}(x_k)$ with respect to all data points $k = 1, \dots, n$.
Let $X$ be the random variable taking the value $x_k$ for a randomly chosen $k$. Then the problem can be expressed as maximizing $\operatorname{Var}(\mathrm{index}(X))$.
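As a rough illustration of the quantity being maximized, the sketch below computes the variance of an index when each data point is drawn with probability 1/n (an assumption about what "randomly chosen" means here). The helper `index_variance`, the sample points, and the two candidate indices are hypothetical names introduced for this example.

```python
import numpy as np

def index_variance(index, points):
    """Variance of index(X) when X is a uniformly random data point."""
    values = np.array([index(x) for x in points], dtype=float)
    mean = values.mean()                   # E[index(X)], each point has weight 1/n
    return ((values - mean) ** 2).mean()   # E[(index(X) - E[index(X)])^2]

# Hypothetical data: n = 4 points with m = 2 coordinates each.
points = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0]])

first_coordinate = lambda x: x[0]    # illustrative candidate index
second_coordinate = lambda x: x[1]   # illustrative candidate index

print(index_variance(first_coordinate, points))   # 1.25
print(index_variance(second_coordinate, points))  # 0.25
```

Of these two candidates, the first has the larger variance and would be preferred under the criterion above. The sketch only compares two fixed candidates; it does not search over a family of index functions.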