Information-theoretical Methods in Clustering
Abstract
Clustering is the process of organizing a set of objects in a
meaningful way. Objects characterized by categorical attributes
present a special challenge because the domains of the attributes lack
a natural ordering. This thesis introduces information-theoretical
measures that facilitate the definition of a distance between the
partitions determined by the attributes of a database. These measures
are based on a concave, sub-additive function, which allows us to
generalize the notion of entropy and conditional entropy. We present
their properties and application to three clustering problems related
to categorical databases.
- In subspace clustering we want to find all sets of attributes
that induce a partitioning of the database objects into a specified
number of blocks. This can be done by imposing a threshold on the
value of the entropy associated with a set of attributes. This
threshold represents the maximum entropy of a partition having a
prescribed number of clusters. We introduce a new algorithm for
discovering all maximal sets of attributes that partition the set of
objects in the required number of blocks.
- Finding the median partition is the problem of discovering a
partition as similar as possible to all partitions determined by the
attributes of an input database. We introduce a distance measure
between partitions based on the notion of generalized conditional
entropy. Discovering the median partition is computationally
expensive, due to the large number of possible partitions, and to
search more efficiently this space, we use a genetic algorithm
approach.
- Using a sample database, where each object is tagged with its
appropriate class, we can estimate the influence of each attribute on
the determination of an object's class. We propose a new genetic
algorithm for finding the natural clustering of the objects
represented by the partition that preserves the same attribute
influences as the estimated values. We discuss the type of databases
for which our clustering method gives good results and we present a
measure to assess this quality.
Created on September 18th, 2002