Information-theoretical Methods in Clustering



Abstract



Clustering is the process of organizing a set of objects in a meaningful way. Objects characterized by categorical attributes present a special challenge because the domains of the attributes lack a natural ordering. This thesis introduces information-theoretical measures that facilitate the definition of a distance between the partitions determined by the attributes of a database. These measures are based on a concave, sub-additive function, which allows us to generalize the notion of entropy and conditional entropy. We present their properties and application to three clustering problems related to categorical databases.

  1. In subspace clustering we want to find all sets of attributes that induce a partitioning of the database objects into a specified number of blocks. This can be done by imposing a threshold on the value of the entropy associated with a set of attributes. This threshold represents the maximum entropy of a partition having a prescribed number of clusters. We introduce a new algorithm for discovering all maximal sets of attributes that partition the set of objects in the required number of blocks.

  2. Finding the median partition is the problem of discovering a partition as similar as possible to all partitions determined by the attributes of an input database. We introduce a distance measure between partitions based on the notion of generalized conditional entropy. Discovering the median partition is computationally expensive, due to the large number of possible partitions, and to search more efficiently this space, we use a genetic algorithm approach.

  3. Using a sample database, where each object is tagged with its appropriate class, we can estimate the influence of each attribute on the determination of an object's class. We propose a new genetic algorithm for finding the natural clustering of the objects represented by the partition that preserves the same attribute influences as the estimated values. We discuss the type of databases for which our clustering method gives good results and we present a measure to assess this quality.


Created on September 18th, 2002