| Previous | Table of Contents | Next |
There are many possible approaches for implementing high level search in a data mining environment. The following are some definitions used in the discussion of high level search strategies. Following the definitions are several high level search strategies that have been cited as popular alternatives for implementation in present day data mining environments [5]:
Definitions:
High Level Search Strategies:
PROBLEM STATEMENT
The problem addressed in this chapter consists of using a genetic algorithm to search a synthetically generated database in order to discover a relationship between data items. The synthetic database contains N transactions, where N varies based on the test being run. Each transaction, similar to transactions found in a retailers database, consists of a list of items purchased, where each item is one of 100 different items. Given a database with the characteristics just described, the goal is then to determine the four items that are most often purchased together.
The remainder of this section discusses an overview of the problem implementation, the format of a synthetic data set, and a supporting program, the synthetic data generator.
Implementation Overview
If we are to find four related items out of a possible 100, this implies that there are 100 choose 4, or 3,921,225, combinations of items that we can potentially investigate. As previously mentioned, it is unreasonable to investigate all 3,921,225 combinations of potential relationships since, for each combination, the entire database (or a sufficiently large subset) has to be scanned to determine how often the four items appear together. To address this problem, a genetic algorithm is used to investigate a subset of the possible combinations. The intent was that by using a genetic-based search, the combination of four items that are most often purchased together could be determined much more efficiently than through an enumerative investigation of every possible combination. Genetic algorithm specifics on coding, fitness, and operators is given in the following section, Genetic Algorithm Specifics. Thus, an overview of the genetic algorithm in the data mining problem domain is presented next.
Before performing any type of genetic search, a coding scheme was determined to represent the four potentially related items. To prepare for the search, X sets (the population size) of four items are picked at random and then coded into strings based on the coding scheme. These X strings represent generation 0 of the search. Then, for each of the X sets of four coded items, the entire database (or a sufficiently large subset) is scanned to generate a statistic (the fitness) as to how often the four items are purchased together. This fitness is based on the frequency with which the set of four items appear within the database under investigation. In addition, if a fractional number of the items in the item list appear within a transaction, the fitness reflects that fraction.
Once a fitness value is determined for each set of items in the population, the three genetic operators (reproduction, crossover, and mutation) are performed to generate a new population of strings. After performing reproduction, crossover, and mutation, if all goes as planned, each successive population should contain more highly fit strings, ultimately leading to a set of the four most frequently occurring items within the database.
In order to fully verify and analyze this genetic algorithm approach, simulations were run using several different fitness functions, crossover operators, mutation operators and genetic algorithm parameters. In addition, simulations were run using several different size data sets with varying characteristics (see the database configuration matrix in the section, Results for details). The synthetically generated data, as described in the next section, provided the advantage of being able to represent specific relationships within a data set and facilitated verification of the genetic algorithm-based search approach.
| Previous | Table of Contents | Next |