Computational Intelligence - November 2012 - 39

exploration and validation. While we
utilized UCS in this study and focused on a
very specific data mining problem, this
pipeline could be expanded to knowledge
discovery in any M-LCS applied to a
single-step data mining problem.

rule population evolved by UCS on the noisy target dataset
considered in this study. Table 1 displays the top 10 rules
identified by UCS after 200,000 learning iterations. This
rule population is the same one applied in section 3.0.3 for
visualization. Half the samples in the dataset were generated
with a predictive epistatic interaction between attributes
'X0' and 'X1,' while the other half were generated with a
different epistatic interaction between attributes 'X2' and
'X3.' All other attributes were randomly simulated as non-predictive. Optimally generalized rules would strictly specify
one of these two pairs of attributes (i.e., X0, X1 or X2, X3),
but no others. Examining the top 10 rules in Table 1, we can
see that this is never the case. We do see that the correct
attribute pairs often occur together in these top rules; however, one or two other non-predictive attributes tend to be
specified as well. Specification of these non-predictive attributes affords the rule higher accuracy in the training data
(all top rules in Table 1 have 100% accuracy). Scanning
down the complete ordered list of rules, we finally observed
an optimally generalized rule 43rd on the list; however, without already knowing the true association pattern, we
would have no way of identifying that rule as optimal.
So how do we separate the attributes that are reliable from
those that are the product of over-fitting in a noisy
environment?

3.0.1 Run the M-LCS

This section details steps 1 and 2. First we
employed a 10-fold CV strategy in order to
determine average testing accuracy and
account for over-fitting. The dataset is randomly partitioned into 10 equal parts and
UCS is run 10 separate times during which
9/10 of the data is used to train the algorithm, and a different 1/10 is set aside for
testing. We averaged training and testing
accuracies over these 10 runs. Next we set
up our permutation test. A permutation test involves
repeating the analysis on variations of the dataset (with class
status shuffled) in order to determine the likelihood that
the observed result could have occurred by chance. We
chose to use the permutation test since we do not know
the chance distribution of our statistics ahead of time. We
generated 1000 permuted versions of the original dataset
by randomly permuting the affection status (class) of all
samples, while preserving the number of cases and controls.
For each permuted dataset we ran UCS using 10-fold CV.
In total, permutation testing requires 10,000 runs of UCS (1000 permuted datasets, each with 10 CV runs).
We performed this analysis using "Discovery," a 1372-processor Linux cluster.
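
The permutation and cross-validation procedure described in this section can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the `run_learner` callable (a stand-in for one UCS run that returns a testing accuracy), and the choice to draw fresh folds for each permuted dataset are all assumptions made for the example.

```python
import random

def permute_labels(labels, rng):
    """Return a shuffled copy of the class labels.

    Shuffling only reorders the labels, so the number of cases and
    controls is preserved automatically.
    """
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

def cv_folds(n_samples, k, rng):
    """Randomly partition sample indices into k folds of near-equal size."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def permutation_cv_accuracies(features, labels, run_learner,
                              n_perm=1000, k=10, seed=0):
    """Average testing accuracy for each of n_perm label-permuted datasets.

    run_learner(train_X, train_y, test_X, test_y) is a hypothetical
    callable standing in for one run of the learner (e.g. UCS); it must
    return the testing accuracy for that run.
    """
    rng = random.Random(seed)
    n = len(labels)
    averages = []
    for _ in range(n_perm):
        permuted = permute_labels(labels, rng)
        fold_accuracies = []
        for test_idx in cv_folds(n, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(n) if i not in held_out]
            fold_accuracies.append(run_learner(
                [features[i] for i in train_idx],
                [permuted[i] for i in train_idx],
                [features[i] for i in test_idx],
                [permuted[i] for i in test_idx],
            ))
        averages.append(sum(fold_accuracies) / k)
    return averages
```

With `n_perm=1000` and `k=10`, the learner is invoked 10,000 times, which matches the run count given above and suggests why a cluster was used.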

3.0.2 Significance Testing of M-LCS Statistics

This section details steps 3 and 4. First and foremost, we confirmed that our average testing accuracy from Step 1 is significantly higher than would be expected by random chance. We
utilized a typical one-tailed permutation test with a significance
threshold of p < 0.05. To determine p for a test statistic (in this
case, average testing accuracy), calculate the test statistic for each
of the 1000 permuted CV analyses. If the true test statistic from
Step 1 is greater than 95% of the 1000 permuted runs, you can
reject the null hypothesis at p < 0.05. Here, the null hypothesis
is: the observed value of the statistic could have likely occurred

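The one-tailed permutation test described in this section can be illustrated with a short sketch. The function name is hypothetical, and counting ties (permuted values equal to the observed one) against the observed statistic is an assumption of this example rather than something stated in the text.

```python
def permutation_p_value(observed, permuted_stats):
    """One-tailed empirical p-value for an observed test statistic.

    Returns the fraction of permuted statistics that are at least as
    large as the observed one. Ties are counted against the observed
    value, which is an assumption of this sketch.
    """
    n_as_extreme = sum(1 for s in permuted_stats if s >= observed)
    return n_as_extreme / len(permuted_stats)
```

Under this definition, an observed statistic that exceeds more than 95% of the 1000 permuted values yields p < 0.05, in line with the rejection rule above.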
3. Analysis Pipeline

Our proposed analysis pipeline includes the following steps:
(1) run the M-LCS algorithm with 10-fold cross validation
(CV) on the dataset, (2) run a permutation test with 1000
permutations, (3) confirm significance of testing accuracy,
(4) identify significant attributes and significantly co-occurring pairs of attributes, (5) train the M-LCS algorithm on
the entire dataset, (6) generate a clustered heat-map of the
rule-population, (7) generate a network depicting attribute
co-occurrence, and (8) combine statistical results with visualizations to interpret and generate hypotheses for further

Table 2 Example calculation of SpS and AWSpS within
unordered hypothetical rules.
x1

Class Numerosity Accuracy

0.73

0.51

0.88

0.62

SpS

AWSpS

5.41 2.89 0.62 5.92

NOVEMBER 2012 | IEEE COMPUTATIONAL INTELLIGENCE MAGAZINE
