The connection between lift and implication occurs when the lift is less than 1: p(A|B)/p(A) < 1. This means that B reduces the probability of A. In logical terms, we would write it as "B implies not A" or "A implies not B" (these two statements are equivalent). In general, you take one of the events involved in the lift calculation, negate it, and make it the consequent of the implication. It doesn't matter which event you negate: the logical meaning is the same (and the lift is symmetric anyway).
If the lift between two events is greater than 1, no logical implication is suggested. The lift of (birth,female) is greater than 1 since the probability of giving birth increases when you know the person is female. But you wouldn't say that being female implies giving birth. The correct implication comes from the small lift of (birth,male); in fact that lift is zero, since p(birth|male) = 0.
The second property of lift is that while lift is great at finding strong associations, it doesn't always reflect the practical impact of an association. Suppose two drugs interact in such a way that the probability of an adverse event is multiplied by five. If these drugs are rarely taken, this may mean that 5 people had the adverse event as opposed to an expected count of 1 person (nij/eij = 5). Compare that to a pair of drugs which have a minor interaction but are frequently taken: nij = 11000, eij = 10000, nij/eij = 1.1. The second interaction causes 1000 additional adverse events while the first causes only 4. For this reason, data miners sometimes prefer to use the gain as a measure of the impact of an association. Gain is the difference between the actual and expected count: gain = nij - eij. A confidence interval on the gain can be obtained from the approximate standard error sqrt(nij), as used in the confidence interval for lift.
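As a minimal sketch of this calculation in Python (the function name and the 95% normal multiplier are my choices; the standard error sqrt(nij) is the approximation just mentioned):

    import math

    def gain_with_interval(n_ij, e_ij, z=1.96):
        # gain = actual minus expected count, with an approximate interval
        # based on the standard error sqrt(n_ij)
        gain = n_ij - e_ij
        se = math.sqrt(n_ij)
        return gain, (gain - z * se, gain + z * se)

    # the two drug-pair examples from the text:
    print(gain_with_interval(5, 1))          # rare pair: gain of 4
    print(gain_with_interval(11000, 10000))  # common pair: gain of 1000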
For contingency tables, we went through four topics: visualization, sorting, measuring deviations, and abstraction. For numerical batches, visualization is handled by the boxplot. Sorting amounts to sorting the batches by mean or median. Deviations from equality are measured using contrasts, which are differences between means. The standard error of a contrast is straightforward to compute and is given in Moore & McCabe.
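As a concrete sketch, here is a small Python function (the name is mine) for a contrast between batch means and its standard error, using the standard pooled-variance formula se = sqrt(v * sum(aj^2/nj)) under an equal-variance assumption, with the common variance v estimated as described further below (no minus-1 correction):

    import math

    def contrast(batches, coeffs):
        # value and standard error of the contrast sum_j coeffs[j] * mean(batch j),
        # using a pooled estimate of a common variance
        ns = [len(b) for b in batches]
        means = [sum(b) / n for b, n in zip(batches, ns)]
        pooled_v = sum(sum((x - m) ** 2 for x in b)
                       for b, m in zip(batches, means)) / sum(ns)
        value = sum(a * m for a, m in zip(coeffs, means))
        se = math.sqrt(pooled_v * sum(a * a / n for a, n in zip(coeffs, ns)))
        return value, se

    # difference between the means of two batches:
    print(contrast([[2.0, 3.0, 4.0], [5.0, 7.0]], [1, -1]))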
Abstraction means abstracting the categorical predictor to have fewer values. For predictive abstraction, we want to merge values that predict the same response, or equivalently we want to merge batches that have the same distribution. Merging is based on a test of the similarity of the two batches. To do this, we need to make assumptions about the distribution. For this course, we will assume the batches are normal.
In introductory statistics, it is usual to assume that the batches have the same variance, i.e. our uncertainty in the response is the same for all factor levels. In this case, comparing two batches reduces to comparing their means. The canonical statistic for this is the t-squared statistic:

t^2 = (m1 - m2)^2 / (v*(1/n1 + 1/n2))

Here m1, m2 are the batch means, n1, n2 are the batch sizes, and v is the variance of each batch, assumed to be the same for all batches. Estimate v using (n1*v1 + n2*v2 + ...)/(n1+n2+...), where vj is the sample variance of batch j (I'm ignoring the minus 1 correction that is sometimes used in the denominator).
Notice that the t-squared statistic for merging two batches is the same as the merging cost under Ward's method, up to the constant factor 1/v, which can be ignored since v is assumed to be the same for all batches. This means that t-squared merging minimizes the same criterion as Ward's method, namely the sum of squares.
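To make the equivalence concrete, here is a small Python sketch (the function names are mine) computing both quantities; t-squared comes out exactly equal to Ward's merging cost divided by the pooled variance:

    import math

    def pooled_variance(batches):
        # n-weighted average of the batch variances (no minus-1 correction)
        ns = [len(b) for b in batches]
        means = [sum(b) / n for b, n in zip(batches, ns)]
        return sum(sum((x - m) ** 2 for x in b)
                   for b, m in zip(batches, means)) / sum(ns)

    def ward_cost(b1, b2):
        # Ward's merging cost: n1*n2/(n1+n2) * (m1 - m2)^2
        n1, n2 = len(b1), len(b2)
        m1, m2 = sum(b1) / n1, sum(b2) / n2
        return n1 * n2 / (n1 + n2) * (m1 - m2) ** 2

    def t_squared(b1, b2, v):
        # t^2 = (m1 - m2)^2 / (v*(1/n1 + 1/n2))
        n1, n2 = len(b1), len(b2)
        m1, m2 = sum(b1) / n1, sum(b2) / n2
        return (m1 - m2) ** 2 / (v * (1 / n1 + 1 / n2))

    b1, b2 = [1.0, 2.0, 3.0], [4.0, 6.0]
    v = pooled_variance([b1, b2])
    print(t_squared(b1, b2, v), ward_cost(b1, b2) / v)   # identical values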
However, the purpose of merging is to do data mining, so we typically do
not want to assume that the batches have the same variance. For this we
need a better criterion than the sum of squares. In particular, we need a
merging cost which takes into account the difference in variance as well as
the difference in means. If two batches have the same mean but different
variance, merging them can hide an important distinction.
An objective function which achieves this is the sum of the log variances of the batches. The merging cost derived from it does pay attention to the difference in means, but unlike the sum of squares it need not be zero when the means are equal: the merging cost is zero only when both the means and the variances of the two batches are equal.
This objective function, by the way, comes from Bayesian statistics and
is the marginal log-likelihood of the batches under the assumption of
normality. The sum of squares criterion is derived in the same way but
with an equal-variance constraint.
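As a minimal Python sketch of this criterion, assuming the objective is the size-weighted sum of log variances, n1*log(v1) + n2*log(v2) + ... (the weighting, the function names, and the omission of the constants from the marginal likelihood are my assumptions):

    import math

    def moments(batch):
        # size, mean, and variance (no minus-1 correction) of a batch
        n = len(batch)
        m = sum(batch) / n
        return n, m, sum((x - m) ** 2 for x in batch) / n

    def log_variance_cost(b1, b2):
        # increase in the sum of n*log(v) caused by merging the two batches;
        # zero only when both the means and the variances are equal
        n1, m1, v1 = moments(b1)
        n2, m2, v2 = moments(b2)
        v12 = (n1 * v1 + n2 * v2
               + n1 * n2 / (n1 + n2) * (m1 - m2) ** 2) / (n1 + n2)
        return (n1 + n2) * math.log(v12) - n1 * math.log(v1) - n2 * math.log(v2)

    # same mean, different variance: the sum of squares sees no difference,
    # but this cost is positive
    print(log_variance_cost([-1.0, 0.0, 1.0], [-3.0, 0.0, 3.0]))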
These two merging criteria are nicely illustrated by the time series
discussed on day17. Here it is again:
The mean of this time series doesn't change much, but its variance does.
Hence we would expect an equal-variance assumption to do poorly.
Ward's method gives breaks at 96 and 107, isolating the one period where
the mean seems to change.
The modified merging cost gives breaks at 52 and 199, corresponding to
the places where the variance changes:
Unlike day17, we did not have to convert the data to a contingency table to
get this result, but we did have to assume normality within each segment.
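For completeness, here is a rough Python sketch of how such breaks could be found by bottom-up merging of adjacent segments under the log-variance cost; this greedy procedure, the initial block size, and the stopping rule are my own choices, not necessarily what was used to produce the breaks quoted above:

    import math, random

    def nmv(seg):
        # size, mean, variance (no minus-1 correction) of a segment
        n = len(seg)
        m = sum(seg) / n
        return n, m, sum((x - m) ** 2 for x in seg) / n

    def merge_cost(a, b):
        # increase in the sum of n*log(v) when adjacent segments are merged
        n1, m1, v1 = nmv(a)
        n2, m2, v2 = nmv(b)
        v12 = (n1 * v1 + n2 * v2 + n1 * n2 / (n1 + n2) * (m1 - m2) ** 2) / (n1 + n2)
        return (n1 + n2) * math.log(v12) - n1 * math.log(v1) - n2 * math.log(v2)

    def segment(series, k, init=10):
        # start from blocks of `init` points, greedily merge the cheapest
        # adjacent pair until only k segments remain
        segs = [list(series[i:i + init]) for i in range(0, len(series), init)]
        while len(segs) > k:
            costs = [merge_cost(segs[i], segs[i + 1]) for i in range(len(segs) - 1)]
            i = costs.index(min(costs))
            segs[i:i + 2] = [segs[i] + segs[i + 1]]
        return segs

    # synthetic series whose variance (not mean) changes halfway through
    random.seed(0)
    series = [random.gauss(0, 1) for _ in range(100)] + [random.gauss(0, 3) for _ in range(100)]
    for s in segment(series, 2):
        print(len(s), nmv(s)[2])   # segment length and variance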
Evaluating associations according to gain is described by Bay and Pazzani, "Discovering and Describing Category Differences: What makes a discovered difference insightful?". They had human experts judge the importance of different rules and found that the experts typically used gain as their criterion.
The Bayesian likelihood for normality is discussed in "Inferring a Gaussian distribution".