Day 2 - Summarizing a batch
View the class slides [PDF]
Code for the spray example
Synopsis
When a data set is large and many factors influence the response,
we end up with many batches of numbers.
It is important to simplify these batches as much as possible
without hiding important details. The levels of simplification we
can apply, from least to most simplified, are:
- Strip chart (just sorting and spacing)
- Histogram (distribution)
- Boxplot (center, spread, skew)
- (center, spread) representation
- center only
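The levels above can be sketched in code. This is only an illustration under stated assumptions: the batch is made up, the median stands in for "center", the interquartile range stands in for "spread", and the quartile convention is whatever Python's `statistics.quantiles` uses by default. None of these choices come from the lecture itself.

```python
from collections import Counter
import statistics

# Made-up batch of numbers, for illustration only.
batch = [3, 5, 7, 8, 12, 13, 14, 18, 21]

def strip_chart(values):
    """Level 1: keep every value, just sorted."""
    return sorted(values)

def histogram(values, bin_width):
    """Level 2: counts per bin [start, start + bin_width)."""
    counts = Counter((x // bin_width) * bin_width for x in values)
    return dict(sorted(counts.items()))

def boxplot_numbers(values):
    """Level 3: min, quartiles, and max, which show center, spread, and skew."""
    q1, med, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, med, q3, max(values)

def center_spread(values):
    """Level 4: median (center) and interquartile range (spread)."""
    _, q1, med, q3, _ = boxplot_numbers(values)
    return med, q3 - q1

def center_only(values):
    """Level 5: the median alone."""
    return statistics.median(values)

for level in (strip_chart(batch), histogram(batch, 5),
              boxplot_numbers(batch), center_spread(batch),
              center_only(batch)):
    print(level)
```

Each step discards information that the previous one kept, which is why the obstacles below matter: a later level is only safe when what it discards is uninteresting.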
There are various obstacles to reaching a high level of simplification,
which we would like to eliminate.
If the batches have different spread, we are prevented from using center only.
If the batches are skewed, we are prevented from going beyond a boxplot.
If there are outside points (outliers), our summary cannot include them.
Transformation can often eliminate these problems.
If the distribution has multiple peaks, then transformation doesn't
work and we are prevented from going beyond a histogram. We will
discuss how to handle this case later.
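As a rough sketch of how a transformation can remove these obstacles, the example below takes logarithms of two made-up, right-skewed batches. The data, the choice of the log transformation, and the quartile-based skew measure are all assumptions for illustration, not the method used in the lecture.

```python
import math
import statistics

def quartile_summary(values):
    """Return (center, spread, skew) based on quartiles.

    center = median, spread = interquartile range,
    skew   = (q3 + q1 - 2*median) / spread, which is 0 for a
             batch whose quartiles are symmetric about the median.
    """
    q1, med, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    skew = (q3 + q1 - 2 * med) / spread
    return med, spread, skew

# Made-up batches: batch_b is batch_a multiplied by 10.
batch_a = [1, 2, 4, 8, 16, 32, 64]
batch_b = [10 * x for x in batch_a]

raw_a = quartile_summary(batch_a)
raw_b = quartile_summary(batch_b)
log_a = quartile_summary([math.log(x) for x in batch_a])
log_b = quartile_summary([math.log(x) for x in batch_b])

print("raw scale:", raw_a, raw_b)   # spreads differ, both batches skewed
print("log scale:", log_a, log_b)   # spreads match, skew near zero
```

On the raw scale the two batches differ in spread and both are skewed, so neither a (center, spread) nor a center-only summary would be adequate. On the log scale the spreads agree and the skew is essentially zero, so the batches differ only in center.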
You may worry that transformation hides the absolute scale of differences.
After you've found an interesting "nugget", you can transform back to
report results. But transformation helps you find the nugget in the first
place.
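A minimal sketch of transforming back, assuming the log transformation and the same kind of made-up batches as an illustration: centers are compared on the log scale, then reported on the original scale by applying the inverse transformation (`math.exp`). A difference on the log scale corresponds to a ratio on the original scale.

```python
import math
import statistics

# Made-up batches, for illustration only.
batch_a = [1, 2, 4, 8, 16, 32, 64]
batch_b = [10 * x for x in batch_a]

def log_center(values):
    """Median of the log-transformed batch."""
    return statistics.median([math.log(x) for x in values])

# Work on the transformed scale, where the batches differ only in center.
difference = log_center(batch_b) - log_center(batch_a)

# Transform back to report results in the original units.
center_a = math.exp(log_center(batch_a))
center_b = math.exp(log_center(batch_b))
ratio = math.exp(difference)

print(f"center of A: {center_a:.1f}, center of B: {center_b:.1f}")
print(f"B is about {ratio:.1f} times A")
```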
The lecture also described some basic principles of plotting, and we
will add to these as the course goes on.
Optional reading
Measures of center and spread, boxplots
- Introduction to the Practice of Statistics, Chapter 1, by Moore and McCabe
Transformations for symmetry
- EDA notes by Michael Friendly
It includes a method for automatically finding the right transformation.
Tom Minka
Last modified: Mon Aug 26 15:14:35 EDT 2002