1. Populations, Samples, and Processes
An investigation will typically focus on a well-defined collection of objects constituting a population of interest.
When desired information is available for all objects in the population, we have what is called a census.
A subset of the population -- a sample -- is selected in some prescribed manner.
A variable is any characteristic whose value may change from one object to another in the population.
- A univariate data set consists of observations on a single variable.
- Bivariate data arises when observations are made on each of two variables.
- Multivariate data arises when observations are made on more than two variables.
Branches of Statistics
- An investigator who has collected data may wish simply to summarize and describe important features of the data. This entails using methods from descriptive statistics.
- Techniques for generalizing from a sample to a population are gathered within the branch of our discipline called inferential statistics.
Enumerative Versus Analytic Studies
- In enumerative studies, interest is focused on a finite, identifiable, unchanging collection of individuals or objects that make up a population.
- Analytic studies are often carried out with the objective of improving a future product by taking action on a process of some sort.
Collecting Data
2. Pictorial and Tabular Methods in Descriptive Statistics
Notation
The number of observations in a single sample will often be denoted by n.
Given a data set consisting of n observations on some variable x, the individual observations will be denoted by x1, x2, x3, ..., xn.
Stem-and-Leaf Displays
Steps for Constructing a Stem-and-Leaf Display
- Select one or more leading digits for the stem values. The trailing digits become the leaves.
- List possible stem values in a vertical column.
- Record the leaf for every observation beside the corresponding stem value.
- Indicate the units for stems and leaves someplace in the display.
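The steps above can be sketched in Python. This is a minimal illustration, not a general-purpose routine: it assumes non-negative integer observations, takes the ones digit as the leaf and the remaining leading digits as the stem, and the function name `stem_and_leaf` and the data are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(data):
    """Return the rows of a stem-and-leaf display as strings.

    Assumes a non-empty list of non-negative integers: the ones digit is
    the leaf and the remaining leading digits form the stem.
    """
    leaves = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)
        leaves[stem].append(str(leaf))

    rows = []
    # List every possible stem between the smallest and the largest,
    # including stems that have no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        rows.append(f"{stem} | {''.join(leaves.get(stem, []))}")
    # Indicate the units for stems and leaves.
    rows.append("Stem: tens digit, Leaf: ones digit")
    return rows


# Hypothetical data, chosen only for illustration.
for row in stem_and_leaf([12, 15, 21, 24, 24, 40, 48]):
    print(row)
```

Stem 3 appears with no leaves, which keeps the gap in the data visible in the display.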
Dotplots
A dotplot is an attractive summary of numerical data when the data set is reasonably small or there are relatively few distinct data values. Each observation is represented by a dot above the corresponding location on a horizontal measurement scale.
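A rough text-only dotplot can be sketched as follows. It assumes a small set of non-negative integer observations so that each value gets its own column on the scale; the function name `text_dotplot` and the data are hypothetical.

```python
from collections import Counter


def text_dotplot(data):
    """Return text rows: stacked dots above a horizontal integer scale."""
    counts = Counter(data)
    lo, hi = min(data), max(data)
    width = max(len(str(lo)), len(str(hi)))  # column width for the labels
    scale = range(lo, hi + 1)

    rows = []
    # Build the rows from the top level down; a value with k observations
    # gets a dot on each of the levels 1..k.
    for level in range(max(counts.values()), 0, -1):
        cells = ("." if counts[v] >= level else " " for v in scale)
        rows.append(" ".join(c.rjust(width) for c in cells).rstrip())
    rows.append(" ".join(str(v).rjust(width) for v in scale))
    return rows


# Hypothetical data, chosen only for illustration.
for row in text_dotplot([1, 2, 2, 3, 3, 3, 5]):
    print(row)
```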
Histograms
- A variable is discrete if its set of possible values either is finite or else can be listed in an infinite sequence.
- A variable is continuous if its possible values consist of an entire interval on the number line.
Consider data consisting of observations on a discrete variable x.
- The frequency of any particular x value is the number of times that value occurs in the data set.
- The relative frequency of a value is the fraction or proportion of times the value occurs.
- A frequency distribution is a tabulation of the frequencies and/or relative frequencies.
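As a small sketch of these definitions, the function below tabulates the frequency and relative frequency of each distinct value of a discrete variable. The name `frequency_distribution` and the data are hypothetical.

```python
from collections import Counter


def frequency_distribution(data):
    """Map each distinct value to (frequency, relative frequency)."""
    counts = Counter(data)
    n = len(data)
    return {value: (freq, freq / n) for value, freq in sorted(counts.items())}


# Hypothetical observations on a discrete variable x.
table = frequency_distribution([0, 1, 1, 2, 2, 2, 2, 3])
for value, (freq, rel_freq) in table.items():
    print(value, freq, rel_freq)
```

The relative frequencies sum to 1, apart from floating-point rounding in general.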
Histogram Shapes
Histograms come in a variety of shapes.
- A unimodal histogram is one that rises to a single peak and then declines.
- A bimodal histogram has two different peaks.
- A multimodal histogram has more than two peaks.
- A histogram is symmetric if the left half is the mirror image of the right half.
- A unimodal histogram is positively skewed if the right or upper tail is stretched out compared with the left or lower tail, and negatively skewed if the stretching is to the left.
Qualitative Data
Multivariate Data
3. Measures of Location
The Mean
For a given set of numbers x1, x2, x3, ..., xn, the most familiar and useful measure of the center is the mean, or arithmetic average, of the set.
The Median
The word median is synonymous with "middle", and the sample median is indeed the middle value when the observations are ordered from smallest to largest.
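The mean and the median can be compared with a short sketch. One assumption is added here: when n is even there is no single middle value, so the sketch follows the usual convention of averaging the two middle values. The function names and data are hypothetical.

```python
def mean(xs):
    """Arithmetic average of the observations."""
    return sum(xs) / len(xs)


def median(xs):
    """Middle value of the ordered observations.

    For even n, averages the two middle values (a common convention,
    assumed here).
    """
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical sample with one unusually large observation.
sample = [2, 4, 4, 5, 100]
print(mean(sample), median(sample))
```

In this sample the single large observation pulls the mean far above the median, which is unaffected by it.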
Other Measures of Location: Quartiles, Percentiles, and Trimmed Means
A trimmed mean is a compromise between the mean and the median. A 10% trimmed mean, for example, would be computed by eliminating the smallest 10% and the largest 10% of the sample and then averaging what is left over.
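A minimal sketch of the trimmed mean follows. It assumes the trimming percentage is an integer below 50, and when the percentage of n is not a whole number it simply rounds the count down rather than interpolating. The name `trimmed_mean` and the data are hypothetical.

```python
def trimmed_mean(xs, percent):
    """Average after dropping the smallest and largest `percent`% of xs.

    Assumes 0 <= percent < 50; the number dropped from each end is
    rounded down to a whole observation.
    """
    ordered = sorted(xs)
    n = len(ordered)
    k = n * percent // 100  # observations dropped from each end
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)


# Hypothetical sample of n = 10 observations with one large value.
sample = [1, 3, 4, 5, 5, 6, 7, 8, 9, 52]
print(trimmed_mean(sample, 0))   # ordinary mean
print(trimmed_mean(sample, 10))  # drops 1 and 52, then averages the rest
```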
Categorical Data and Sample Proportions
4. Measures of Variability
Measures of Variability for Sample Data
The simplest measure of variability in a sample is the range, which is the difference between the largest and smallest sample values.
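The sample variance, denoted by s², is the sum of the squared deviations of the observations from the sample mean, divided by n - 1.
The sample standard deviation, denoted by s, is the positive square root of the variance.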
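The sketch below computes both quantities using the standard definition, in which the sum of squared deviations from the sample mean is divided by n - 1. The function names and data are hypothetical, and at least two observations are assumed.

```python
import math


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def sample_sd(xs):
    """Positive square root of the sample variance."""
    return math.sqrt(sample_variance(xs))


# Hypothetical sample: mean 5, deviations -2, -2, 0, 2, 2.
sample = [3, 3, 5, 7, 7]
print(sample_variance(sample), sample_sd(sample))
```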
Motivation for s²
We will use σ² to denote the population variance and σ to denote the population standard deviation.
It is customary to refer to s² as being based on n - 1 degrees of freedom (df).
This terminology results from the fact that although s² is based on the n quantities x1 - x̄, x2 - x̄, ..., xn - x̄, these sum to 0, so specifying the values of any n - 1 of the quantities determines the remaining value. For example, if n = 4 and x1 - x̄ = 8, x2 - x̄ = -6, and x4 - x̄ = -4, then automatically x3 - x̄ = 2, so only 3 of the 4 values of xi - x̄ are freely determined (3 df).
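The degrees-of-freedom argument can be checked numerically. The sketch below computes the deviations from the sample mean and recovers a missing deviation from the other n - 1, using the fact that all n deviations sum to 0. The function names and the sample are hypothetical; the three known deviations are the ones from the example above.

```python
def deviations(xs):
    """Deviations of the observations from the sample mean."""
    xbar = sum(xs) / len(xs)
    return [x - xbar for x in xs]


def missing_deviation(known):
    """Given any n - 1 of the n deviations, return the remaining one.

    The n deviations sum to 0, so the missing one is minus the sum
    of the others.
    """
    return -sum(known)


# Hypothetical sample with mean 4: the deviations are -3, -2, -1, 6.
print(sum(deviations([1, 2, 3, 10])))

# The three known deviations from the n = 4 example.
print(missing_deviation([8, -6, -4]))
```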
A Computing Formula for s²
Boxplots
After the n observations in a data set are ordered from smallest to largest, the lower fourth and upper fourth are given by:
lower fourth:
- median of the smallest n/2 observations, n even
- median of the smallest (n+1)/2 observations, n odd
upper fourth:
- median of the largest n/2 observations, n even
- median of the largest (n+1)/2 observations, n odd
That is, the lower (upper) fourth is the median of the smallest (largest) half of the data, where the median is included in both halves if n is odd. A measure of spread that is resistant to outliers is the fourth spread ƒs, given by:
ƒs = upper fourth - lower fourth
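The rule for the fourths can be sketched in Python as follows. Integer division makes the half size equal n/2 when n is even and (n + 1)/2 when n is odd, so the median is shared by both halves in the odd case. The function names and data are hypothetical, and the median of an even-sized half is taken as the average of its two middle values (an assumed convention).

```python
def median(xs):
    """Middle value; averages the two middle values when n is even."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def fourths(xs):
    """Return (lower fourth, upper fourth) of the data."""
    ordered = sorted(xs)
    n = len(ordered)
    half = (n + 1) // 2  # n/2 if n is even, (n + 1)/2 if n is odd
    lower = median(ordered[:half])
    upper = median(ordered[n - half:])
    return lower, upper


def fourth_spread(xs):
    """Upper fourth minus lower fourth."""
    lower, upper = fourths(xs)
    return upper - lower


# Hypothetical data with n odd and n even.
print(fourths([1, 2, 3, 4, 5, 6, 7, 8, 9]))
print(fourths([1, 2, 3, 4, 5, 6, 7, 8]))
```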
Boxplots that Show Outliers
Any observation farther than 1.5ƒs from the closest fourth is an outlier. An outlier is extreme if it is more than 3ƒs from the nearest fourth, and it is mild otherwise.
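The outlier rule can be written as a small classifier. This sketch takes the two fourths as given inputs; the function name `classify_observation`, the labels it returns, and the numbers in the example are hypothetical.

```python
def classify_observation(x, lower_fourth, upper_fourth):
    """Label x as 'extreme outlier', 'mild outlier', or 'not an outlier'."""
    fs = upper_fourth - lower_fourth  # fourth spread

    # Distance from the closest fourth; zero for values between the fourths.
    if x < lower_fourth:
        distance = lower_fourth - x
    elif x > upper_fourth:
        distance = x - upper_fourth
    else:
        distance = 0

    if distance > 3 * fs:
        return "extreme outlier"
    if distance > 1.5 * fs:
        return "mild outlier"
    return "not an outlier"


# Hypothetical fourths 10 and 20, so the fourth spread is 10.
for x in (15, 35, 36, 51, -6):
    print(x, classify_observation(x, 10, 20))
```

An observation exactly 1.5ƒs from the nearest fourth is not flagged, since the rule requires it to be farther than that.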
Comparative Boxplots
A comparative or side-by-side boxplot is a very effective way of revealing similarities and differences between two or more data sets consisting of observations on the same variable.