One-Variable Data Analysis
The goal is to describe a dataset in terms of its shape, center, and spread.
Terms to know: shape, gaps, clusters of data points, outliers
- symmetric (has symmetry around some axis)
- mound-shaped (bell-shaped)
- skewed (data are skewed to the left if the tail is to the left, and skewed to the right if the tail is to the right)
- bimodal (two peaks: has more than one location with many scores)
- uniform (frequencies of the various values are more or less constant)
There are four types of graphs that help us understand the shape of a distribution.
- Dotplot: a very simple type of graph that involves plotting the data values with dots.
- Stemplot (Stem and Leaf Plot)
Stemplots are typically used when there is a moderate number of quantitative observations to analyze; stemplots of more than 50 observations are unusual. The name “stemplot” comes from the layout: each value is split into a “stem” with the largest place-value digits on the left (shaded in gray in the image below) and a “leaf” on the right.
- A bar graph is used to illustrate qualitative data, and a histogram is used to illustrate quantitative data. The horizontal axis of a histogram contains numerical values, and the vertical axis contains the frequencies, or relative frequencies, of the values.
The boundaries for the class intervals of this graph are 12.5, 17.5, 22.5, 27.5, 32.5, 37.5...
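The stemplot and the histogram classes described above can be sketched in plain text. This is only an illustration: the data values are made up, the helper names `stemplot` and `class_counts` are hypothetical, and the class boundaries are the ones listed above. It assumes non-negative integer data, with the last digit used as the leaf.

```python
from bisect import bisect_right
from collections import defaultdict

def stemplot(values):
    """Return the rows of a stemplot for non-negative integers.

    Stem = every digit except the last; leaf = the last digit.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        # Empty stems are kept so that gaps in the data stay visible.
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {row}")
    return rows

def class_counts(values, boundaries):
    """Count how many values fall in each class [boundaries[i], boundaries[i+1])."""
    counts = [0] * (len(boundaries) - 1)
    for x in values:
        i = bisect_right(boundaries, x) - 1  # class whose lower boundary is <= x
        if 0 <= i < len(counts):             # ignore values outside all classes
            counts[i] += 1
    return counts

data = [13, 14, 18, 19, 19, 21, 25, 26, 28, 31, 33, 36]  # made-up values
boundaries = [12.5, 17.5, 22.5, 27.5, 32.5, 37.5]         # from the text

stem_rows = stemplot(data)
counts = class_counts(data, boundaries)

print("\n".join(stem_rows))
# Text histogram: one '#' per observation in each class.
for low, high, c in zip(boundaries, boundaries[1:], counts):
    print(f"{low:>4}-{high:<4} {'#' * c}")
```

Boundaries ending in .5 are convenient for integer data because no observation can land exactly on a boundary.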
Measures of Center
There are two primary measures of center: the mean and the median. There is a third measure, the mode.
The mean of the set is defined as the sum of the x's divided by n. Symbolically,
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
The median of an ordered dataset is the "middle" value in the set. If the dataset has an even number of values, the median is the mean of the two middle values.
If the distribution is symmetric and mound-shaped, the mean and the median will be close. If the distribution has outliers or is strongly skewed, the median is probably the better choice to describe the center. This is because the median is a resistant statistic, one whose numerical value is not dramatically affected by extreme values, while the mean is not resistant.
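A small sketch of what "resistant" means, using Python's standard `statistics` module. The numbers are made up for illustration: adding a single extreme value moves the mean a lot but barely moves the median.

```python
import statistics

values = [30, 32, 35, 38, 40]     # made-up, roughly symmetric data
with_outlier = values + [400]     # the same data plus one extreme value

mean_before = statistics.mean(values)            # 175 / 5 = 35
median_before = statistics.median(values)        # middle value = 35
mean_after = statistics.mean(with_outlier)       # 575 / 6, about 95.8
median_after = statistics.median(with_outlier)   # (35 + 38) / 2 = 36.5

print(mean_before, median_before)
print(round(mean_after, 1), median_after)
```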
Measures of Spread
Variance and Standard Deviation
One measure of spread based on the mean is the variance. Variance is the average squared deviation from the mean. It is a measure of spread because the more distant a value is from the mean, the larger the square of the difference between it and the mean.
$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$$
(This is the variance of a population, so the denominator is n, the population size, and the value subtracted inside the sum is μ, the population mean or expected value. The sample standard deviation s instead has the sample size minus 1 in the denominator and x_i - x̄ inside the sum.)
The square root of the variance is known as the standard deviation. For a sample,
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Note that μ is the population mean, while x̄ is the sample mean. The variance formula is therefore the population variance, while the standard deviation formula is for a sample, because it uses x̄ instead of μ.
σ is the standard deviation of a population. In statistics, however, most of the time you are dealing with sample data and not a whole population. Remember not to confuse the two.
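To see the difference between dividing by n and dividing by n - 1, here is a minimal sketch with made-up data. It computes both versions by hand and compares them with `statistics.pvariance` and `statistics.stdev` from the Python standard library.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values
n = len(data)
mean = sum(data) / n              # 40 / 8 = 5.0

# Sum of squared deviations from the mean: 9+1+1+1+0+0+4+16 = 32
sum_sq = sum((x - mean) ** 2 for x in data)

pop_var = sum_sq / n               # population variance: 32 / 8 = 4.0
pop_sd = math.sqrt(pop_var)        # population standard deviation: 2.0
sample_var = sum_sq / (n - 1)      # sample variance: 32 / 7
sample_sd = math.sqrt(sample_var)  # sample standard deviation, about 2.14

# The library functions use the same two denominators.
print(pop_var, statistics.pvariance(data))
print(round(sample_sd, 4), round(statistics.stdev(data), 4))
```

The sample version is always a little larger, because the same sum is divided by a smaller number.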
Although the standard deviation works well in situations where the mean works well (symmetric distributions), we need a measure of spread for situations where a mean-based measure is not appropriate. That measure is the interquartile range (IQR).
Quartiles: the medians of the lower and upper halves of the distribution, not including the median itself in either half, are called quartiles.
Lower quartile, or first quartile: the median of the lower half, the 25th percentile (Q1 on the calculator)
Upper quartile, or third quartile: the median of the upper half, the 75th percentile (Q3 on the calculator)
The median itself can be thought of as the second quartile.
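The quartile definition above (medians of the two halves, leaving the overall median out of both) can be sketched as follows. The function name `quartiles` and the data are hypothetical. Other software may use a slightly different quartile convention and give different values, and the sketch needs at least two data values.

```python
import statistics

def quartiles(values):
    """Return (Q1, median, Q3) as medians of the lower and upper halves.

    When the number of values is odd, the overall median is left out
    of both halves.
    """
    s = sorted(values)
    half = len(s) // 2
    lower = s[:half]     # values below the middle position
    upper = s[-half:]    # values above the middle position
    return statistics.median(lower), statistics.median(s), statistics.median(upper)

odd_case = quartiles([2, 4, 5, 7, 8, 10, 13])     # halves: [2, 4, 5] and [8, 10, 13]
even_case = quartiles([1, 2, 3, 4, 5, 6, 7, 8])   # halves: [1, 2, 3, 4] and [5, 6, 7, 8]
print(odd_case)
print(even_case)
```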
An outlier is a value far removed from the others. Some texts define an outlier as a data point that is more than two or three standard deviations from the mean, but there is no rigorous mathematical formula for determining whether or not something is an outlier. A common rule of thumb is the 1.5 × IQR rule:
- Find the IQR (IQR=Q3-Q1)
- Multiply the IQR by 1.5
- Find Q1 - 1.5(IQR) and Q3 + 1.5(IQR)
- Any value below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR) is an outlier.
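The steps above can be sketched as a short function. The name `iqr_outliers` and the data are made up for illustration, and the quartiles are taken as the medians of the lower and upper halves, excluding the overall median.

```python
import statistics

def iqr_outliers(values):
    """Return (outliers, (low_fence, high_fence)) using the 1.5 x IQR rule."""
    s = sorted(values)
    half = len(s) // 2
    q1 = statistics.median(s[:half])     # median of the lower half
    q3 = statistics.median(s[-half:])    # median of the upper half
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [x for x in s if x < low_fence or x > high_fence]
    return outliers, (low_fence, high_fence)

data = [1, 10, 11, 12, 13, 14, 15, 16, 30]   # made-up values
outliers, fences = iqr_outliers(data)
# Q1 = 10.5, Q3 = 15.5, IQR = 5, so the fences are 3.0 and 23.0
print(fences)
print(outliers)
```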
Online example: Khan Academy, find the outliers.
Position of a Term in a Distribution
The five-number summary of a dataset is composed of the minimum value, the lower quartile, the median, the upper quartile, and the maximum value.
On the TI-84: minX, Q1, Med, Q3, and maxX
A boxplot is simply a graphical version of the five-number summary.
In this graph, the two dots mark the lower and upper quartiles, which are 2 and 10, and the vertical line inside the box is the median, which is 6. The two sides of the box represent Q1 and Q3.
Example: Khan Academy, reading box plots
Percentile Rank of a Term
The percentile rank of a term in a distribution equals the proportion of terms in the distribution that are less than the term.
For example, a term at the 75th percentile is larger than 75% of the terms in the distribution.
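As a quick illustration of this definition, the sketch below counts the terms strictly below a given term and reports the result as a percentage. The function name `percentile_rank` and the scores are made up.

```python
def percentile_rank(values, term):
    """Percentage of values that are strictly less than `term`."""
    below = sum(1 for v in values if v < term)
    return 100 * below / len(values)

scores = [55, 60, 62, 70, 71, 75, 80, 88, 90, 95]   # made-up scores
rank_of_88 = percentile_rank(scores, 88)            # 7 of the 10 scores are below 88
print(rank_of_88)
```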
The z-score of a term represents the number of standard deviations the term is above or below the mean.
$$z = \frac{x - \bar{x}}{s}$$
(s is the standard deviation of the sample.)
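A minimal sketch of a z-score computed from sample data, assuming the sample mean and the sample standard deviation (`statistics.stdev`, which divides by n - 1) are the right reference values. The data and the helper name `z_score` are made up.

```python
import statistics

heights = [60, 62, 64, 66, 68]     # made-up sample
x_bar = statistics.mean(heights)   # sample mean: 64
s = statistics.stdev(heights)      # sample standard deviation: sqrt(40 / 4)

def z_score(x):
    """Number of sample standard deviations that x lies above (+) or below (-) the mean."""
    return (x - x_bar) / s

z = z_score(68)
print(round(z, 2))
```

A positive z-score means the term is above the mean, a negative one means it is below, and a term equal to the mean has a z-score of 0.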
Example: Khan Academy, Z score
A normal distribution is symmetric and looks like a bell curve.
- About 68% of the terms in a normal distribution are within one standard deviation of the mean.
- About 95% are within two standard deviations.
- About 99.7% are within three standard deviations.
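These three percentages can be checked numerically with `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later). This is only a verification sketch, not part of the original notes.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Area within k standard deviations of the mean = cdf(k) - cdf(-k)
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(k, round(100 * area, 1))
```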
Standard Normal Distribution
If x is a variable that has a normal distribution with mean μ and standard deviation σ, there is a related distribution we obtain by standardizing the data in the distribution to produce the standard normal distribution.
Now the z-score is
$$z = \frac{x - \mu}{\sigma}$$
Remember to use a calculator to find the area under the curve.
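If no calculator is at hand, the same area can be found in Python. The sketch below standardizes a value and looks up the area to its left under the standard normal curve with `statistics.NormalDist`. The mean, standard deviation, and x value are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 100, 15     # hypothetical population mean and standard deviation
x = 130

z = (x - mu) / sigma    # standardize: (130 - 100) / 15 = 2.0

# Area to the left of z under the standard normal curve.
area_below = NormalDist().cdf(z)

# The same area taken directly from the original (non-standardized) distribution.
area_direct = NormalDist(mu, sigma).cdf(x)

print(z, round(area_below, 4), round(area_direct, 4))
```

Both lookups agree, which is the point of standardizing: any normal distribution can be read off the standard normal one once the value is converted to a z-score.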