
AP Statistics Study Notes 1: One-Variable Data Analysis, by 膝盖Robert


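The AP exam is coming up,

so I am going through the material once more and summarizing the key points along the way. The key points are taken from 五分制胜AP统计学.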

One-Variable Data Analysis

The goal is to describe a dataset in terms of its shape, center, and spread.

Graphical Analysis

Terms to know: shape, gaps, clusters of data points, outliers

Shape:

  • symmetric (has symmetry around some axis)
  • mound-shaped (bell-shaped)
  • skewed (data are skewed to the left if the tail is to the left)

(skewed to the left)

  • bimodal (has more than one location with many scores)
  • uniform (frequencies of the various values are more or less constant)

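Four types of graph help us understand the shape of a distribution: dotplots, stemplots, histograms, and boxplots (boxplots come up later, together with the five-number summary).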

  • Dotplot

A very simple type of graph that plots each data value as a dot.

  • Stemplot (Stem and Leaf Plot)  

Stemplots are typically used when there is a moderate amount of quantitative data to analyze; stemplots of more than 50 observations are unusual. The name "stemplot" comes from its layout: a "stem" holding the largest place-value digits sits on the left (shaded in gray in the image below) and the "leaves" sit on the right.

  • Histogram

A bar graph is used to illustrate qualitative data, and a histogram is used to illustrate quantitative data. The horizontal axis of a histogram contains numerical values, and the vertical axis contains the frequencies, or relative frequencies, of those values.

The class-interval boundaries of this graph are 12.5, 17.5, 22.5, 27.5, 32.5, 37.5...
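To see how class intervals turn raw values into histogram bars, here is a small sketch. It assumes the boundaries listed above (starting at 12.5, width 5); the data values and the helper name `class_frequencies` are made up for illustration.

```python
def class_frequencies(data, start, width, n_classes):
    """Count how many values fall in each class interval [low, low + width)."""
    counts = [0] * n_classes
    for x in data:
        index = int((x - start) // width)
        if 0 <= index < n_classes:  # ignore values outside the covered range
            counts[index] += 1
    return counts

data = [13, 14, 18, 19, 21, 24, 26, 29, 33, 36]  # hypothetical data
counts = class_frequencies(data, start=12.5, width=5, n_classes=5)

# Print a rough text histogram, one row per class interval.
for i, count in enumerate(counts):
    low = 12.5 + 5 * i
    print(f"{low:4.1f} to {low + 5:4.1f}: {'*' * count}")
```

Using boundaries such as 12.5 and 17.5 means that whole-number data can never land exactly on a boundary, so every value belongs to exactly one class.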

Measures of Center

There are two primary measures of center: the mean and the median. There is a third measure, the mode.

Mean

The mean of the set is defined as the sum of the x's divided by n. Symbolically,

x bar = (Σ xi) / n

Median

The median of an ordered dataset is the "middle" value in the set. If the set has an even number of values, the median is the mean of the two middle values.

Resistant

If the distribution is symmetric and mound-shaped, the mean and the median will be close. If the distribution has outliers or is strongly skewed, the median is probably the better choice to describe the center. This is because the median is a resistant statistic, one whose numerical value is not dramatically affected by extreme values, while the mean is not resistant.
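A quick way to convince yourself of this is to add one extreme value to a small dataset and compare. The numbers below are made up; the sketch uses Python's standard `statistics` module.

```python
import statistics

data = [2, 3, 4, 5, 6]       # hypothetical, symmetric data
with_outlier = data + [100]  # the same data plus one extreme value

mean_before = statistics.mean(data)
median_before = statistics.median(data)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("without outlier:", mean_before, median_before)
print("with outlier:   ", mean_after, median_after)
```

The mean jumps from 4 to 20 while the median only moves from 4 to 4.5, which is what "resistant" means in practice.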

(五分制胜, p. 67)

Measures of Spread

Variance and Standard Deviation

One measure of spread based on the mean is the variance. Variance is the average squared deviation from the mean. It measures spread because the more distant a value is from the mean, the larger the square of the difference between it and the mean.

σ² = Σ(xi - μ)² / n

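(This is the variance of a population, so the denominator is n, the population size, and the value subtracted inside the Σ is μ, the expected value. The formula below gives s, the standard deviation of a sample, so the denominator is the sample size minus 1 and the term inside the Σ is xi - x bar.)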

The square root of the variance is known as the standard deviation.

s = √( Σ(xi - x bar)² / (n - 1) )

Note that μ is the population mean while x bar is the sample mean. So the variance formula is for the population variance, while the standard deviation formula is for the standard deviation of a sample, because it uses x bar instead of μ.

σ is the standard deviation of a population. However, in statistics you are usually dealing with sample data rather than a whole population. Remember not to mix up the two.
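The difference between the two formulas is easy to check numerically. This is a minimal sketch with made-up data; the function names are my own, and Python's `statistics.pvariance` and `statistics.stdev` should give the same results.

```python
import math

def population_variance(values):
    """Average squared deviation from the mean: divide by n."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_std(values):
    """Sample standard deviation s: divide by n - 1, then take the square root."""
    x_bar = sum(values) / len(values)
    return math.sqrt(sum((x - x_bar) ** 2 for x in values) / (len(values) - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, mean is 5

print(population_variance(data))             # treats data as a whole population
print(math.sqrt(population_variance(data)))  # population standard deviation
print(sample_std(data))                      # treats data as a sample
```

For the same numbers, s always comes out a little larger than σ because it divides by n - 1 instead of n.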

Interquartile Range

Although the standard deviation works well in situations where the mean works well (symmetric distributions), we need another measure for cases where a mean-based measure is not appropriate. That measure is the interquartile range, IQR = Q3 - Q1.

Quartiles: the medians of the lower and upper halves of the distribution, not including the median itself in either half.

Lower quartile, or first quartile: the median of the lower half, the 25th percentile (Q1 on the calculator)

Upper quartile, or third quartile: the median of the upper half, the 75th percentile (Q3 on the calculator)

The median itself can be thought of as the second quartile.
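Here is a small sketch of the "median of each half" definition above. The function name `quartiles` and the data are made up, and it assumes the dataset has at least two values. Note that software packages use several slightly different quartile conventions, so other tools may give slightly different numbers.

```python
import statistics

def quartiles(values):
    """Return (Q1, median, Q3), leaving the median out of both halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]   # values below the median position
    upper = ordered[-half:]  # values above it; for odd n the middle value is skipped
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

data = [1, 3, 4, 6, 7, 8, 10]  # hypothetical data
q1, med, q3 = quartiles(data)
print(q1, med, q3, "IQR =", q3 - q1)
```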

Outliers

An outlier is a value far removed from the others. Some texts define an outlier as a data point that is more than two or three standard deviations from the mean, but there is no rigorous mathematical formula for determining whether or not something is an outlier.

A commonly used rule is the 1.5(IQR) rule (五分制胜, p. 71 example):

  1. Find the IQR (IQR=Q3-Q1)
  2. Multiply the IQR by 1.5
  3. Find Q1 - 1.5(IQR) and Q3 + 1.5(IQR)
  4. Any value below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR) is an outlier.
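The four steps above can be sketched in a few lines. This is only an illustration with made-up data; the name `find_outliers` is my own, and the quartiles are computed as medians of the two halves, as in the definition given earlier.

```python
import statistics

def find_outliers(values):
    """Return the values outside Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])
    q3 = statistics.median(ordered[-half:])
    iqr = q3 - q1                # step 1
    step = 1.5 * iqr             # step 2
    low_fence = q1 - step        # step 3
    high_fence = q3 + step
    # step 4: anything beyond the two fences is flagged
    return [x for x in ordered if x < low_fence or x > high_fence]

data = [1, 3, 4, 6, 7, 8, 10, 30]  # hypothetical data
print(find_outliers(data))
```

For this data Q1 = 3.5 and Q3 = 9, so the fences are -4.75 and 17.25, and only 30 is flagged.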

Online examples: Khan Academy - find the outliers.

Position of a Term in a Distribution

Five-Number Summary

The five-number summary of a dataset is composed of the minimum value, the lower quartile, the median, the upper quartile, and the maximum value.

On the TI-84: minX, Q1, Med, Q3, and maxX

Boxplots

(The ordering here follows the one in 五分制胜; the content itself belongs to graphical analysis.)

A boxplot is simply a graphical version of the five-number summary.

In this graph, the two dots at the ends represent the maximum and the minimum, which are 10 and 2. The median is the vertical line inside the box, at 6. The two sides of the box represent Q1 and Q3.

Examples: Khan Academy - reading box plots

(五分制胜, p. 72 example)

Percentile Rank of a Term

The percentile rank of a term in a distribution equals the proportion of terms in the distribution that are less than the term.

For example, a term at the 75th percentile is larger than 75% of the terms in the distribution.
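As a quick check of this definition, the sketch below counts how many terms are smaller than a given term. The scores and the function name `percentile_rank` are made up for illustration.

```python
def percentile_rank(values, term):
    """Percentage of the values that are strictly less than term."""
    below = sum(1 for x in values if x < term)
    return 100 * below / len(values)

scores = [55, 60, 65, 70, 75, 80, 85, 90]  # hypothetical test scores
rank = percentile_rank(scores, 85)
print(rank)  # 6 of the 8 scores are below 85
```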

Z-scores

A z-score represents the number of standard deviations a term is above or below the mean.

z = (x - x bar) / s

(s is the standard deviation of the sample.)
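A small numerical sketch of the z-score for sample data follows. The data are made up so that x bar = 6 and s = 2, and `z_score` is just an illustrative name; `statistics.stdev` is the sample standard deviation (it divides by n - 1).

```python
import statistics

def z_score(x, values):
    """Number of sample standard deviations that x lies above (+) or below (-) the mean."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation
    return (x - x_bar) / s

data = [4, 4, 6, 8, 8]  # hypothetical sample: x bar = 6, s = 2
print(z_score(8, data))  # one standard deviation above the mean
print(z_score(3, data))  # one and a half standard deviations below the mean
```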

Example: Khan Academy - z-score

Normal Distribution

A normal distribution is symmetric and looks like a bell curve.

68-95-99.7 rule

About 68% of the terms in a normal distribution are within one standard deviation of the mean.

About 95% are within two standard deviations of the mean.

About 99.7% are within three standard deviations of the mean.
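The three percentages can be checked against the normal curve itself. This sketch uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8); the helper name `share_within` is my own.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

The rule rounds these values, so it is meant for quick estimates; use the calculator when a precise area is needed.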

Standard Normal Distribution

If x is a variable that has a normal distribution with mean μ and standard deviation σ, standardizing the data produces a related distribution, the standard normal distribution, which has mean 0 and standard deviation 1.

Now the z score is

z=(x-μ)/σ

Remember to use a calculator to find the area under the curve.
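To see why standardizing works, the sketch below finds the area to the left of a value in two ways: directly from a normal distribution with mean μ and standard deviation σ, and from the standard normal distribution after converting to z. The numbers (μ = 100, σ = 15, x = 130) are made up for illustration.

```python
from statistics import NormalDist

mu, sigma = 100, 15  # hypothetical population mean and standard deviation
x = 130

z = (x - mu) / sigma                        # standardize: z = (x - μ) / σ
area_via_z = NormalDist().cdf(z)            # area left of z under the standard normal
area_direct = NormalDist(mu, sigma).cdf(x)  # area left of x under the original curve

print(z)
print(round(area_via_z, 4), round(area_direct, 4))
```

Both areas agree, which is why a single standard normal table (or calculator function) is enough for any normal distribution.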

You can head over to Khan Academy to find practice problems :)