Tags:
A link to notes (PDF) I had when I took a statistics unit a while ago. Not guaranteed to be correct.
Statistics is the science of generalizing knowledge from data. A population is the complete set of elements being studied.
Variations in sampling
z critical confidence interval critical value
Why normal distribution is important:
A point estimator of the population mean:
Most repeated
Value in the middle
Spread is variance, how spread out is the graph?
Applications in:
In finance, risk is often another term for a stock's variance. Some stocks are steady (low risk) but offer lower potential returns. Others swing wildly (high risk) but offer more potential upside.
Measures the average distance your data values are from the mean. It's also \sqrt{variance}.
Closely grouped data has a small standard deviation; spread-out data has a large one.
It is:
Lower case /s/ means sample standard deviation
In the sample standard deviation we divide by /n-1/ rather than /n/. The /-1/ decreases the denominator, making the result, s, slightly bigger, which compensates for a sample tending to underestimate the population's variation.
Simpler formula:
or
Symbol for population standard deviation is \sigma. \ \mu is population mean.
We divide by N rather than n-1 because every member of the population is measured, so there is no sampling shortfall to correct for and we don't want to overestimate the variation.
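The difference between the two denominators can be checked with a short sketch. The data values below are hypothetical, picked only to keep the arithmetic simple; the results are compared against Python's `statistics.pstdev` and `statistics.stdev`.

```python
import statistics

# Hypothetical data, chosen only so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
sum_sq = sum((x - mean) ** 2 for x in data)

# Population standard deviation: divide by N.
sigma = (sum_sq / len(data)) ** 0.5
# Sample standard deviation: divide by n - 1, which gives a larger result.
s = (sum_sq / (len(data) - 1)) ** 0.5

print("population:", sigma, statistics.pstdev(data))
print("sample:    ", s, statistics.stdev(data))
```

For the same numbers, s always comes out larger than \sigma because its denominator is smaller.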
The proportion (or percentage) of a dataset that falls within a certain number of standard deviations from the mean. Applies only to a normally distributed dataset. Also called the 68%, 95%, 99.7% rule. \ If data is normally distributed then:
Data values /within/ 2 standard deviations are usual. Data values /outside/ 2 standard deviations are unusual. A data value outside of 3 standard deviations from the mean is extremely rare.
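The 68%, 95%, 99.7% figures can be recovered from the standard normal curve. A minimal sketch using `statistics.NormalDist` from the Python standard library; the helper name `share_within` is made up for this illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} std dev: {share_within(k) * 100:.1f}%")
```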
Given different standard deviations (with different units, values, samples etc), we have to find a way to represent what has more spread. To do this we use:
When we take many samples of the same size from a population and find the sample means \bar x, the means of those samples follow a normal curve when placed in their own distribution.
When we take many samples of the same size from a normal population and then find those sample variances s^2, those sample variances don't follow a normal curve when placed in their own distribution.
They follow the chi-square \chi^2 distribution with n-1 degrees of freedom.
Compares sample variance to pop variance. We try to estimate population variance
A \chi^2 distribution has a tail to the right.
Given n = 12 and confidence level = 95%. Find the critical value which make the distribution.
Solution
Because we try to estimate pop variance then
We have two values A and .
From above it's .
The right-tail critical value \chi^2_R is larger, therefore when (n-1)s^2 is divided by it, we get a smaller value. Therefore:
Variance:
Standard deviation:
The significance level \alpha and the confidence level are complementary (they add up to 1).
This says our pop variance lies within this range with 95% certainty
We sample 10 phone chargers and we have a std dev of 0.15 volts. Construct a 95% CI for \sigma^2 and \sigma.
Solution
For voltage specifically use the square root to get:
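A rough sketch of the phone charger calculation, using the usual interval (n-1)s^2/\chi^2_R < \sigma^2 < (n-1)s^2/\chi^2_L. The two critical values are an assumption: they are read from a standard chi-square table for df = 9 with 0.025 in each tail, not computed here.

```python
import math

n = 10      # sample size from the example
s = 0.15    # sample standard deviation in volts
df = n - 1

# Assumption: critical values taken from a chi-square table
# for df = 9 and 95% confidence (alpha / 2 = 0.025 in each tail).
chi2_right = 19.023
chi2_left = 2.700

# Dividing by the larger critical value gives the lower limit.
var_low = df * s**2 / chi2_right
var_high = df * s**2 / chi2_left

# Square roots turn the variance limits into standard deviation limits (volts).
sd_low, sd_high = math.sqrt(var_low), math.sqrt(var_high)

print(f"{var_low:.4f} < sigma^2 < {var_high:.4f}")
print(f"{sd_low:.4f} < sigma < {sd_high:.4f}")
```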
Whether 2 sample variances are equal, given the limits of random sampling. We want to know whether a difference is statistically significant or caused by a sampling error.
The distribution of F ratios. Sample df = n - 1
Are the variances equal or not?
Comparing measures between or within datasets. This lets you compare the variation of two samples or populations.
The ratio of standard deviation to the mean as a percentage
The number of standard deviations that a data value is away from the mean. Same for sample as well as population. Z scores can be negative or positive. A z score at the mean is 0. Z scores can also be usual (>= -2 && <= 2) or unusual (< -2 || > 2). The larger the z score in terms of absolute value, the rarer the data.
Sample
Population
Data has to be sorted, has to be values. Go from left to right:
|---|---|---|----|----|----|----|----|
| 1 | 3 | 6 | 10 | 15 | 21 | 28 | 36 |
|---|---|---|----|----|----|----|----|
In the sample above:\ Q1 = 4.5\ Q2 = 12.5\ Q3 = 24.5
|---|---|---|----|----|----|----|----|----|
| 1 | 3 | 6 | 10 | 15 | 21 | 28 | 36 | 39 |
|---|---|---|----|----|----|----|----|----|
With an odd number of values, the median (15) is left out of both halves when finding Q1 and Q3; we pretend it doesn't exist.\ Q1 = 4.5\ Q2 = 15\ Q3 = 32
Percentiles separate the data into 100 parts, so there are 99 percentiles (cut points).
75th percentile - 25th percentile
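The two quartile examples and the interquartile range can be reproduced with a small sketch of the "median of each half" method used in these notes. The function name `quartiles` is hypothetical, and other quartile conventions (including the defaults of some libraries) give slightly different values.

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 by the median-of-halves method (median dropped when n is odd)."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For an odd count, skip the middle value so it belongs to neither half.
    upper = data[half:] if len(data) % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

even_sample = [1, 3, 6, 10, 15, 21, 28, 36]
odd_sample = [1, 3, 6, 10, 15, 21, 28, 36, 39]

for sample in (even_sample, odd_sample):
    q1, q2, q3 = quartiles(sample)
    print(q1, q2, q3, "IQR =", q3 - q1)
```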
The population is defined by the researcher e.g all women, all bulbs produced by a certain company etc. Populations can be large, it's hard to collect data on each member of a population.
Collecting data on all members of a population is called a [[https://en.wikipedia.org/wiki/Census][census]].
When we need to make a conclusion about our population we use a sample. A small but well chosen sample can accurately represent the population.
Sample guidelines:
Kinds of samples:
A sample is always an approximation of the population. Therefore:
Using a single value from a sample to approximate an entire population parameter.
You have no idea how accurate the point estimate is.
Range of numbers used to estimate a population parameter. Estimates a population proportion from a sample proportion.
They have:
Requirements:
Example: The 95% CI for p is 0.38 < p < 0.497. I don't know what p is, but I'm 95% confident that it falls between 0.38 and 0.497.
A z score that separates the likely region from the unlikely region.
Max difference between \hat p and p.
We are given that n = 670, \hat p = 0.85, and we also just learned that the standard error of the sample proportion is SE = \sqrt{\hat p(1-\hat p)/n}. Which of the below is the correct calculation of the 95% confidence interval?
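One way to work the question, as a sketch: the interval is \hat p \pm z \cdot SE with SE = \sqrt{\hat p(1-\hat p)/n}. The critical value 1.96 for 95% confidence is an assumption taken from a standard z table rather than from these notes.

```python
import math

n = 670
p_hat = 0.85
z_critical = 1.96  # assumption: standard z critical value for 95% confidence

# Standard error of the sample proportion.
se = math.sqrt(p_hat * (1 - p_hat) / n)

# Margin of error and the interval limits.
margin = z_critical * se
lower, upper = p_hat - margin, p_hat + margin

print(f"SE = {se:.4f}, margin of error = {margin:.4f}")
print(f"{lower:.3f} < p < {upper:.3f}")
```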
Measure of distance from the mean. \ How far from the mean is a given data point? How many standard deviations away (above or below) from the mean is a data point? Standard deviation here is a unit of measurement like a kg, meter etc.
z-scores are standardized measures where the unit is a standard deviation.
z score of the mean is 0 because it's zero distance from itself.
Like mass of person x is 5 kgs; we can say z-score of x is 1 standard deviation.
Independent variable should be on the x axis while the dependent variable should be on the y axis. Correlation seeks a statistical relationship between two variables or bivariate data.
A regression model is unique to the data it represents. Adding data will change the regression model. It's not proper to extrapolate above or below the data being evaluated. Regression asks how much better our line of fit is compared to only using the mean of the dependent variable.
It is tempting to assume that one variable causes another; however, correlation doesn't imply causation.
Correlation coefficient is a popular way of summarizing a scatter plot into one value between 1 and -1.
/A weak correlation is closer to 0; whereas a strong correlation is near 1 or -1/
Helps fit a straight line through the data. Minimizes the squared distances between the fitted line and the individual points.
Remembers if slope is pointing upwards or downwards
Shows how well the slope fits the data based on whether the correlation is weak or strong
Trying to see whether more fertilizer leads to higher yields of beans
|------------------|---|---|---|---|---|---|---|
| Fertilizer (lbs) | 2 | 1 | 3 | 2 | 4 | 5 | 3 |
|------------------|---|---|---|---|---|---|---|
| Bushels of beans | 4 | 3 | 4 | 3 | 6 | 5 | 5 |
|------------------|---|---|---|---|---|---|---|
| x | y | x \cdot y | x^2 | y^2 |
|---|---|-----------|-----|-----|
| 2 | 4 | 8         | 4   | 16  |
| 1 | 3 | 3         | 1   | 9   |
| 3 | 4 | 12        | 9   | 16  |
| 2 | 3 | 6         | 4   | 9   |
| 4 | 6 | 24        | 16  | 36  |
| 5 | 5 | 25        | 25  | 25  |
| 3 | 5 | 15        | 9   | 25  |
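The column totals of this table feed straight into the correlation coefficient and the least squares line. A minimal sketch using the textbook sum formulas; the variable names are illustrative and not taken from the notes.

```python
import math

fertilizer = [2, 1, 3, 2, 4, 5, 3]   # x, in lbs
bushels = [4, 3, 4, 3, 6, 5, 5]      # y

n = len(fertilizer)
sum_x = sum(fertilizer)
sum_y = sum(bushels)
sum_xy = sum(x * y for x, y in zip(fertilizer, bushels))
sum_x2 = sum(x * x for x in fertilizer)
sum_y2 = sum(y * y for y in bushels)

# Pearson correlation coefficient from the column totals.
numerator = n * sum_xy - sum_x * sum_y
r = numerator / math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))

# Least squares slope and intercept; the line passes through (mean x, mean y).
slope = numerator / (n * sum_x2 - sum_x**2)
intercept = sum_y / n - slope * sum_x / n

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(f"r = {r:.4f}, y = {intercept:.4f} + {slope:.4f} x")
```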
These are values of how far our values are from the line of best fit
Calculated by: r^2 = SSR/SST
r^2 * 100 gives the percentage of the variation in the dependent variable that is explained by the regression (SSR); the remainder is due to error (SSE).
We compare a model of the dependent variable on its own against a model of the dependent variable against the independent variable.
Assume:
Get the centroid point (\bar x, \bar y). Your line of best fit must pass through the centroid.
TODO
$$\Sigma (x_i - \bar x)^2$$
SSR = Sum of squared errors of \bar y alone - Sum of squared errors of best line of fit
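The relationship SSR = SST - SSE and r^2 = SSR/SST can be checked on the fertilizer data. This is a sketch under the assumption that the line of best fit is the ordinary least squares line; variable names are illustrative.

```python
x = [2, 1, 3, 2, 4, 5, 3]   # fertilizer (lbs)
y = [4, 3, 4, 3, 6, 5, 5]   # bushels of beans

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares line through the centroid (x_bar, y_bar).
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * xi for xi in x]

# SST: squared errors when only the mean of y is used as the model.
sst = sum((yi - y_bar) ** 2 for yi in y)
# SSE: squared errors left over around the line of best fit.
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
# SSR: the improvement the line gives over the mean alone.
ssr = sst - sse

r_squared = ssr / sst
print(f"SST = {sst:.4f}, SSE = {sse:.4f}, SSR = {ssr:.4f}, r^2 = {r_squared:.4f}")
```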
The dependent variable is binary. We want to link our probabilities back to 0 & 1.
Logistic regression seeks to:
Odds are probability of something occurring / probability of something not occurring
$$odds = \dfrac{P(occurring)}{P(not\ occurring)}$$
Probability of it not occurring is: 1 - probability of it occurring
$$odds = \dfrac{p}{q} = \dfrac{p}{1-p}$$
/What about events that have a probability of 1 occurring? We get odds of infinity/
Odds of getting heads:
odds(heads) = \dfrac{0.5}{0.5} = 1 or 1:1
Odds of getting 1 or 2:
odds(1 or 2) = \dfrac{0.333}{0.666} = \dfrac{1}{2} = 0.5 or 1:2
Odds of pulling out a diamond card:
odds(diamonds) = \dfrac{0.25}{0.75} = \dfrac{1}{3} = 0.333 or 1:3
/There are 52 cards in a deck and 4 suits (diamonds, spades, clubs & hearts) in equal numbers/
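The three odds examples above follow from odds = p / (1 - p). A small sketch; the function name `odds` is illustrative, and returning infinity for p = 1 is one way to handle the certain-event case raised earlier.

```python
def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    if p >= 1:
        # A certain event has infinite odds.
        return float("inf")
    return p / (1 - p)

heads = odds(1 / 2)        # fair coin
one_or_two = odds(2 / 6)   # six-sided die showing 1 or 2
diamond = odds(13 / 52)    # 13 diamonds in a 52-card deck

print(heads, one_or_two, diamond)
```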
A ratio of two odds. We are comparing the likelihood of getting an outcome in two separate "systems"
Used when we want to know how much we increase the odds of getting an outcome by changing one variable and holding all others constant. /The odds ratio for a variable shows how the odds change with a 1 unit increase in that variable, holding all other variables constant./ e.g
Say we want to start a casino and want to make some loaded coins to make sure the house wins. We may want to know how to load our coin so that the house wins but the players also win a few times to keep them coming. We want to know how many more times our loaded coin will get a certain outcome compared to a fair coin.
P(heads) = \dfrac{7}{10} = 0.7 \ odds(heads) = \dfrac{0.7}{0.3} = 2.333
P(heads) = \dfrac{1}{2} = 0.5 \ odds(heads) = \dfrac{0.5}{0.5} = 1.0
Odds ratio would be: \dfrac{2.333}{1.0} = 2.333
This means that the odds of getting heads on the loaded coin are 2.333 times the odds on the fair coin. Raising P(heads) from 0.5 to 0.7 multiplies the odds of heads by 2.333.
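The loaded coin comparison as a short sketch. It also prints the plain ratio of probabilities, to show that an odds ratio is not the same thing as "times more likely"; the variable names are illustrative.

```python
def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)

p_loaded = 0.7   # P(heads) on the loaded coin
p_fair = 0.5     # P(heads) on the fair coin

odds_ratio = odds(p_loaded) / odds(p_fair)
probability_ratio = p_loaded / p_fair

print(f"odds ratio        = {odds_ratio:.3f}")
print(f"probability ratio = {probability_ratio:.3f}")
```

The odds ratio is about 2.333 while the probability of heads is only 1.4 times higher, so the two should not be used interchangeably.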
The odds ratio for a variable in logistic regression represents how the odds change with 1 unit increase in that variable holding all other variables constant.
By increasing our credit score by one how do we affect the probability of getting a loan approved?
Body weight and sleep apnea. Categories:
Weight variable has an odds ratio of 1.07
This means a 1 pound increase in body weight multiplies the odds of having sleep apnea by 1.07.\ A 10 lbs increase in weight multiplies the odds by 1.98.\ A 20 lbs increase multiplies the odds by 3.87.
One could have high odds but still low probability for something. You may increase your odds of something but the probability of getting that outcome was still low to begin with. Another may have lower odds but high probability of getting an outcome.
Take the case of people of different ages on different diets and on different drugs, and their chances of getting sick because of it. Younger people have a low probability of getting sick whether or not they do things that increase their odds of getting sick.
Odds can have a large magnitude change even if the underlying probabilities are low.
We don't know p and we wish to estimate it. The estimate of p is written \hat p (p hat). We need a function that links the independent variable x axis with probabilities on the y axis.
We are estimating an unknown p for any given linear combo of independent variables. In the logit function we have 0 to 1 running along our x axis but we want to have them on our y axis. We can achieve that by taking the inverse of the logit function.
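A minimal sketch of the logit function and its inverse, assuming the standard definitions logit(p) = ln(p / (1 - p)) and inverse logit(x) = 1 / (1 + e^{-x}). The function names are illustrative.

```python
import math

def logit(p):
    """Log odds: maps a probability in (0, 1) onto the whole real line."""
    return math.log(p / (1 - p))

def inverse_logit(x):
    """Maps any real number (e.g. a linear combination of predictors) to (0, 1)."""
    return 1 / (1 + math.exp(-x))

# The inverse puts probabilities on the y axis for any x value.
for x in (-5, -1, 0, 1, 5):
    print(x, round(inverse_logit(x), 4))
```

Whatever value the linear combination of independent variables takes, the inverse logit returns something strictly between 0 and 1, which is what lets it be read as an estimated probability.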
Home owners loans: n = 1000, 1 = approved, 0 = not approved
Order matters.
The number of different ways that r objects can be selected from n objects. If there are n objects, how many different ways can we select groups of size r?
Often said as n choose r, denoted as C(n,r)
Order doesn't matter. Think of sets.
C(n,r) = \dfrac{n!}{r! (n-r)!}
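The formula can be checked against Python's built-in `math.comb` (available from Python 3.8). The helper `n_choose_r` is hypothetical and simply spells out the factorial formula; the example arguments are arbitrary.

```python
import math

def n_choose_r(n, r):
    """C(n, r) = n! / (r! (n - r)!), the number of unordered selections."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

for n, r in [(5, 2), (10, 8), (52, 5)]:
    print(f"C({n},{r}) =", n_choose_r(n, r), "| math.comb:", math.comb(n, r))
```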
The outcomes are finite and must be integers
A type of discrete distribution.
The probability of any given outcome is a combination of both the number of trials and the success rate.
Binomial: /bi/ means two and /nomial/ means name (in our case an outcome), so 2 outcomes. We categorize our outcomes as either a success or a failure.
Where (look under [[Finite math]]):
For a die, the probability of rolling a 4 is 30%. The die is rolled 10 times. Find the probability of rolling eight 4s.
Solution
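One way to work this, assuming the question means exactly eight 4s in the 10 rolls, is the binomial formula P(X = k) = C(n,k) p^k (1-p)^{n-k}. The function name `binomial_pmf` is illustrative.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n_rolls = 10
p_four = 0.30   # probability of rolling a 4, as given in the example

prob_eight_fours = binomial_pmf(8, n_rolls, p_four)
print(f"P(eight 4s in {n_rolls} rolls) = {prob_eight_fours:.6f}")
```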
This is the estimated population standard deviation from the sample standard deviation. Sample mean is unlikely to be equal to population mean. Standard deviation of the means of many samples from the population mean.
This is the variability among/between sample means vs variability within each sample
Therefore, the samples are likely to come from the same population. \ Why not multiple t-tests? The error compounds in each t-test.
ANOVA is really a variability ratio:
If the variance between the means is large relative to the variance within the samples, the ratio will be much larger than 1 and the samples likely don't come from a common population.
Overview
At least one mean is an outlier and each distribution is narrow; distinct from each other
Means are fairly close to overall mean and/or distributions overlap a bit, hard to distinguish
Means are very close to overall mean and/or distributions melt together
Also called single factor ANOVA (ANalysis Of VAriance).
Without getting the avg of the sum of squared deviations
SST (Sum of Squares Total)
SSC (Sum of Squares of the Columns)
SSE (Sum of Squares Error)
SST = SSE + SSC
Sum of squares: SS = \Sigma (x-\mu)^2
Sample variance:
H_0: \mu_1 = \mu_2 = \mu_3 \ H_a: There is at least one difference among the means \ \alpha = 0.05
|---|---|---|
| 1 | 2 | 3 |
|---|---|---|
| 1 | 2 | 2 |
| 2 | 4 | 3 |
| 5 | 2 | 4 |
Means within:
Means between:
Degrees of freedom:\
From the above we get the F_{critical} from our table.
For the above in our table we get F_{critical} of 5.14
$$ANOVA = \frac{SS_{total} - SS_{within}}{SS_{within}} = \frac{SS_{between}}{SS_{within}}$$
Therefore, we fail to reject our H_0 \ /Mean squared between/ is also /variance between/
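The whole example can be retraced in a few lines. This sketch assumes each column of the data table is one group, and it forms F from mean squares (each sum of squares divided by its degrees of freedom). The critical value 5.14 is the table value quoted above, not computed here.

```python
# Each inner list is one column (group) of the data table.
groups = [[1, 2, 5], [2, 4, 2], [2, 3, 4]]

all_values = [x for group in groups for x in group]
grand_mean = sum(all_values) / len(all_values)
group_means = [sum(group) / len(group) for group in groups]

# SST: every value against the grand mean.
sst = sum((x - grand_mean) ** 2 for x in all_values)
# SSC (between): each group mean against the grand mean, weighted by group size.
ssc = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
# SSE (within): every value against its own group mean.
sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of mean squares: variance between over variance within.
f_stat = (ssc / df_between) / (sse / df_within)

f_critical = 5.14  # table value for (2, 6) degrees of freedom at alpha = 0.05
reject_h0 = f_stat > f_critical

print(f"SST = {sst:.4f}, SSC = {ssc:.4f}, SSE = {sse:.4f}")
print(f"F = {f_stat:.4f}, reject H_0: {reject_h0}")
```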
Out of scope
How to conduct hypothesis tests on 2 population means.
Shows the difference within groups and compares it to difference between the same groups. In the case of the paired t-test we get the t value for paired data.
The paired t-test - also called two sample, within subjects, repeated measures and dependent samples t-test - is a statistical method used to measure the change within the same sample after an event occurs. It uses paired or dependent data (where the data in one sample affects the data in the other sample e.g before and after a process such as taking a drug).
Paired t-test on the effectiveness of a weight loss drug.
#+CAPTION: Positive values indicate weight loss and negative values indicate weight gain
| subject | on drug | on placebo | d_i  |
|---------|---------|------------|------|
| 1       | 1.1     | 0          | 1.1  |
| 2       | 1.3     | -0.3       | 1.6  |
| 3       | 1.0     | 0.6        | 0.4  |
| 4       | 1.7     | 0.3        | 1.4  |
| 5       | 1.4     | 0.7        | 0.7  |
| 6       | 0.1     | -0.2       | 0.3  |
| 7       | 0.5     | 0.6        | -0.1 |
| 8       | 1.6     | 0.9        | 0.7  |
| 9       | -0.5    | -2.0       | 1.5  |
|---------|---------|------------|------|
|         |         |            | 7.6  |
#+TBLFM: $4=vsum(@2$4..@10$4)
The result is significant at p < 0.05. Since the p-value (0.005) is less than \alpha (0.05), we reject H_0. Therefore, we accept H_1 that our drug is effective at weight loss: if the drug had no effect, there would be only a 0.005 chance of seeing a difference at least this large.
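The t value for the weight loss table can be computed from the differences d_i with t = \bar d / (s_d / \sqrt{n}). This sketch only computes the test statistic; the critical value 2.306 (two-tailed, df = 8, \alpha = 0.05) is an assumption read from a standard t table.

```python
import math

on_drug = [1.1, 1.3, 1.0, 1.7, 1.4, 0.1, 0.5, 1.6, -0.5]
on_placebo = [0, -0.3, 0.6, 0.3, 0.7, -0.2, 0.6, 0.9, -2.0]

# Paired differences d_i, one per subject.
diffs = [drug - placebo for drug, placebo in zip(on_drug, on_placebo)]
n = len(diffs)

mean_d = sum(diffs) / n
# Sample standard deviation of the differences (n - 1 in the denominator).
sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))

# t statistic: mean difference divided by its standard error.
t_stat = mean_d / (sd_d / math.sqrt(n))

t_critical = 2.306  # assumption: two-tailed table value for df = 8, alpha = 0.05
print(f"sum of d_i = {sum(diffs):.1f}, mean = {mean_d:.4f}, s_d = {sd_d:.4f}")
print(f"t = {t_stat:.3f}, exceeds critical value: {abs(t_stat) > t_critical}")
```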
Get them from the p tables given the z score
Types of variables:
Exposure and outcome variables
predictor ::
response ::
Cumulative frequency :: summation of frequency
Testing whether a claim is valid.