WHY Inferential Statistics and Hypothesis Testing?
In today’s world of abundant data, every organization wants to optimize its products by making them more customer-centric. To do that, it needs to understand its population of customers, which usually means categorizing them by parameters such as age, education, location, lifestyle, income and so on. But in most cases it is extremely difficult to collect data for the entire population. That is where samples and sampling come into the picture. Sampling is the process of selecting a subset of a population that is representative of that population. The idea is to perform the analysis on the sample and make inferences about the population. Let us look at a simple example to understand this concept.
Consider a case where we want to find the average height of men and women in the city of Cincinnati and compare it to a value that is claimed to be the average height. To verify the claim exactly, we would need to collect data for the entire population of the city, which is next to impossible. What we can do instead is collect data from a few localities in the Cincinnati region and use it to estimate the average height of the city’s entire population. Here, the average height of the population is called a Parameter of the population, and the average height measured from the sample is called a Statistic of the sample. The sample statistic is used as an estimate of the population’s average height. This process of estimating a population parameter from a sample statistic is called inferential statistics.
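The parameter-versus-statistic idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the "population" is simulated (mean 170 cm, standard deviation 10 cm are made-up values, not real Cincinnati data), and the sample is a simple random sample rather than a handful of localities.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 simulated heights in cm.
# These are made-up numbers, not real Cincinnati data.
population = [random.gauss(170, 10) for _ in range(100_000)]

# Population parameter: unknown in practice, known here only because we simulated it.
parameter = statistics.mean(population)

# Draw a simple random sample of 200 people and compute the sample statistic.
sample = random.sample(population, 200)
statistic = statistics.mean(sample)

# Standard error: a rough measure of how far the statistic may sit from the parameter.
std_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"population parameter (mean): {parameter:.2f} cm")
print(f"sample statistic (mean):     {statistic:.2f} cm")
print(f"standard error of the mean:  {std_error:.2f} cm")
```

With a representative sample, the statistic should land within a few standard errors of the parameter, which is what makes the estimate useful.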
This was a naïve example. In the corporate world, there are several business cases in which hypothesis testing is used extensively. For example, an airline can use it to measure the effectiveness of a campaign, a pharma company can use it to measure the effectiveness of a new drug being introduced to the market, a manufacturing company can use it to understand the quality of a product being produced, and so on. We will cover several business examples in more detail in this post.
WHAT is Hypothesis Testing?
One of the major techniques used in inferential statistics is hypothesis testing. The general approach in any scientific experiment is to start with a statement that is then either supported or rejected by performing experiments. In statistics, the statement that is put forward is called a hypothesis, and the tests performed to decide whether the data support or contradict it are called hypothesis tests. There are five major types of problems that we generally come across in statistics. This post is an overview of these types of hypotheses and the techniques used to test them in industry.
HOW of Hypothesis testing:
In the following section, the various hypothesis testing techniques along with their business use cases are presented. These are the basic and most widely used techniques in the analytics industry.
· Test of means within a group
Consider the case where an outbreak of a disease was attributed to the level of calcium in the ice cream produced at a certain factory. Scientists measured the level of calcium in nine randomly sampled batches of ice cream. They know that a calcium content of more than 0.3 g/sample could cause the disease, so they want to find out whether there is significant evidence that the mean level of calcium in the ice cream was greater than 0.3 g/sample. The statistical technique generally used for such scenarios is the one-sample t-test. It compares the mean value of a parameter, estimated from the sample, to a hypothesized value, and inferences are made based on the result of the test.
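As a rough sketch of the ice cream scenario, the one-sample t-test can be run with SciPy’s `scipy.stats.ttest_1samp`. The nine calcium readings below are invented for illustration, the 0.05 significance level is an assumed convention, and the `alternative="greater"` argument assumes SciPy 1.6 or newer.

```python
from scipy import stats

# Hypothetical calcium readings (g/sample) for nine batches; illustrative values only.
calcium = [0.31, 0.35, 0.29, 0.38, 0.33, 0.36, 0.30, 0.34, 0.37]
threshold = 0.3  # hypothesized mean from the text
alpha = 0.05     # assumed significance level

# H0: mean <= 0.3 g/sample, H1: mean > 0.3 g/sample (one-sided test).
result = stats.ttest_1samp(calcium, popmean=threshold, alternative="greater")

print(f"t statistic: {result.statistic:.3f}")
print(f"p-value:     {result.pvalue:.4f}")

if result.pvalue < alpha:
    print("Evidence that the mean calcium level exceeds 0.3 g/sample.")
else:
    print("Not enough evidence that the mean exceeds 0.3 g/sample.")
```

The test is one-sided because the scientists only care whether the mean is above the threshold, not whether it differs in either direction.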
· Test of means between two groups
Consider an experiment to understand whether a drug’s effect is statistically significant. Six subjects were given the drug (treatment group) and an additional six subjects a placebo (control group), and the reaction time to a stimulus was measured (in ms). To compare the mean reaction times of the treatment and control groups, the test we would use is the two-sample t-test. The results of the test help us infer whether the drug had a significant impact.
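A minimal sketch of the drug experiment, using `scipy.stats.ttest_ind`. The reaction times are invented, and the choice of Welch’s variant (`equal_var=False`, which does not assume equal variances in the two groups) is an assumption of this example rather than something the scenario dictates.

```python
from scipy import stats

# Hypothetical reaction times in ms; six subjects per group, illustrative values only.
treatment = [91, 87, 99, 77, 88, 91]      # received the drug
control = [101, 110, 103, 93, 99, 104]    # received the placebo
alpha = 0.05                              # assumed significance level

# H0: both groups have the same mean reaction time (two-sided test).
# equal_var=False selects Welch's t-test.
result = stats.ttest_ind(treatment, control, equal_var=False)

print(f"t statistic: {result.statistic:.3f}")
print(f"p-value:     {result.pvalue:.4f}")

if result.pvalue < alpha:
    print("The mean reaction times differ significantly between the groups.")
else:
    print("No significant difference in mean reaction times.")
```

A negative t statistic here means the treatment group reacted faster on average than the control group.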
· Test of means between two groups with the same entity (paired tests)
Consider a petroleum company that claims the mileage offered by its premium brand of gas is better than that of its regular gas. To test the claim, we can perform a study with ten cars. Each of the ten cars is first filled with regular gas and its mileage recorded. The mileage is then recorded again for the same cars using the premium gas. Here, the statistical test that can be used to determine whether cars get significantly better mileage with premium gas is the paired t-test. We basically take the difference in mileage for each car (premium minus regular) and compare the mean of those differences to zero. If the mean difference is significantly greater than zero, we can infer that the premium gas does indeed provide better mileage.
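The gasoline study could look roughly like this with `scipy.stats.ttest_rel`. The mileage figures are made up for illustration, and `alternative="greater"` again assumes SciPy 1.6 or newer.

```python
from scipy import stats

# Hypothetical mileage (miles per gallon) for the same ten cars; illustrative values only.
regular = [16, 20, 21, 22, 23, 22, 27, 25, 27, 28]
premium = [19, 22, 24, 24, 25, 25, 26, 26, 28, 32]
alpha = 0.05  # assumed significance level

# H0: mean(premium - regular) <= 0, H1: mean(premium - regular) > 0.
# The i-th entries of both lists must belong to the same car.
result = stats.ttest_rel(premium, regular, alternative="greater")

print(f"t statistic: {result.statistic:.3f}")
print(f"p-value:     {result.pvalue:.4f}")

if result.pvalue < alpha:
    print("Premium gas gives significantly better mileage.")
else:
    print("No significant mileage gain from premium gas.")
```

Pairing matters: because each car acts as its own control, car-to-car variation drops out of the comparison, which an unpaired two-sample test would not exploit.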
· Test of population variance
Consider a company that manufactures thermostats. The manager of the company claims that the error margin in a thermostat reading is at most 0.5 degrees. This can be tested by taking readings from ten different thermostats and analyzing the spread of the values read by the instruments. If the spread is within the acceptable limit, the data are consistent with the manager’s claim. This is different from the tests described so far: here we want to test the spread of a parameter, otherwise called the variance, rather than its mean. The commonly used test of variance is the Chi-Squared test. By conducting the test, we can verify whether the products meet the claimed specifications. This type of analysis is widely used in quality control.
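A sketch of the chi-squared test for a single variance, computed by hand because SciPy has no dedicated function for it; only `scipy.stats.chi2` is used for the p-value. Two assumptions are baked in: the "error margin of 0.5 degrees" is interpreted as a claimed standard deviation of at most 0.5, and the ten reading errors are invented. The test also assumes the errors are roughly normally distributed.

```python
import statistics
from scipy import stats

# Hypothetical reading errors (degrees) from ten thermostats; illustrative values only.
errors = [0.2, -0.4, 0.1, 0.5, -0.3, 0.0, 0.6, -0.5, 0.3, -0.2]
sigma0 = 0.5   # claimed maximum standard deviation (our reading of "error margin")
alpha = 0.05   # assumed significance level

n = len(errors)
sample_var = statistics.variance(errors)  # unbiased sample variance (n - 1 divisor)

# Test statistic: (n - 1) * s^2 / sigma0^2, chi-squared with n - 1 degrees of freedom under H0.
chi2_stat = (n - 1) * sample_var / sigma0 ** 2

# H0: variance <= sigma0^2, H1: variance > sigma0^2, so use the upper tail.
p_value = stats.chi2.sf(chi2_stat, n - 1)

print(f"sample variance: {sample_var:.4f}")
print(f"chi-squared:     {chi2_stat:.3f}")
print(f"p-value:         {p_value:.4f}")

if p_value < alpha:
    print("The spread is significantly larger than claimed.")
else:
    print("No evidence that the spread exceeds the claimed limit.")
```

Failing to reject the null hypothesis does not prove the claim; it only means the sample gives no significant evidence against it.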
· Test of means between many groups
We are often interested in determining whether the means of more than two populations or groups are equal. To test whether the difference in means is statistically significant, we can perform an analysis of variance (ANOVA). If the ANOVA F-test shows a significant difference in means between the groups, we may want to perform multiple comparisons between all pairs of means to determine how they differ.
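The two-step procedure can be sketched with `scipy.stats.f_oneway` followed by pairwise t-tests. The three groups and their values are hypothetical, and the Bonferroni correction used for the pairwise step is just one simple choice picked for this example; other procedures such as Tukey’s HSD are common too.

```python
from itertools import combinations
from scipy import stats

# Hypothetical measurements for three groups; illustrative values only.
groups = {
    "A": [20, 21, 19, 22, 20],
    "B": [25, 27, 26, 24, 26],
    "C": [20, 22, 21, 19, 21],
}
alpha = 0.05  # assumed significance level

# Step 1: one-way ANOVA. H0: all group means are equal.
f_stat, p_value = stats.f_oneway(*groups.values())
print(f"F statistic: {f_stat:.3f}, p-value: {p_value:.6f}")

# Step 2: if ANOVA is significant, compare every pair of groups.
# Bonferroni correction: multiply each p-value by the number of pairs (capped at 1).
adjusted = {}
if p_value < alpha:
    pairs = list(combinations(groups, 2))
    for first, second in pairs:
        pair_p = stats.ttest_ind(groups[first], groups[second]).pvalue
        adjusted[(first, second)] = min(1.0, pair_p * len(pairs))
        verdict = "differ" if adjusted[(first, second)] < alpha else "do not differ"
        print(f"{first} vs {second}: adjusted p = {adjusted[(first, second)]:.4f} ({verdict})")
```

The correction is there because running several pairwise tests inflates the chance of a false positive; without it, some pair would look "significant" by luck more often than the chosen alpha suggests.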
This post presented a brief overview of the why, what and how of inferential statistics and hypothesis testing. There are numerous industrial use cases for hypothesis testing, and they are shaping the way business decisions are made. All of these tests can be implemented in tools like R, SAS, Python, SPSS and so on. We will cover the implementation of hypothesis testing using a dataset in a separate post in the coming weeks. Happy Learning!
Cheers!
Renga