Inferential Statistics & Hypothesis Testing

WHY Inferential Statistics and Hypothesis Testing?

In today’s world of abundant data, every organization wants to optimize its products by making them more customer-centric. To do that, it needs to develop an understanding of the population it serves, which means categorizing customers on parameters such as age, education, location, lifestyle, income and so on. But in most cases, it is extremely difficult to collect data for the entire population. That is where the concept of samples and sampling comes into the picture. Sampling is the process of selecting a subset of a population that is representative of that population. The idea is to perform the analysis on the sample and make inferences about the population. Let us look at a simple example to understand this concept.

Consider a case where we want to find the average height of men and women in the city of Cincinnati and compare it to a value that is claimed to be the average height. To verify the claim, we would need to collect data for the entire population of the city, which is next to impossible. What we can do instead is collect data from a few localities in the Cincinnati region and use it to estimate the average height of the city’s entire population. Here, the average height of the population is called the Parameter of the population, and the average height measured from the sample is called the Statistic of the sample. This sample statistic is used as an estimate of the population’s average height. The entire process of estimating a population parameter from a sample statistic is called inferential statistics.
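
To make the parameter vs. statistic distinction concrete, here is a minimal Python sketch. The "population" is simulated, so the heights, the population size and the sample size are all assumptions made for illustration and are not real Cincinnati data.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 simulated heights in cm (not real data).
population = [random.gauss(170, 10) for _ in range(100_000)]

# In practice we can only afford to measure a random subset.
sample = random.sample(population, 500)

population_mean = statistics.fmean(population)  # the parameter (usually unknown)
sample_mean = statistics.fmean(sample)          # the statistic (our estimate)

print(f"population mean (parameter): {population_mean:.2f}")
print(f"sample mean (statistic):     {sample_mean:.2f}")
```

The two printed values should be close but not identical. Hypothesis tests exist to judge whether a gap between a sample statistic and a claimed value is larger than sampling noise alone would explain.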

This was a simple example. In the corporate world, there are several business cases in which hypothesis testing is used extensively. For example, an airline can use it to measure the effectiveness of a campaign, a pharma company can use it to measure the effectiveness of a new drug being introduced to the market, a manufacturing company can use it to understand the quality of a product being produced, and so on. We will cover several business examples in this post in more detail.

WHAT is Hypothesis Testing?

One of the major techniques used in inferential statistics is hypothesis testing. The general approach in any scientific experiment is to start with a statement and then gather evidence for or against it by performing experiments. In statistics, the statement that is put forward is called a hypothesis, and the tests used to decide whether the data support or contradict it are called hypothesis tests. There are five main types of problems that we commonly come across in statistics. This post is an overview of these types of hypotheses and the techniques used in the industry to test them.

HOW of Hypothesis testing:

In the following section, the various hypothesis testing techniques along with their business use cases are presented. These are the basic and most widely used techniques in the analytics industry.

· Test of means within a group

Consider the case where an outbreak of a disease was attributed to the level of calcium in the ice cream produced at a certain factory. Scientists measured the level of calcium in nine randomly sampled batches of ice cream. The scientists know that a calcium content of more than 0.3 g/sample could cause the disease, so they want to know whether there is significant evidence that the mean level of calcium in the ice cream is greater than 0.3 g/sample. The statistical technique generally used for such scenarios is the one-sample t-test. It compares the mean of a sample to a hypothesized value, and inferences are made based on the result of the test.
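
A rough sketch of the one-sample t-test in plain Python might look like the following. The nine calcium readings are made up for illustration, the helper name `one_sample_t` is hypothetical, and the critical value is the one-sided 5% value for 8 degrees of freedom from a standard t-table.

```python
import math
import statistics

def one_sample_t(sample, hypothesized_mean):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    t = (mean - hypothesized_mean) / (sd / math.sqrt(n))
    return t, n - 1

# Hypothetical calcium readings (g/sample) for nine batches; not real measurements.
calcium = [0.31, 0.35, 0.29, 0.33, 0.36, 0.30, 0.34, 0.32, 0.37]

t_stat, df = one_sample_t(calcium, 0.3)

# One-sided critical value for alpha = 0.05 and df = 8, from a standard t-table.
T_CRITICAL = 1.860

print(f"t = {t_stat:.3f} with {df} degrees of freedom")
if t_stat > T_CRITICAL:
    print("Evidence that the mean calcium level exceeds 0.3 g/sample")
else:
    print("No significant evidence that the mean exceeds 0.3 g/sample")
```

The test is one-sided because the scientists only care whether the mean is greater than 0.3 g/sample. A statistics package would normally report a p-value instead of comparing against a table value.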

· Test of means between two groups

Consider an experiment to understand whether a drug’s effect is statistically significant, in which six subjects were given the drug (treatment group) and another six subjects a placebo (control group). The reaction time to a stimulus was measured (in ms). To compare the mean reaction times of the treatment and control groups, the test we would use is the two-sample t-test. Its results would help us infer whether the drug had a significant impact.
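
As an illustration, the sketch below computes a pooled-variance (Student) two-sample t statistic, which assumes the two groups have equal variances. That assumption, the helper name `pooled_two_sample_t` and the reaction times are all ours and are not taken from a real study.

```python
import math
import statistics

def pooled_two_sample_t(group_a, group_b):
    """Return (t statistic, degrees of freedom), assuming equal variances."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.fmean(group_a) - statistics.fmean(group_b)) / std_err
    return t, df

# Hypothetical reaction times in ms; invented for this sketch.
drug = [90, 88, 95, 85, 92, 90]
placebo = [100, 104, 98, 96, 102, 100]

t_stat, df = pooled_two_sample_t(drug, placebo)

# Two-sided critical value for alpha = 0.05 and df = 10, from a standard t-table.
T_CRITICAL = 2.228

print(f"t = {t_stat:.3f} with {df} degrees of freedom")
print("significant difference" if abs(t_stat) > T_CRITICAL else "no significant difference")
```

If the equal-variance assumption looks doubtful, Welch's version of the test is the usual alternative. It uses a different standard error and a different formula for the degrees of freedom.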

· Test of means between two groups with the same entity (paired tests)

Consider a petroleum company that claims the mileage offered by its premium brand of gas is better than that of regular gas. To test the claim, we can perform a study with ten cars. Each of the ten cars is first filled with regular gas and its mileage recorded. The mileage is then recorded again for the same cars using the premium gas. Here, the statistical test that can be used to determine whether cars get significantly better mileage with premium gas is the paired t-test. We basically compute, for each car, the difference between its mileage on premium gas and on regular gas, and compare the mean of these differences to zero. If the mean difference is significantly greater than zero, we can infer that the premium gas does indeed provide better mileage.
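
A paired t-test is just a one-sample t-test on the per-car differences, which the following sketch tries to show. The mileage figures and the helper name `paired_t` are hypothetical, and the critical value is the one-sided 5% value for 9 degrees of freedom from a standard t-table.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for the mean of after - before."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.fmean(diffs) / std_err, n - 1

# Hypothetical mileage (miles per gallon) for the same ten cars; invented data.
regular = [20, 22, 19, 24, 21, 23, 18, 25, 20, 22]
premium = [21, 24, 19, 25, 23, 24, 19, 27, 21, 23]

t_stat, df = paired_t(regular, premium)

# One-sided critical value for alpha = 0.05 and df = 9, from a standard t-table.
T_CRITICAL = 1.833

print(f"t = {t_stat:.3f} with {df} degrees of freedom")
print("premium is better" if t_stat > T_CRITICAL else "no significant improvement")
```

Pairing removes the car-to-car variation from the comparison, which is why this design can detect a small improvement that an unpaired two-sample test on the same numbers might miss.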

· Test of population Variance

Consider a company that manufactures thermostats. The manager of the company claims that the error margin in a thermostat reading is at most 0.5 degrees. This can be tested by taking readings from ten different thermostats and analyzing the spread of the values read by the instruments. If the spread is within the acceptable limit, the data are consistent with the manager’s claim. This is different from the tests presented in the earlier sections: here we want to test the spread of a parameter, otherwise called the Variance, rather than its mean. The commonly used test of a single variance is the Chi-Squared test. By conducting the test, we can verify whether the products meet the claimed specifications. This type of analysis is widely used in quality control.
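
One way to sketch the chi-squared test for a variance is shown below. It assumes the manager's "error margin" can be read as a claimed maximum standard deviation of 0.5 degrees, which is our interpretation. The ten readings and the helper name `chi_square_variance` are hypothetical.

```python
import statistics

def chi_square_variance(sample, claimed_sd):
    """Return (chi-squared statistic, degrees of freedom) for a variance test."""
    df = len(sample) - 1
    sample_var = statistics.variance(sample)  # n - 1 denominator
    return df * sample_var / claimed_sd ** 2, df

# Hypothetical readings (degrees) from ten thermostats in the same room.
readings = [70.3, 69.8, 70.5, 69.6, 70.1, 70.4, 69.7, 70.2, 69.9, 70.5]

chi2_stat, df = chi_square_variance(readings, claimed_sd=0.5)

# Upper-tail critical value for alpha = 0.05 and df = 9, from a chi-squared table.
CHI2_CRITICAL = 16.919

print(f"chi-squared = {chi2_stat:.3f} with {df} degrees of freedom")
if chi2_stat > CHI2_CRITICAL:
    print("Spread is significantly larger than claimed")
else:
    print("No evidence against the manager's claim")
```

The test is upper-tailed because only a spread larger than the claimed limit would contradict the manager. It also assumes the readings are roughly normally distributed.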

· Test of means between many groups

We are often interested in determining whether the means of more than two populations or groups are equal. To test whether the difference in means is statistically significant, we can perform an analysis of variance (ANOVA). If the ANOVA F-test shows that there is a significant difference in means between the groups, we may want to perform multiple comparisons between all pairwise means to determine how they differ.
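
The F statistic behind a one-way ANOVA compares the variation between group means to the variation within groups. Here is a minimal sketch with three hypothetical groups of five measurements each. The data and the helper name `one_way_anova_f` are invented for illustration.

```python
import statistics

def one_way_anova_f(groups):
    """Return (F statistic, df between groups, df within groups)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = statistics.fmean([x for g in groups for x in g])
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical measurements for three groups; not real data.
groups = [
    [5, 6, 7, 6, 6],
    [8, 9, 7, 8, 8],
    [10, 11, 9, 10, 10],
]

f_stat, df_b, df_w = one_way_anova_f(groups)

# Approximate critical value for alpha = 0.05 with (2, 12) df, from an F-table.
F_CRITICAL = 3.89

print(f"F = {f_stat:.2f} with ({df_b}, {df_w}) degrees of freedom")
print("group means differ" if f_stat > F_CRITICAL else "no significant difference")
```

A significant F only says that at least one group mean differs from the others. The pairwise comparisons mentioned above are what identify which groups differ.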

This post presented a brief overview of the why, what and how of inferential statistics and hypothesis testing. There are numerous industrial use cases for hypothesis testing, and they are shaping the way business decisions are made. All of these tests can be implemented in tools like R, SAS, Python, SPSS and so on. We will cover the implementation of hypothesis testing using a dataset in a separate post in the coming weeks. Happy Learning!

Cheers!

Renga


The Why What and How of Data Science
