The Essential Statistics for Data Science

Shashank Gollapalli
6 min read · Apr 3, 2023


Statistics is an essential tool in every data scientist’s toolkit. It helps you make sense of the data you’re working with and draw meaningful insights from it. In this article, I’ll discuss some of the key topics in statistics that are crucial for data science: descriptive statistics, inferential statistics, probability theory, and experimental design.


Descriptive statistics is the branch of statistics that summarizes and describes the main features of a dataset, using measures such as the mean, median, mode, standard deviation, variance, and range. It is a key component of data analysis because it helps uncover patterns and trends in the data.

For example, let’s say you’re working with a dataset that contains information on the heights of a group of people. You could use descriptive statistics to calculate the average height, the range of heights, and the standard deviation of the heights.
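As a quick illustration, here is how those summaries could be computed with Python’s standard-library statistics module. The height values below are made up purely for demonstration:

```python
import statistics

# Hypothetical heights in centimetres (illustrative values only)
heights = [162, 168, 171, 175, 158, 180, 166, 173]

mean_height = statistics.mean(heights)        # average height
height_range = max(heights) - min(heights)    # tallest minus shortest
std_height = statistics.stdev(heights)        # sample standard deviation

print(f"Mean: {mean_height:.1f} cm, Range: {height_range} cm, Std dev: {std_height:.1f} cm")
```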

One of the main concepts in descriptive statistics is central tendency, which is a measure of the center of the distribution. The three most commonly used measures of central tendency are the mean, median, and mode.

Mean, median, and mode for distributions with different skews

The mean is simply the average of all the values in the dataset. It is calculated by adding up all the values and dividing by the total number of values. For example, if we have a dataset of ages {25, 30, 35, 40, 45}, the mean age would be (25+30+35+40+45)/5 = 35.

The median, on the other hand, is the middle value of the dataset. To calculate the median, we first arrange the data in order from smallest to largest, and then find the middle value. If there are an odd number of values, the median is the middle value. If there are an even number of values, the median is the average of the two middle values. Using the same dataset of ages as before, the median age would be 35.

The mode is the most commonly occurring value in the dataset. If there are multiple values that occur with the same frequency, then there may be more than one mode. For example, if we have a dataset of test scores {80, 85, 90, 90, 95}, the mode would be 90.
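Putting the three measures together, the example datasets above can be summarized in a few lines of Python (statistics.multimode is used here because a dataset can have more than one mode):

```python
import statistics

ages = [25, 30, 35, 40, 45]      # dataset from the mean and median examples
scores = [80, 85, 90, 90, 95]    # dataset from the mode example

print(statistics.mean(ages))         # 35   -- the mean
print(statistics.median(ages))       # 35   -- the median
print(statistics.multimode(scores))  # [90] -- every value tied for the highest frequency
```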

Standard deviation, variance, and range are measures of variability in a dataset.

Standard deviation is a measure of how spread out the data is from the mean. It tells us how much the individual data points deviate from the average. A low standard deviation means that the data is clustered around the mean, while a high standard deviation means that the data is more spread out.

Variance is the average squared deviation from the mean. It measures the same spread as the standard deviation (the standard deviation is simply the square root of the variance), but it is expressed in the squared units of the data. Like the standard deviation, it gives an idea of how spread out the data is around the mean.

Range is simply the difference between the largest and smallest values in a dataset. It gives an idea of how much variation there is in the data. However, it doesn’t take into account the distribution of data points between the minimum and maximum values, unlike standard deviation and variance.
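The same statistics module exposes these spread measures directly. Note that variance and stdev compute the sample versions (dividing by n − 1); pvariance and pstdev give the population versions:

```python
import statistics

data = [25, 30, 35, 40, 45]

variance = statistics.variance(data)   # sample variance (squared units)
std_dev = statistics.stdev(data)       # sample standard deviation = sqrt(variance)
data_range = max(data) - min(data)     # largest value minus smallest value

print(f"Variance: {variance}, Std dev: {std_dev:.2f}, Range: {data_range}")
```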


Companies in the retail industry like Walmart, Target, and Amazon use descriptive statistics to analyze sales data and customer behavior to understand trends and patterns. They can use this information to make decisions about inventory, pricing, and marketing strategies.

Inferential statistics is the branch of statistics that deals with making predictions and inferences about a larger population based on a sample of data. This is important in data science because we often work with large datasets, but we cannot possibly collect data from the entire population. Instead, we take a smaller sample from the population and use inferential statistics to make predictions and draw conclusions about the population as a whole.

For example, we may infer that a new marketing campaign will be successful based on the positive response of a sample group, but acknowledge that there is some degree of uncertainty in making this prediction.
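One common way to express that uncertainty is a confidence interval. The sketch below uses made-up campaign numbers and a simple normal-approximation interval for a proportion:

```python
import math

# Hypothetical campaign results: 120 positive responses out of a sample of 400
positives, n = 120, 400
p_hat = positives / n                    # observed response rate

# 95% confidence interval using the normal approximation (z = 1.96)
se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
margin = 1.96 * se

print(f"Estimated response rate: {p_hat:.2f} "
      f"(95% CI: {p_hat - margin:.2f} to {p_hat + margin:.2f})")
```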

Standard deviation and variance also play an important role in inferential statistics, because the spread of the sample data determines how much uncertainty there is in any estimate we make about the population. As before, a small standard deviation indicates that the data points are clustered closely around the mean, while a large one indicates that they are more spread out.

Common techniques in inferential statistics include hypothesis testing, confidence intervals, regression analysis, analysis of variance (ANOVA), the chi-squared test, and the t-test.
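As one concrete example, a two-sample t-test checks whether the means of two groups differ by more than chance alone would explain. The sketch below uses simulated data and SciPy; the group names and numbers are purely illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical samples, e.g. order values from two customer segments
group_a = rng.normal(loc=50, scale=10, size=100)
group_b = rng.normal(loc=53, scale=10, size=100)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (conventionally below 0.05) suggests the difference in means
# is unlikely to be due to sampling variation alone.
```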

Understanding inferential statistics is crucial for making accurate predictions and inferences in data science. By using statistical tests to analyze sample data and drawing conclusions about the larger population, we can make informed decisions and predictions in a variety of fields.


Pharmaceutical companies like Pfizer use inferential statistics to analyze clinical trial data to determine whether a new drug is safe and effective. They can use this information to make decisions about whether to seek FDA approval for the drug.

Probability theory is the branch of mathematics that deals with the analysis of random events. It involves calculating the likelihood of an event occurring based on a set of possible outcomes. In data science, probability theory helps predict the likelihood of future events.

For example, let’s say a company wants to estimate the likelihood of a customer returning to purchase their product again in the future. They collect data on previous customers and their purchasing behavior and use probability theory to analyze the data to calculate the probability of a customer returning to make another purchase.
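In its simplest form, that probability can be estimated as a relative frequency from historical data and then combined using the rules of probability. The purchase history below is invented for illustration:

```python
# Hypothetical purchase history: 1 = customer returned, 0 = customer did not return
purchase_history = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0]

# Relative-frequency estimate of P(return)
p_return = sum(purchase_history) / len(purchase_history)
print(f"Estimated probability of a repeat purchase: {p_return:.2f}")

# Probability that at least one of 3 new customers returns,
# assuming the customers behave independently
p_at_least_one = 1 - (1 - p_return) ** 3
print(f"P(at least one of 3 returns): {p_at_least_one:.2f}")
```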

Insurance companies also use probability theory to calculate the likelihood of different types of events occurring (such as car accidents or home burglaries) and use that information to set premiums for their policies.

Experimental design is the process of planning and executing experiments in order to test a hypothesis. It involves designing experiments that can produce valid and reliable results. Experimental design helps ensure that the data collected is accurate and useful.


Technology companies like Google and Facebook use experimental design to test different versions of their products or features with a small group of users to see which one performs best. The usual approach is to run an A/B test (an experiment in which a random sample of website visitors is split into two groups: one sees the website with the new feature, while the other, the control group, sees the website without it). The results inform decisions about which features to roll out to a larger audience.
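A typical way to analyze such an A/B test is a two-proportion z-test on the conversion rates of the two groups. The numbers below are invented, and the test is computed by hand with the normal approximation:

```python
import math
from scipy.stats import norm

# Hypothetical results: conversions out of visitors in each group
conv_a, n_a = 200, 5000   # control group (without the new feature)
conv_b, n_b = 240, 5000   # treatment group (with the new feature)

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)              # pooled conversion rate

se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se                                  # z-statistic for the difference
p_value = 2 * norm.sf(abs(z))                         # two-sided p-value

print(f"Control: {p_a:.1%}, Treatment: {p_b:.1%}, z = {z:.2f}, p = {p_value:.4f}")
```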

In conclusion, a solid foundation in statistics is essential for any data scientist. Descriptive statistics allow us to summarize and understand data, while inferential statistics enable us to make predictions and draw conclusions about a larger population. Probability theory provides a framework for understanding uncertainty and randomness, while experimental design ensures that our analyses are rigorous and unbiased.

By mastering these essential statistical concepts, data scientists can gain a deeper understanding of the data they are working with, and make more informed decisions. As the field of data science continues to evolve, it is clear that a strong foundation in statistics will remain a critical component of any data scientist’s toolkit.

Check out my previous article on types of datasets.

Let’s connect on LinkedIn!
