How many phones will be sold in the next quarter? How many people will come down with the flu in the next month? How many times will the CPU crash? Count data regression can help us answer these questions. Counts are everywhere, so no matter your background, these data distributions will come in handy.
A data distribution tells us what the possible values of a variable are and how often these values occur. We can take anything from a person's height to IQ scores and see how frequent all the possible values are. Height and IQ scores are classic examples of the so-called normal distribution, which describes many other phenomena in the world.
However, when we work with counts, things get tricky.
Height is a continuous variable, meaning that a difference of one inch or centimeter represents the same change in height anywhere on the scale. For instance, if someone is 1 inch (2.5 cm) taller than you, it is the same difference whether you are 5 ft or 6 ft tall.
In the case of counts, this is different. Imagine you go for a walk in your favorite park and decide to count how many squirrels you see. You go to this park every day after lunch and walk around for an hour. After a few weeks of collecting this data, you would see that on some days, there were none, and on other days you would see 2 or 3, even 5. The daily count of squirrels is a discrete variable that could be described or predicted with count data regression.
Linear regression is a way of describing a relationship between variables with a straight line. Imagine that you are a nature photographer and want to take pictures of the squirrels in your park or of a different animal somewhere else. You want to optimize the amount of time you wait for squirrels to show up. In other words, you want to predict how many squirrels will show up in the next few days. You could start collecting information about sunlight, the busyness of the park and so on. These would be your predictive variables.
So far, so good: you have the data, and you can feed it into a regression model, i.e. an equation. This is where the data distributions come in. What are your options?
I will cover three count data distributions (there are other options, though): Poisson, negative binomial and zero-inflated.
The Poisson distribution is usually the starting point when you work with count data. It assumes that the mean value equals the variance. Variance tells you how widely the values are spread around the mean.
Let's say you own two grocery stores, and you start counting how many customers come in every hour. You collect information for 10,000 hours from both of them, and you decide to assume the data follows the Poisson distribution. The first store gets 20 customers per hour on average, while the second one gets 40.
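To make the "mean equals variance" property concrete, here is a minimal sketch in plain Python. The helper names `poisson_pmf` and `mean_and_variance` are my own, the two means are just illustrative, and the cutoff of 400 is an assumption that is simply large enough for the tail to be negligible at these means.

```python
import math

def poisson_pmf(k, lam):
    """Probability of seeing exactly k events when the mean is lam."""
    # Computed in log space so the factorial never overflows.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def mean_and_variance(lam, upper=400):
    """Mean and variance of Poisson(lam), summed over the counts 0..upper."""
    probs = [poisson_pmf(k, lam) for k in range(upper + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, variance

for lam in (20, 40):  # illustrative means
    mean, variance = mean_and_variance(lam)
    print(f"lam={lam}: mean={mean:.3f}, variance={variance:.3f}")
```

Both printed numbers should match the single parameter `lam`, which is exactly the assumption the Poisson distribution makes.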
This is your model:
Overall, you can see that the Poisson distribution is a bit longer on the right side. Its right tail (to the right of the blue line that marks the mean) stretches further from the mean than its left tail does. So if you fold the distribution in the middle, the right side will be longer.
Compare the x-axes of the two plots to see what happens when the mean changes. As the mean doubles from 20 to 40, the largest value grows from around 40 to almost 80. This is because the Poisson distribution uses a single parameter to represent both the typical value and the variability in your data, so a larger mean automatically brings a wider spread.
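If you want to recreate the two-store example yourself, the simulation could be sketched like this. It uses only the standard library; `sample_poisson` is a hypothetical helper built on Knuth's multiplication method (one of several ways to draw Poisson counts), and the seed is arbitrary.

```python
import math
import random
import statistics

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    # Keep multiplying uniform draws until the product falls below e^(-lam).
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

rng = random.Random(42)  # fixed seed so the run is repeatable
hours = 10_000
for store, lam in (("first store", 20), ("second store", 40)):
    counts = [sample_poisson(lam, rng) for _ in range(hours)]
    print(store,
          "mean:", round(statistics.mean(counts), 2),
          "variance:", round(statistics.variance(counts), 2),
          "max:", max(counts))
```

For each store, the sample mean and sample variance should land close to each other, and the maximum should grow as the mean grows.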
However, this data distribution is often too simplistic. You might require more parameters to describe what is happening with your customers.
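When the Poisson assumption is good enough, the regression itself can be sketched in a few lines. This is only an illustration under my own assumptions: one predictor, a log link (so the expected count is `exp(b0 + b1 * x)`), Newton's method for the fitting, and invented numbers for the photographer example from earlier. In practice you would use a statistics library rather than a hand-written fit.

```python
import math

def fit_poisson_regression(x, y, iterations=25):
    """Fit expected_count = exp(b0 + b1 * x) by Newton's method."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the overall mean
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Entries of the 2x2 information matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system by hand.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: hours of sunlight and squirrels counted that day.
sunlight = [1, 2, 3, 4, 5, 6, 7, 8]
squirrels = [0, 1, 1, 2, 3, 3, 5, 8]

b0, b1 = fit_poisson_regression(sunlight, squirrels)
predicted = math.exp(b0 + b1 * 9)  # expected count on a day with 9 sunny hours
print(f"b0={b0:.3f}, b1={b1:.3f}, predicted={predicted:.1f}")
```

Because of the log link, the prediction can never be negative, which is one reason a straight line is a poor fit for counts.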
Count data regression with the negative binomial distribution is an excellent option if the variance in your data is higher than the mean. Statisticians call this overdispersion, and it often happens with medical or public health data. In this case, you need another parameter to capture the dispersion of the data.
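A quick way to get a feel for overdispersion is to divide the sample variance by the sample mean. The sketch below does that; `dispersion_ratio` is a hypothetical helper and the weekly counts are invented. Treat a ratio well above 1 as a rough hint, not as a formal test.

```python
import statistics

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; about 1 for Poisson-like data."""
    return statistics.variance(counts) / statistics.mean(counts)

# Hypothetical weekly case counts, invented for illustration only.
weekly_cases = [0, 2, 1, 15, 0, 3, 22, 1, 0, 7, 4, 0]

ratio = dispersion_ratio(weekly_cases)
print(f"variance / mean = {ratio:.2f}")
if ratio > 1:
    print("Variance exceeds the mean: the data look overdispersed.")
```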
Another way of thinking about this second parameter is the number of successes. Going back to the store owner example, you could include the number of customers that buy something. We were talking about grocery shopping. To make this simulation more realistic, let's count all the customers who spend over $100. How will the data distribution change if we see 1 vs 10 such customers?
Considering the first shop that serviced 20 customers in an hour on average, these are our models:
Both plots show the example with a mean of 20, but the dispersion (size) parameter is different. When we assume only 1 customer will spend over $100 in an hour, we see huge differences in how much customers spend. This model suggests that most customers spend less than $100, and fewer spend more than that. The second model, which assumes 10 customers spend over $100, shows more balanced customer spending.
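To see how the size parameter changes the spread while the mean stays at 20, you can compute the distribution directly. This sketch assumes the common mean-and-size parametrization, where the variance equals mean + mean² / size; the helper names and the cutoff of 3000 are my own choices.

```python
import math

def negbin_pmf(k, mean, size):
    """Negative binomial probability of count k, given a mean and a size."""
    p = size / (size + mean)  # success probability implied by mean and size
    log_prob = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
                + size * math.log(p) + k * math.log(1 - p))
    return math.exp(log_prob)

def negbin_mean_and_variance(mean, size, upper=3000):
    """Mean and variance summed numerically over the counts 0..upper."""
    probs = [negbin_pmf(k, mean, size) for k in range(upper + 1)]
    m = sum(k * p for k, p in enumerate(probs))
    v = sum((k - m) ** 2 * p for k, p in enumerate(probs))
    return m, v

for size in (1, 10):
    m, v = negbin_mean_and_variance(20, size)
    formula = 20 + 20 ** 2 / size  # mean + mean^2 / size
    print(f"size={size}: mean={m:.2f}, variance={v:.2f}, formula={formula:.2f}")
```

The smaller the size, the larger the variance relative to the mean; as the size grows, the distribution moves closer to a Poisson with the same mean.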
Can we focus on specific items?
Zero-inflated distributions are combinations of either Poisson or negative binomial with a peak at the value zero. The zero is inflated, hence the name.
This is a handy option in many industries. Think about insurance claims. You pay a monthly insurance fee, but most of the time you won't make a claim. For the insurance company, every month you don't claim anything will be recorded as zero. In fact, most people won't claim their insurance. The non-zero values from those who do claim are described by the Poisson or negative binomial part of the model.
Similarly, this data distribution would be great for understanding spending on expensive items. Where I live, you can buy a TV or a PlayStation in huge grocery stores. Most people will just buy food, and that's it. Selling something bigger is rare, so most people spend zero on technology there.
The zero-inflated model captures this as the probability that the zero value will occur. The parameter is illustrated below with the Poisson distribution.
The probability of the zero value can also be understood through its complement: "how much of the original distribution we want to keep". We can see that the mean, highlighted in blue, is not 20 anymore. Additionally, the final mean is different between the two plots. In the second plot, we assume more zero values, so the mean is closer to zero than in the first plot.
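The shift in the mean can be checked with a small calculation. In this sketch the zero-inflated Poisson mixes an extra spike at zero (probability `zero_prob`) with an ordinary Poisson with mean 20. The two `zero_prob` values are assumptions chosen for illustration, not necessarily the ones behind the plots.

```python
import math

def zip_pmf(k, lam, zero_prob):
    """Zero-inflated Poisson: extra zeros occur with probability zero_prob."""
    poisson = math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
    if k == 0:
        # Zeros come from the spike or from the Poisson part itself.
        return zero_prob + (1 - zero_prob) * poisson
    return (1 - zero_prob) * poisson

def zip_mean(lam, zero_prob, upper=400):
    """Mean of the zero-inflated Poisson, summed over the counts 0..upper."""
    return sum(k * zip_pmf(k, lam, zero_prob) for k in range(upper + 1))

for zero_prob in (0.2, 0.5):  # assumed values, for illustration only
    print(f"zero_prob={zero_prob}: "
          f"P(0)={zip_pmf(0, 20, zero_prob):.3f}, "
          f"mean={zip_mean(20, zero_prob):.2f}")
```

The mean works out to (1 - zero_prob) × 20, so the more zeros you assume, the closer the mean moves to zero.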
There are many other options for count data regression. It's always best to visualize your data with a histogram and compare the shape with these essential distributions. If it doesn't seem to fit, try to search for other options specific to your context, e.g. epidemiology, insurance, engineering, etc.
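If you don't have a plotting library at hand, even a text histogram shows the shape. A minimal sketch, with a hypothetical `text_histogram` helper and invented squirrel counts:

```python
from collections import Counter

def text_histogram(counts, width=40):
    """Return one line per count value, with a bar scaled to its frequency."""
    freq = Counter(counts)
    tallest = max(freq.values())
    lines = []
    for value in range(min(freq), max(freq) + 1):
        bar = "#" * round(width * freq[value] / tallest)
        lines.append(f"{value:>3} | {bar} {freq[value]}")
    return lines

# Hypothetical daily squirrel counts, invented for illustration.
daily_squirrels = [0, 0, 1, 0, 2, 3, 0, 1, 5, 0, 2, 1, 0, 0, 3]
print("\n".join(text_histogram(daily_squirrels)))
```

A tall bar at zero followed by a long right tail would point you towards the zero-inflated or negative binomial options.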