Why? Why can't we just leave things the way they are? Why make it difficult? All these questions were in my mind when I was heading into this topic, but as we know, the moment there are loads of data, randomness appears, and she brings her sister probability along with her. And when there is probability, we can't leave the Gaussian alone. All these words sound heavy, so I will make sure to add some magic and make it easy for you. Let's dive in together.

Probability and linear regression

Do you remember the equations from linear regression, in particular the hypothesis h(x) = theta^T * x? If not, please read the article on Supervised Learning.

In probability theory, the most common continuous probability distribution is the Normal Distribution, also known as the Gaussian Distribution. It is sometimes informally called the bell curve, as shown in the image below:

The probability density of the normal (Gaussian) distribution is given by:

p(x) = (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))

where 'mu' is the mean, 'sigma' is the standard deviation and 'sigma^2' is the variance.

Now that we have the probability density equation, assume each target value is the linear prediction plus Gaussian noise, y(i) = theta^T * x(i) + epsilon(i) with epsilon(i) ~ N(0, sigma^2). Then the probability distribution of the predicted value 'Y' given 'X' and the parameter 'theta' is:

p(y(i) | x(i); theta) = (1 / (sigma * sqrt(2 * pi))) * exp(-(y(i) - theta^T * x(i))^2 / (2 * sigma^2))

Still, the question remains: why always the Gaussian? Why not something else? The answer lies in the Central Limit Theorem. It tells us that when you take a bunch of random numbers from almost any distribution and add them together, you get something which is approximately normally distributed, and the more numbers you add, the more normally distributed the sum gets. In a typical machine learning problem, the data carries errors from many different sources (e.g. measurement error, data entry error, classification error, corruption…), and it is not unreasonable to think that the combined effect of all of these errors is approximately normal.

When we wish to explicitly view this probability as a function of theta, we instead call it the likelihood function:

L(theta) = p(Y | X; theta)

Taken over 'm' training examples (assumed independent), the likelihood function becomes:

L(theta) = product over i = 1, …, m of p(y(i) | x(i); theta)

Likelihood and probability are the same quantity; we say likelihood when we view the equation above as a function of theta while keeping 'X' and 'Y' fixed.

The principle of maximum likelihood says: choose theta so as to maximize the likelihood, i.e. choose the parameters that make the data as probable as possible. Instead of maximizing L(theta), we can also maximize any strictly increasing function of L(theta). In particular, the derivations are a bit simpler if we instead maximize the log likelihood, which we call ℓ(theta):

ℓ(theta) = log L(theta) = m * log(1 / (sigma * sqrt(2 * pi))) - (1 / sigma^2) * (1/2) * sum over i = 1, …, m of (y(i) - theta^T * x(i))^2

Since the first term does not depend on theta, maximizing ℓ(theta) gives the same answer as minimizing

J(theta) = (1/2) * sum over i = 1, …, m of (y(i) - theta^T * x(i))^2,

which is exactly the least-squares cost from linear regression. This probabilistic interpretation will be used again in Logistic Regression.

If you find any inconsistency in my post, please feel free to point it out in the comments. Thanks for reading. If you want to connect with me, please feel free to do so on LinkedIn.
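P.S. If you want to see the Central Limit Theorem claim in action, here is a minimal sketch in Python with NumPy (the article itself does not use any code, so the snippet and its numbers are purely illustrative): it sums many uniform random numbers and compares the histogram of those sums with a Gaussian that has the same mean and standard deviation.

```python
# Illustration of the Central Limit Theorem: sums of many draws from a
# decidedly non-Gaussian distribution (uniform) start to look like a bell curve.
import numpy as np

rng = np.random.default_rng(0)

# 10,000 "errors", each the sum of 50 independent uniform random numbers.
sums = rng.uniform(-1.0, 1.0, size=(10_000, 50)).sum(axis=1)

# A Gaussian is fully described by its mean and standard deviation,
# so compare the empirical histogram with that Gaussian's density.
mu, sigma = sums.mean(), sums.std()
hist, edges = np.histogram(sums, bins=30, density=True)
centers = (edges[:-1] + edges[1:]) / 2
gaussian = np.exp(-(centers - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# The two curves agree closely, even though each individual term was uniform.
print(np.round(np.abs(hist - gaussian).max(), 3))
```

Swapping the uniform draws for almost any other distribution with finite variance gives the same bell-shaped result.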
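And here is a similar sketch of the main point of the derivation: on a toy one-dimensional dataset (made up for illustration, with an assumed noise variance of 0.25), the theta that maximizes the Gaussian log likelihood ℓ(theta) is the same theta that minimizes the least-squares cost J(theta).

```python
# Sanity check that maximizing the Gaussian log likelihood l(theta)
# picks the same theta as minimizing the least-squares cost J(theta).
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D dataset: y = 2*x + Gaussian noise, so the true theta is 2.
x = rng.uniform(0, 5, size=100)
y = 2.0 * x + rng.normal(0, 0.5, size=100)

sigma2 = 0.25                      # assumed noise variance sigma^2
thetas = np.linspace(0, 4, 401)    # grid of candidate values of theta

# J(theta) = (1/2) * sum_i (y_i - theta * x_i)^2, computed for every candidate theta
residuals = y[None, :] - thetas[:, None] * x[None, :]
J = 0.5 * (residuals ** 2).sum(axis=1)

# l(theta) = sum_i log N(y_i | theta * x_i, sigma^2)
#          = -(m/2) * log(2*pi*sigma^2) - J(theta) / sigma^2
ll = -len(x) / 2 * np.log(2 * np.pi * sigma2) - J / sigma2

# The maximizer of l and the minimizer of J are the same theta (close to 2).
print(thetas[ll.argmax()], thetas[J.argmin()])
```

Notice that sigma^2 only shifts and scales the log likelihood; whatever its value, the best theta does not change, which is why least squares never needs to know the noise variance.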