"Anscombe's quartet comprises four data sets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed." (Wikipedia)

Anscombe's Quartet can be defined as a group of four data sets which are nearly identical in simple descriptive statistics, but which contain peculiarities that fool a regression model built on them. They have very different distributions and appear different when plotted on scatter plots.

It was constructed in 1973 by statistician Francis Anscombe to illustrate the importance of plotting the graphs before analyzing and model building, and the effect of outliers and other influential observations on statistical properties. The four data sets give nearly the same statistical observations: the mean and variance of the x and y points are almost the same in all four datasets.

This tells us about the importance of visualising the data before applying algorithms to build models from it. The data features should be plotted to see the distribution of the samples, which helps identify anomalies such as outliers, the diversity of the data, its linear separability, etc. Also, linear regression can only be considered a fit for data with linear relationships; it does not capture other kinds of patterns.

Each of the four datasets consists of eleven (x, y) points. The statistical information for all four datasets is approximately the same. When the datasets are plotted on a scatter plot, however, each one produces a different kind of plot, and these peculiarities are what fool a regression algorithm that looks only at the summary statistics.
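To make the "approximately the same" claim concrete, here is a minimal sketch that computes the summary statistics and the least-squares line for each dataset using only the Python standard library. The numbers are the standard published values of Anscombe's quartet; I am assuming they match the table this post originally showed. The names `QUARTET`, `summarize` and `SUMMARY` are made up for this illustration.

```python
from statistics import mean, variance

# Standard published values of Anscombe's quartet.
# Assumption: these match the table originally shown in the post.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by datasets 1 to 3
QUARTET = {
    "Dataset 1": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                         7.24, 4.26, 10.84, 4.82, 5.68]),
    "Dataset 2": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                         6.13, 3.10, 9.13, 7.26, 4.74]),
    "Dataset 3": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                         6.08, 5.39, 8.15, 6.42, 5.73]),
    "Dataset 4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
                  [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                   5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Summary statistics and ordinary least-squares line for one dataset."""
    mx, my = mean(x), mean(y)
    # Sums of squared deviations and cross-deviations around the means.
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "var_x": variance(x),   # sample variance (n - 1 in the denominator)
        "mean_y": my,
        "var_y": variance(y),
        "corr": sxy / (sxx * syy) ** 0.5,   # Pearson correlation
        "slope": slope,
        "intercept": my - slope * mx,
    }


SUMMARY = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}

for name, stats in SUMMARY.items():
    print(name, " ".join(f"{key}={value:.3f}" for key, value in stats.items()))
```

Every row should come out with a mean of x of 9 and a variance of x of 11, and with a mean of y, correlation, slope and intercept that agree across the datasets to about two decimal places. Judged on these numbers alone, the four datasets look interchangeable, which is exactly why the scatter plots are needed.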
Dataset 1: the linear regression model fits this data pretty well.
Dataset 2: the linear regression model could not fit this data well, as the data is non-linear.
Dataset 3: the points lie close to a straight line except for one outlier, which the linear regression model cannot handle.
Dataset 4: every point but one shares the same x value, and that single outlier, which the linear regression model cannot handle, determines the fitted line.

Conclusion

We have described the four datasets that were intentionally created to show the importance of data visualisation and how a regression algorithm can be fooled without it. Hence, the important features in a dataset should be visualised before implementing any machine learning algorithm on them, which helps in building a well-fitting model.

Thanks for reading. You can find my other Machine Learning related posts here. I hope this post has been useful. I appreciate feedback and constructive criticism. If you want to talk about this article or other related topics, you can drop me a text here or at LinkedIn.

Previously published under a paywall.