Unveiling the Hidden Gems: A Journey into Exploratory Data Analysis (EDA)

Master the Art of Extracting Insights and Uncovering Patterns in Your Data to Make Informed Decisions

Introduction

Exploratory Data Analysis (EDA) is an essential step in data analysis that lets developers understand the structure and relationships of the data they are working with. EDA analyzes and summarizes datasets to discover anomalies, patterns, or correlations. It also helps in forming hypotheses about the underlying processes that produce the data.

A well-executed EDA process can significantly affect a project’s overall success, because the insights it provides feed directly into the later analysis and modeling steps. In this article, I will examine strategies and recommended practices for performing EDA. Data cleaning, visualization, descriptive statistics, and hypothesis testing will all be discussed.

Code examples were generated with AI help.

Data Cleaning

Data cleaning is the process of detecting and removing errors, inconsistencies, and anomalies in data that could compromise the accuracy of later analysis and modeling.

Missing Values

One of the most common issues with real-world data is missing values. To deal with them, first determine why they are missing. If the data are missing at random, impute them using methods such as mean imputation, median imputation, or regression imputation. If the data are not missing at random, the imputation results can be biased and misleading.
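
As a minimal sketch of the two simplest options named above, mean and median imputation can be done with plain pandas. The toy values and the variable names (values, mean_filled, median_filled) are illustrative assumptions, not data from this article; the extreme value is included only to show how differently the two fills behave.

```python
import numpy as np
import pandas as pd

# Toy column with one missing entry and one extreme value (illustrative data)
values = pd.Series([1.0, 2.0, np.nan, 4.0, 100.0])

# mean() and median() skip NaN, so both are computed over the observed values only
mean_filled = values.fillna(values.mean())
median_filled = values.fillna(values.median())

print(mean_filled[2])    # 26.75, pulled upward by the extreme value
print(median_filled[2])  # 3.0, unaffected by the extreme value
```

The median is usually the safer default when a column contains outliers, while the mean is reasonable for roughly symmetric data.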

In the following example, I will use the IterativeImputer class from the scikit-learn library to impute missing values in a dataset:

    import numpy as np
    import pandas as pd
    # IterativeImputer is experimental and must be enabled explicitly
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    # Generate data with missing values
    np.random.seed(0)
    X_true = np.arange(10).reshape(-1, 1) + np.random.normal(scale=0.1, size=(10, 1))
    X = X_true.copy()
    mask = np.random.choice([True, False], size=X.shape, p=[0.3, 0.7])
    X[mask] = np.nan

    # Impute missing values
    imp = IterativeImputer(max_iter=10, random_state=0)
    X_imp = imp.fit_transform(X)

    # Compare imputed values with true values
    df = pd.DataFrame({'True': X_true.flatten(), 'Imputed': X_imp.flatten()})
    print(df)

In this example, I create data with missing values by randomly setting roughly 30% of the values to np.nan. I then use the IterativeImputer to impute the missing values iteratively, using the mean and variance of the remaining data. Because this toy dataset has a single column, the imputer has no other features to model from, so in practice it reduces to filling with the column mean. The imputed values are then compared to the true ones to check the quality of the imputation.

In the next example, the values of a highly correlated variable are used to fill in the missing values:

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Generate data with missing values
    np.random.seed(0)
    data = np.arange(20).reshape(-1, 2) + np.random.normal(scale=0.1, size=(10, 2))
    data[1, 1] = np.nan
    data[2, 0] = np.nan

    # Convert data to pandas dataframe
    df = pd.DataFrame(data, columns=["col1", "col2"])

    # Visualize correlation between columns
    sns.heatmap(df.corr(), annot=True)
    plt.show()

    # Fill missing values in col1 with values from col2
    df["col1"] = df["col1"].fillna(df["col2"])

    # Fill remaining missing values in col2 with values from col1
    df["col2"] = df["col2"].fillna(df["col1"])

I first visualized the correlation between the columns with a heatmap from the Seaborn library. The heatmap shows a strong association between the columns, with a correlation coefficient close to one.
This lets me fill in the missing values in col1 using values from col2, and the remaining ones in col2 using values from col1.

Outliers

Outliers are extreme values in data that can significantly affect statistical analysis and modeling results. Detecting and addressing them is therefore necessary to obtain accurate results. Here, I will demonstrate how to use the Z-score method and visualization tools to identify and remove outliers from the data:

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Generate data with outliers: shift the first and last five values by 3 sigma
    def generate_outliers(n, mu, sigma):
        x = np.random.normal(mu, sigma, n)
        x[:5] = x[:5] - 3 * sigma
        x[-5:] = x[-5:] + 3 * sigma
        return x

    np.random.seed(0)
    X = generate_outliers(100, 1, 3)

    # Detect outliers using Z-score method
    mean = np.mean(X)
    std = np.std(X)
    z_scores = (X - mean) / std
    outliers = np.where(np.abs(z_scores) > 3)[0]  # indices of the outliers

    # Remove outliers from data
    X_clean = np.delete(X, outliers)

    # Visualize data with outliers and without outliers
    sns.boxplot(x=X)
    plt.show()
    sns.boxplot(x=X_clean)
    plt.show()

The generated data contains a handful of values that are much higher or lower than the rest. The Z-score method identifies outliers by calculating a Z-score for each value in the dataset and marking values whose absolute Z-score is greater than 3 as outliers. I then use the np.delete method to remove the flagged values from the dataset. Finally, I compare the dataset with and without outliers using box plots, which show the effect of the removal.

Data visualization

Data visualization aids in understanding the underlying data structure. There are many libraries for visualization, like Matplotlib, Seaborn, Plotly, and Bokeh. I will focus on the first three as the most popular.

Matplotlib is a Python data visualization library that includes a variety of plots, such as line plots, scatter plots, bar plots, histograms, and more.
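
To make the plot types named above concrete, here is a small hedged sketch that draws a histogram, a scatter plot, and a bar plot side by side. The random data and the variable names (sample, counts, ax_hist, ax_scatter, ax_bar) are assumptions chosen only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative random data (not taken from the article's other examples)
np.random.seed(0)
sample = np.random.normal(size=100)
counts = np.random.randint(low=1, high=10, size=5)

# One row with three axes, one per plot type
fig, (ax_hist, ax_scatter, ax_bar) = plt.subplots(1, 3, figsize=(12, 3))

ax_hist.hist(sample, bins=15)
ax_hist.set_title("Histogram")

ax_scatter.scatter(np.arange(sample.size), sample, s=10)
ax_scatter.set_title("Scatter plot")

ax_bar.bar(np.arange(counts.size), counts)
ax_bar.set_title("Bar plot")

fig.tight_layout()
```

Calling plt.show() afterwards displays the figure, and fig.savefig("plots.png") would write it to a file instead.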
The following is an example of a Matplotlib line plot:

    import numpy as np
    import matplotlib.pyplot as plt

    np.random.seed(0)
    data = np.random.normal(size=100)

    plt.plot(data)
    plt.title("Line Plot using Matplotlib")
    plt.xlabel("Index")
    plt.ylabel("Value")
    plt.show()

Seaborn is a Python data visualization toolkit built on Matplotlib that provides a high-level interface for constructing various plots. The following is an example of the Seaborn scatter plot:

    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    np.random.seed(0)
    data1 = np.random.normal(loc=0, scale=1, size=100)
    data2 = np.random.normal(loc=2, scale=1, size=100)

    # Recent Seaborn versions require x and y to be passed as keywords
    sns.scatterplot(x=data1, y=data2)
    plt.title("Scatter Plot using Seaborn")
    plt.xlabel("Data 1")
    plt.ylabel("Data 2")
    plt.show()

Plotly is a powerful Python data visualization package that generates interactive and visually appealing graphs. With it, I will make several plots, including a scatter plot, line plot, bar plot, histogram, and box plot.

    import numpy as np
    import plotly.express as px
    import plotly.graph_objs as go

    np.random.seed(0)
    data1 = np.random.normal(loc=0, scale=1, size=100)
    data2 = np.random.normal(loc=2, scale=1, size=100)
    data3 = np.random.randint(low=1, high=10, size=100)

    # Scatter Plot
    fig = px.scatter(x=data1, y=data2)
    fig.update_layout(title="Scatter Plot using Plotly")
    fig.show()

    # Line Plot
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=np.arange(100), y=data1, name="Data 1"))
    fig.add_trace(go.Scatter(x=np.arange(100), y=data2, name="Data 2"))
    fig.update_layout(title="Line Plot using Plotly")
    fig.show()

    # Bar Plot
    fig = px.bar(x=np.arange(100), y=data3)
    fig.update_layout(title="Bar Plot using Plotly")
    fig.show()

    # Histogram
    fig = px.histogram(x=data1)
    fig.update_layout(title="Histogram using Plotly")
    fig.show()

    # Box Plot
    fig = px.box(y=data2)
    fig.update_layout(title="Box Plot using Plotly")
    fig.show()

I first produce three sets of random data and then create several plots using the px.scatter, go.Figure, px.bar, px.histogram, and px.box functions. One advantage of Plotly is the interactivity of its plots, which allows zooming, panning, and hovering over points to display extra information. Plotly also supports exporting plots in various formats, including HTML, SVG, and PNG, which makes sharing and presenting the visualizations simple.

Let’s create a 3D plot for gradient descent to show the power of visualization. Gradient descent is a first-order optimization algorithm commonly used in machine learning to minimize a loss function. It iteratively updates the parameters in the direction of the negative gradient of the loss function with respect to those parameters, which is the direction of steepest decrease.

    import numpy as np
    import plotly.graph_objects as go

    # Define the quadratic function and its gradient
    def quadratic(x, y):
        return x**2 + y**2

    def grad_quadratic(x, y):
        grad_x = 2 * x
        grad_y = 2 * y
        return grad_x, grad_y

    # Visualize the surface of the quadratic function
    x = np.linspace(-2, 2, 30)
    y = np.linspace(-2, 2, 30)
    X, Y = np.meshgrid(x, y)
    Z = quadratic(X, Y)

    fig = go.Figure(data=[go.Surface(z=Z, x=X, y=Y)])
    fig.show()

    # Optimize the quadratic function using gradient descent
    def gradient_descent(x0, y0, n_iters, lr):
        path = np.zeros((n_iters + 1, 2))
        path[0, :] = [x0, y0]
        for i in range(n_iters):
            grad_x, grad_y = grad_quadratic(path[i, 0], path[i, 1])
            path[i + 1, 0] = path[i, 0] - lr * grad_x
            path[i + 1, 1] = path[i, 1] - lr * grad_y
        return path

    path = gradient_descent(x0=-1.5, y0=-1.5, n_iters=100, lr=0.01)

    # Visualize the optimization process in 3D
    fig = go.Figure(
        data=[go.Scatter3d(
            x=path[:, 0],
            y=path[:, 1],
            z=quadratic(path[:, 0], path[:, 1]),
            mode='markers',
            marker=dict(
                size=3,
                color=quadratic(path[:, 0], path[:, 1]),
                colorscale='Viridis',
                opacity=0.8
            )
        )]
    )
    fig.update_layout(
        scene=dict(xaxis_title='X', yaxis_title='Y', zaxis_title='Z'))
    fig.show()

The code sample implements
gradient descent optimization with 3D visualization using the plotly.graph_objects library. The code begins by defining a function, quadratic, and its gradient, grad_quadratic, which are used in the optimization process. The plotly.graph_objects library is then used to display the surface of the quadratic function, where the x and y values are defined using np.linspace, and the z values are computed by evaluating the quadratic function over a grid of x and y values. A go.Surface instance is created with the x, y, and z values and added to a go.Figure instance. The show method is used to display the Figure instance.

Next, the gradient descent optimization is performed using the gradient_descent function, which takes the starting point x0 and y0, the number of iterations n_iters, and the learning rate lr. The result is stored in the path variable, which contains the sequence of points visited during optimization.

Finally, the optimization process is visualized in 3D using the plotly.graph_objects library. A go.Scatter3d instance is created with the x, y, and z values, which are the optimization path and the corresponding function values. The size and color of the markers are defined in the marker parameter. The go.Scatter3d instance is then added to a go.Figure instance, and the scene’s layout is updated with appropriate axis labels. The final visualization is displayed using the show method of the go.Figure instance.

This code provides a clear visual representation of the gradient descent optimization process in 3D and can be a valuable tool for understanding and debugging optimization algorithms.

Descriptive statistics

EDA relies heavily on descriptive statistics. They summarize a dataset’s main characteristics, such as central tendency, dispersion, and shape. I will review the fundamentals of descriptive statistics using examples from the top Python libraries.

Pandas is one of the most used Python packages for descriptive statistics.
It has many methods for quickly calculating descriptive statistics on a dataset, such as the mean, median, mode, standard deviation, and others. Here’s an example of the mean and standard deviation:

    import pandas as pd

    data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    df = pd.DataFrame(data, columns=['Values'])

    mean = df['Values'].mean()
    std = df['Values'].std()  # sample standard deviation (ddof=1)

    print('Mean:', mean)
    print('Standard Deviation:', std)

Another popular descriptive statistics library is SciPy, which includes functions for calculating various statistical tests and measurements. SciPy, for example, includes functions for determining the skewness and kurtosis of a dataset, which provide information on the distribution’s shape:

    from scipy.stats import skew, kurtosis

    skewness = skew(data)
    kurt = kurtosis(data)

    print('Skewness:', skewness)
    print('Kurtosis:', kurt)

However, more advanced statistical approaches are often required, and a lesser-known library might be helpful in these circumstances. Statsmodels is one such package that offers a complete set of functions for statistical modeling, hypothesis testing, and data exploration. Here’s an example of how to use Statsmodels to inspect a dataset’s distribution with a Q-Q plot:

    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm

    # ProbPlot expects an array; line="s" adds a standardized reference line
    fig = sm.ProbPlot(np.asarray(data, dtype=float)).qqplot(line="s")
    plt.show()

The Q-Q plot compares the sample quantiles with those of a normal distribution: the more the points bend away from a straight line, the more the data deviates from normality. The data here is the evenly spaced list from the earlier example rather than a normal sample, so some deviation in the tails is to be expected.

Hypothesis testing

Hypothesis testing follows descriptive statistics. It is a statistical strategy for determining whether an observed result is statistically significant or could plausibly be due to chance. It is an essential component of EDA and is used to draw conclusions about population parameters based on sample data.

A null hypothesis and an alternative hypothesis are established during hypothesis testing. The null hypothesis is the default assumption that no difference exists between the population parameters under consideration.
The alternative hypothesis, which states that there is a difference between the population parameters, is the opposite of the null hypothesis.

A test statistic is calculated from the sample data to test the hypothesis. A p-value is then derived, representing the probability of observing a result as extreme as or more extreme than the one observed if the null hypothesis is true. If the p-value is less than a chosen threshold, such as 0.05 (a commonly used value), the null hypothesis is rejected in favor of the alternative hypothesis.

Python provides a wide range of hypothesis testing tools, including but not limited to:

- scipy.stats: This module contains a variety of statistical functions, such as hypothesis testing functions for t-tests, ANOVA, and others.
- statsmodels: This library includes a set of hypothesis testing routines, along with regression and time series analysis.
- PyMC3: This probabilistic programming framework supports Bayesian hypothesis testing and Markov Chain Monte Carlo (MCMC) simulation.

Let’s test the hypothesis that the mean cholesterol level of patients who received a new medicine is the same as that of patients who did not receive the treatment. To test the hypothesis, I will use a two-sample t-test. The t-statistic expresses the difference between the two sample means in units of standard error. A t-statistic that is large in absolute value points to a significant difference in means, whereas one close to zero suggests little or no difference.

    import numpy as np
    import scipy.stats as stats

    # Generating the data
    np.random.seed(42)
    drug_treated = np.random.normal(loc=195, scale=25, size=100)
    no_drug = np.random.normal(loc=190, scale=30, size=100)

    # Performing two-sample t-test (assumes equal variances by default)
    t_statistic, p_value = stats.ttest_ind(drug_treated, no_drug)

    # Checking the results
    if p_value < 0.05:
        print("Reject null hypothesis. Mean cholesterol levels are not equal.")
    else:
        print("Fail to reject null hypothesis. No significant difference in mean cholesterol levels.")

The p-value is the probability of observing a t-statistic as extreme as or more extreme than the one calculated, assuming the null hypothesis (the means are equal) is true. A low p-value implies that the observed difference between the means is statistically significant. So, if the p-value is less than 0.05, the null hypothesis is rejected, and I can conclude that the mean cholesterol levels of patients treated with the new medicine are not equal to those of patients who did not get the treatment. If the p-value exceeds 0.05, the null hypothesis is not rejected. This means the data do not provide enough evidence that the means differ. It does not prove that they are identical.

Example

Let’s try to do EDA on a healthcare dataset in this example. The dataset will include patient information such as age, height, weight, and blood pressure readings. To do EDA, I will employ all four steps: data generation, data cleaning, descriptive statistics, and visualization.

Let’s start by generating our data:

    import numpy as np

    np.random.seed(0)
    age = np.random.normal(loc=30, scale=10, size=1000)
    height = np.random.normal(loc=180, scale=15, size=1000)
    weight = np.random.normal(loc=80, scale=20, size=1000)
    blood_pressure = np.random.normal(loc=120, scale=10, size=1000)

After I create the data, I will clean it up by looking for missing values or outliers.
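
One way to make that check concrete is sketched below. It rebuilds the same synthetic data with the same seed and parameters so the snippet runs on its own, counts missing values per column, and counts values whose absolute Z-score exceeds 3. The names patients, n_missing, and n_outliers are illustrative assumptions, and the threshold of 3 is the same convention used in the outlier section above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Recreate the synthetic patient data with the same seed and parameters
np.random.seed(0)
patients = pd.DataFrame({
    "Age": np.random.normal(loc=30, scale=10, size=1000),
    "Height": np.random.normal(loc=180, scale=15, size=1000),
    "Weight": np.random.normal(loc=80, scale=20, size=1000),
    "Blood Pressure": np.random.normal(loc=120, scale=10, size=1000),
})

# Count missing values per column
n_missing = patients.isna().sum()
print(n_missing)

# Count values per column whose absolute Z-score exceeds 3
z = np.abs(stats.zscore(patients.to_numpy(), axis=0))
n_outliers = pd.Series((z > 3).sum(axis=0), index=patients.columns)
print(n_outliers)
```

With normally distributed data of this size, a few values per column beyond three standard deviations are expected by chance, so a small nonzero count is not necessarily a data quality problem.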
The data doesn’t have any missing values in this case, so I move on to looking for outliers by computing Z-scores for each variable:

    from scipy import stats

    z_scores_age = stats.zscore(age)
    z_scores_height = stats.zscore(height)
    z_scores_weight = stats.zscore(weight)
    z_scores_blood_pressure = stats.zscore(blood_pressure)

Following that, I will use the Pandas package to generate a DataFrame from the data and compute some descriptive statistics:

    import pandas as pd

    df = pd.DataFrame(
        {
            'Age': age,
            'Height': height,
            'Weight': weight,
            'Blood Pressure': blood_pressure
        }
    )
    print(df.describe())

For each column in the DataFrame, the describe function will provide the following summary statistics: count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum.

Next, I will use SciPy to run a one-sample t-test to see if the patients’ mean age differs significantly from 30:

    t_statistic, p_value = stats.ttest_1samp(df['Age'], 30)

    print('t-statistic:', t_statistic)
    print('p-value:', p_value)

So, if the p-value is less than 0.05, I can reject the null hypothesis and declare that the mean age of our patients is significantly different from 30.

Finally, I will use statsmodels to perform linear regression on the data:

    import statsmodels.api as sm

    # Add an intercept term; sm.OLS does not include one by default
    X = sm.add_constant(df[['Age', 'Height', 'Weight']])
    y = df['Blood Pressure']

    model = sm.OLS(y, X).fit()
    predictions = model.predict(X)

    print(model.summary())

The code above imports statsmodels.api and uses the module to fit an ordinary least squares (OLS) regression model. OLS regression is a form of linear regression used to model the relationship between one or more independent variables and a dependent variable. In this example, the dependent variable is blood pressure, and the independent variables are age, height, and weight.

By passing in the dependent variable (y) and the independent variables (X), the sm.OLS function generates a model object. The model object’s fit method is then called to fit the OLS regression to the data. The predict method then provides predictions based on the fitted model and the independent variables.

Finally, the model.summary function is invoked to summarize the model’s statistical results, including the coefficients of the independent variables, the goodness-of-fit metrics, and the statistical significance of each independent variable in predicting the dependent variable. The summary helps the data analyst evaluate the fit and performance of the OLS regression model.

Conclusion

EDA is a very important step in the data science process, enabling the discovery of patterns, relationships, and anomalies in data. It helps gain insights into the data, understand the distribution and correlations of variables, and discover potential areas for further investigation.

However, EDA has several drawbacks that I have to mention. One of the most significant pitfalls is its subjectivity, with outcomes influenced by the analyst’s background and personal biases. Furthermore, EDA may not provide definitive answers but rather a preliminary interpretation of the data. Another aspect to consider is that EDA can be time-consuming and labor-intensive, especially when you work with large and complex datasets. Additionally, EDA may not be suitable for all types of data, such as high-dimensional data.