The “Maybe Just a Quick One” series title is inspired by my most common reply to “Fancy a drink?”, which may or may not end in a long night. Likewise, these posts are intended to be short, but I get carried away sometimes, so apologies in advance.

What is Exploratory Data Analysis and why is it important?

So we know what data is and we know what analysis is. But what does “exploratory” mean in a data science context? What kind of conclusions are we trying to reach? Well, there are various reasons that make this step a necessity in a data science project's lifecycle. It helps us to:

- Make a good judgement on the quality of our data.
- Identify any missing values, outliers, possible differences in measurement units etc.
- Take a closer look at the characteristics of the variables, like their types, distributions and variance, as well as the correlations between them.
- Summarise the data, providing an “at a glance” way to understand it. This benefits not just the data scientist or the developer but any stakeholder involved in the project.

One of the most common ways to achieve this is by using graphs and plots, in other words, to visualize the data.

What is Autoviz?

Autoviz is a Python library that can massively speed up the visualization of our data by fully automating it. Let’s jump straight into coding. I am a firm believer in learning by doing.

First things first. We will create a new Conda environment and install the necessary packages:

    conda create -n autoviz python=3.8
    conda activate autoviz
    python -m pip install autoviz
    conda install "scikit-learn<1.2"

The reason I included scikit-learn is to use some of its datasets to demonstrate the use of Autoviz. The version is pinned below 1.2 because load_boston, which this post relies on, was removed in that release. You can of course download the dataset from other sources and skip this step.

Now, create a new notebook and start by importing the packages. I will use the infamous Boston House Prices dataset first.

    from autoviz.AutoViz_Class import AutoViz_Class
    # Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release.
    from sklearn.datasets import load_boston, load_iris
    import pandas as pd

    # Load the dataset and turn it into a DataFrame with the target as an extra column.
    boston = load_boston()
    df_boston = pd.DataFrame(data=boston.data, columns=boston.feature_names)
    df_boston["btarget"] = boston.target

We can now instantiate Autoviz and let it do its magic:

    AV = AutoViz_Class()

    filename = ""
    sep = ","
    dft = AV.AutoViz(
        filename,
        sep=",",
        depVar="btarget",
        dfte=df_boston,
        header=0,
        verbose=2,
        lowess=False,
        chart_format="svg",
        max_rows_analyzed=150000,
        max_cols_analyzed=30,
    )

As you might have noticed, there are some arguments passed to AutoViz, but what do they mean? Let's see what the documentation says:

- filename - Make sure that you give filename as an empty string ("") if there is no filename associated with this data and you want to use a dataframe, then use dfte to give the name of the dataframe. Otherwise, fill in the file name and leave dfte as an empty string. Only one of these two is needed to load the data set.
- sep - this is the separator in the file. It can be a comma, semi-colon or tab, or any value that you see in your file that separates each column.
- depVar - target variable in your dataset. You can leave it as an empty string if you don't have a target variable in your data.
- dfte - this is the input dataframe in case you want to load a pandas dataframe to plot charts. In that case, leave filename as an empty string.
- header - the row number of the header row in your file. If it is the first row, then this must be zero.
- verbose - it has 3 acceptable values: 0, 1 or 2. With 0, you get all charts but limited info. With 1, you get all charts and more info. With 2, you will not see any charts, but they will be quietly generated and saved in your current directory under the AutoViz_Plots directory, which will be created. Make sure you delete this folder periodically, otherwise you will have lots of charts saved here if you use the verbose=2 option a lot.
- lowess - this option is very nice for small datasets, where you can see regression lines for each pair of continuous variables against the target variable. Don't use this for large data sets (that is, over 100,000 rows).
- chart_format - this can be SVG, PNG or JPG. You will get charts generated and saved in this format if you used the verbose=2 option. Very useful for generating charts and using them later.
- max_rows_analyzed - limits the max number of rows that is used to display charts. If you have a very large data set with millions of rows, then use this option to limit the amount of time it takes to generate charts. We will take a statistically valid sample.
- max_cols_analyzed - limits the number of continuous vars that can be analyzed.

Wait...that was it?

Yes. It is that simple. Using 2 as the verbose level had the charts generated in the AutoViz_Plots folder. Let's take a look at some of them:

- Violin Plots
- Scatter Plots
- Heatmaps

and more. You get the gist of it.

(I hear you say) That is cool. I won't need to do any data visualisation myself anymore!

Not quite. I believe that we need the best of both worlds. Having an automated data visualization tool like Autoviz to quickly generate some graphs for your data is a great first step. It can very quickly give you a good summary of the data. However, you might need to dig deeper and create some plots yourself, depending on the task.

Further reading: Autoviz home page
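
As a postscript on the "dig deeper yourself" part: the checks listed at the top of the post (missing values, outliers, variable types, correlations) can be sketched with plain pandas. This is only a minimal illustration, not something Autoviz provides. The helper name quick_summary and the toy numbers are my own assumptions, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import pandas as pd

# Toy data, invented purely for illustration (this is not the Boston dataset).
df = pd.DataFrame(
    {
        "rooms": [4.0, 5.0, 6.0, 7.0, None, 6.5],
        "age": [80.0, 60.0, 45.0, 20.0, 30.0, 25.0],
        "price": [15.0, 20.0, 24.0, 33.0, 28.0, 90.0],
    }
)


def quick_summary(frame: pd.DataFrame, target: str) -> dict:
    """Collect a few basic EDA facts about `frame` (hypothetical helper)."""
    numeric = frame.select_dtypes(include="number")

    # Flag values outside 1.5 * IQR of the quartiles, column by column.
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)

    return {
        # Number of missing values per column.
        "missing": frame.isna().sum().to_dict(),
        # Column types as plain strings.
        "dtypes": frame.dtypes.astype(str).to_dict(),
        # Number of flagged outliers per numeric column.
        "outliers": is_outlier.sum().to_dict(),
        # Pearson correlation of each numeric column with the target.
        "corr_with_target": numeric.corr()[target].drop(target).to_dict(),
    }


summary = quick_summary(df, "price")
for name, values in summary.items():
    print(name, values)
```

On a real project you would pass your own DataFrame and target column (for example df_boston and "btarget" from above) and then decide which columns deserve a hand-made plot.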