The “Maybe Just a Quick One” series title is inspired by my most common reply to “Fancy a drink?”, which may or may not end in a long night. Likewise, these posts are intended to be short, but I sometimes get carried away, so apologies in advance.
So we know what data is and we know what analysis is. But what is the meaning of “exploratory” in a data science context? What kind of conclusions are we trying to reach? Well, there are various reasons that make this step a necessity in a data science project's lifecycle, helping us to:
Autoviz is a Python library that can massively speed up the visualization of our data, making it fully automated. Let’s jump straight into coding. I am a firm believer in learning by doing.
First things first. We will create a new Conda environment and install the necessary packages:
conda create -n autoviz python=3.8
conda activate autoviz
python -m pip install autoviz
# load_boston was removed in scikit-learn 1.2, so pin an older release
conda install "scikit-learn<1.2"
The reason I included scikit-learn is to use some of its datasets to demonstrate the use of Autoviz. You can of course download the dataset from other sources and skip this step.
Now, create a new notebook and start by importing the packages. I will use the infamous Boston House Prices dataset first.
from autoviz.AutoViz_Class import AutoViz_Class
from sklearn.datasets import load_boston, load_iris
import pandas as pd

# Load the dataset and wrap the features in a dataframe
boston = load_boston()
df_boston = pd.DataFrame(data=boston.data, columns=boston.feature_names)
# Add the target (house price) as an extra column
df_boston["btarget"] = boston.target
We can now instantiate Autoviz and let it do its magic:
AV = AutoViz_Class()
filename = ""  # empty because the data comes from a dataframe
sep = ","
dft = AV.AutoViz(
    filename,
    sep=sep,
    depVar="btarget",
    dfte=df_boston,
    header=0,
    verbose=2,
    lowess=False,
    chart_format="svg",
    max_rows_analyzed=150000,
    max_cols_analyzed=30,
)
As you might have noticed, there are some arguments passed to AutoViz, but what do they mean? Let's see what the documentation says:
- filename: Make sure that you give filename as an empty string ("") if there is no filename associated with this data and you want to use a dataframe, then use dfte to give the name of the dataframe. Otherwise, fill in the file name and leave dfte as an empty string. Only one of these two is needed to load the data set.
- sep: this is the separator in the file. It can be a comma, semi-colon or tab, or any value that you see in your file that separates each column.
- depVar: the target variable in your dataset. You can leave it as an empty string if you don't have a target variable in your data.
- dfte: this is the input dataframe in case you want to load a pandas dataframe to plot charts. In that case, leave filename as an empty string.
- header: the row number of the header row in your file. If it is the first row, then this must be zero.
- verbose: it has 3 acceptable values: 0, 1 or 2. With 0, you get all charts but limited info. With 1, you get all charts and more info. With 2, you will not see any charts, but they will be quietly generated and saved in your current directory under the AutoViz_Plots directory, which will be created. Make sure you delete this folder periodically; otherwise, you will have lots of charts saved here if you use the verbose=2 option a lot.
- lowess: this option is very nice for small datasets, where you can see regression lines for each pair of continuous variables against the target variable. Don't use this for large data sets (that is, over 100,000 rows).
- chart_format: this can be SVG, PNG or JPG. You will get charts generated and saved in this format if you used the verbose=2 option. Very useful for generating charts and using them later.
- max_rows_analyzed: limits the max number of rows that is used to display charts. If you have a very large data set with millions of rows, then use this option to limit the amount of time it takes to generate charts. We will take a statistically valid sample.
- max_cols_analyzed: limits the number of continuous vars that can be analyzed.
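The rules quoted above can be collected into a small pre-flight check. What follows is only a sketch and not part of AutoViz: `build_autoviz_kwargs` is a hypothetical helper that encodes the constraints from the list (exactly one of `filename` or `dfte`, `verbose` between 0 and 2, a supported `chart_format`) and returns keyword arguments. It assumes the parameter names used in the call shown earlier.

```python
from typing import Any, Dict, Optional

# Values taken from the documentation quoted in the list above.
ALLOWED_VERBOSE = {0, 1, 2}
ALLOWED_FORMATS = {"svg", "png", "jpg"}


def build_autoviz_kwargs(
    filename: str = "",
    dfte: Optional[Any] = None,
    depVar: str = "",
    sep: str = ",",
    header: int = 0,
    verbose: int = 0,
    lowess: bool = False,
    chart_format: str = "svg",
    max_rows_analyzed: int = 150000,
    max_cols_analyzed: int = 30,
) -> Dict[str, Any]:
    """Hypothetical helper: check arguments against the documented rules."""
    has_file = filename != ""
    has_frame = dfte is not None
    # Only one of the two data sources should be given.
    if has_file == has_frame:
        raise ValueError("Pass either a filename or a dataframe (dfte), not both or neither.")
    if verbose not in ALLOWED_VERBOSE:
        raise ValueError("verbose must be 0, 1 or 2.")
    if chart_format.lower() not in ALLOWED_FORMATS:
        raise ValueError("chart_format must be svg, png or jpg.")
    return {
        "filename": filename,
        "sep": sep,
        "depVar": depVar,
        # The quoted docs say to leave dfte as an empty string when a file is used.
        "dfte": dfte if has_frame else "",
        "header": header,
        "verbose": verbose,
        "lowess": lowess,
        "chart_format": chart_format.lower(),
        "max_rows_analyzed": max_rows_analyzed,
        "max_cols_analyzed": max_cols_analyzed,
    }
```

With the dataframe from earlier, this could be used as `AV.AutoViz(**build_autoviz_kwargs(dfte=df_boston, depVar="btarget", verbose=2))`. Failing early with a clear message is a design choice here; AutoViz itself may handle invalid values differently.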
Wait...that was it?
Yes. It is that simple. Using 2 as the verbose level had the charts generated in the AutoViz_Plots folder. Let's take a look at some of them:
and more. You get the gist of it.
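To see everything that was saved without opening the folder by hand, you can list the files with the standard library. This is a minimal sketch: `list_saved_charts` is a hypothetical helper, the folder name `AutoViz_Plots` comes from the documentation quoted above, and the search is recursive because I make no assumption about how AutoViz arranges files inside that folder.

```python
from pathlib import Path
from typing import List

# Chart formats mentioned in the AutoViz documentation quoted above.
CHART_SUFFIXES = {".svg", ".png", ".jpg"}


def list_saved_charts(plot_dir: str = "AutoViz_Plots") -> List[Path]:
    """Return chart files found anywhere under plot_dir, sorted by path."""
    root = Path(plot_dir)
    if not root.is_dir():
        # Nothing has been generated yet (or we are in a different directory).
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in CHART_SUFFIXES
    )


# Print one path per saved chart; prints nothing if the folder is missing.
for chart in list_saved_charts():
    print(chart)
```

The same list is a handy starting point for the periodic cleanup the documentation recommends, since you can review what is there before deleting the folder.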
That is cool. I won't need to do any data visualization myself anymore! (I hear you say)
Not quite. I believe that we need the best of both worlds. Having an automated data visualization tool like Autoviz to quickly generate some graphs for your data is a great first step. It can very quickly give you a good summary of it. However, you might need to dig deeper and create some plots yourself, depending on the task.