Exploratory Analysis Part-1

July 19th 2017

@andy12290 Aniket Kale

I have always felt that telling a story with data is an important part of a data scientist's job. To tell that story, we need visualization tools to explore the data. There are plenty of such tools on the market. Today we are going to explore the Pandas and Matplotlib libraries for Python.

Dataset = California Housing dataset

Load the Dataset using Pandas

Before loading the data, we need to import all the libraries that we are going to use in the data exploration.

After loading the housing dataset (housing.csv) with Pandas, we can see the first five records using data.head().
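A minimal sketch of the import and load steps follows. Because housing.csv itself is not reproduced here, the sketch reads a small in-memory stand-in: the six rows are made up for illustration, and the ten column names are an assumption about how the file is laid out.

```python
import io

import pandas as pd

# Hypothetical stand-in for housing.csv: six made-up rows under the
# ten column names assumed for the California Housing dataset.
sample_csv = io.StringIO(
    "longitude,latitude,housing_median_age,total_rooms,total_bedrooms,"
    "population,households,median_income,median_house_value,ocean_proximity\n"
    "-122.2,37.9,40.0,900.0,130.0,320.0,125.0,8.3,450000.0,NEAR BAY\n"
    "-122.2,37.8,21.0,7100.0,1100.0,2400.0,1140.0,8.3,358000.0,NEAR BAY\n"
    "-118.3,34.1,35.0,1500.0,300.0,900.0,280.0,3.1,210000.0,NEAR OCEAN\n"
    "-117.1,32.7,28.0,2100.0,450.0,1200.0,430.0,2.9,160000.0,NEAR OCEAN\n"
    "-119.8,36.7,15.0,3000.0,600.0,1700.0,560.0,2.5,90000.0,INLAND\n"
    "-121.5,38.6,30.0,1800.0,350.0,950.0,340.0,3.8,140000.0,INLAND\n"
)

# With the real file on disk this would be: pd.read_csv("housing.csv")
data = pd.read_csv(sample_csv)

# head() returns the first five rows by default.
first_five = data.head()
print(first_five)
```

With the real file, only the argument to pd.read_csv changes; the rest of the exploration works on the resulting DataFrame in the same way.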

The data.shape attribute tells us how many rows and columns are present in the dataset. Our dataset has 20640 rows and 10 columns. After loading the dataset, we can use a number of Pandas functions to find the missing values in each column. Here is the link to the official Pandas documentation, which covers these functions (http://pandas.pydata.org/pandas-docs/stable/).
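As a rough illustration of checking the shape and counting missing values, the sketch below uses a tiny made-up DataFrame. The column names and the positions of the NaN values are assumptions for the example and say nothing about where gaps occur in the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame with two deliberately missing entries.
data = pd.DataFrame({
    "total_rooms": [900.0, 7100.0, 1500.0, 1300.0],
    "total_bedrooms": [130.0, np.nan, 190.0, np.nan],
    "households": [125.0, 1140.0, 180.0, 220.0],
})

# shape is an attribute (no parentheses): (number of rows, number of columns).
n_rows, n_cols = data.shape
print(n_rows, n_cols)

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = data.isnull().sum()
print(missing_per_column)
```

On the real dataset, data.info() gives a similar overview in one call, listing the non-null count and dtype of every column.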

To see the skewness of the data, we use histogram plots, which show the distribution of each column.

After plotting a histogram for each column, we can spot the outliers and see the overall distribution of the data. In the histograms, the households column is skewed to the right.
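The histogram step might look something like the sketch below. It draws synthetic data rather than the real housing columns: the households values come from a lognormal distribution chosen only because it is right-skewed, and the bin count and figure size are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in columns (not the real housing data).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "households": rng.lognormal(mean=6.0, sigma=0.7, size=1000),
    "housing_median_age": rng.uniform(1, 52, size=1000),
})

# One histogram per numeric column; returns an array of Axes objects.
axes = data.hist(bins=50, figsize=(10, 4))

# A positive skewness value means a longer tail on the right.
skewness = data.skew()
print(skewness)

# In a notebook the figure renders inline; in a script, call
# plt.savefig(...) or switch to an interactive backend and plt.show().
plt.close("all")
```

Reading the skew() output alongside the plots is a quick way to confirm what the eye sees: a long right tail shows up as a positive skewness value.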

Scatter plot:

We will plot a scatter plot of longitude against latitude, which traces out the map of California. The alpha parameter makes the points semi-transparent, so the denser areas of the map stand out.
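Here is a small sketch of the scatter plot, again on synthetic coordinates: a uniform spread of points plus one tight cluster, so the effect of alpha is visible. The coordinate ranges and alpha=0.1 are illustrative assumptions, not values taken from the original analysis.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic points: a sparse background plus one dense cluster.
rng = np.random.default_rng(0)
background = pd.DataFrame({
    "longitude": rng.uniform(-124.0, -114.0, size=500),
    "latitude": rng.uniform(32.5, 42.0, size=500),
})
cluster = pd.DataFrame({
    "longitude": rng.normal(-118.2, 0.2, size=500),
    "latitude": rng.normal(34.0, 0.2, size=500),
})
data = pd.concat([background, cluster], ignore_index=True)

# Low alpha makes single points faint, so overlapping points in dense
# regions add up to darker patches.
ax = data.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

plt.close("all")
```

Without alpha, every point is drawn fully opaque and the dense cluster is indistinguishable from a merely well-covered area.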

In Part 2, we are going to use more Matplotlib functions, explore the data further to find insights, and create new features before feeding the data to machine learning algorithms.

Stay tuned for Part 2.


Book: Hands-On Machine Learning with Scikit-Learn and TensorFlow


