“When a man is denied the right to live the life he believes in, he has no choice but to become an outlaw.” ― Nelson Mandela
Predictions, forecasts and loss scores. Sound too mainstream, don’t they? In the era of growing interest in Machine Learning and its algorithms, we tend to neglect one of the most important duties of a data scientist: Data Exploration.
We modern data scientists are so naive that we forget the beauty of visualization and the insight it offers. Today, allow me to present an Exploratory Data Analysis of the Kaggle dataset: Crime in Chicago.
The Crimes in Chicago Dataset
I will be using the code and visualizations from my kernel, which you can find here: Chicago Crime Mapping
Chicago Crime Mapping — At the time of editing
So, before starting off with the analysis, let me brief you about the dataset. According to its description:
This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system. In order to protect the privacy of crime victims, addresses are shown at the block level only and specific locations are not identified. Should you have questions about this dataset, you may contact the Research & Development Division of the Chicago Police Department at 312.745.6071 or [email protected].
Essentially, this dataset contains the type of crime, its location, the sub-category of the crime, the type of vicinity and whether an arrest was made.
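Before exploring anything, the CSV has to be loaded with its dates parsed. Here is a minimal sketch of that step on a toy two-row CSV; the column subset and values are illustrative, not the full Kaggle file, which has many more columns.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for the Kaggle CSV (the real file has many more columns)
csv = io.StringIO(
    "ID,Date,Primary Type,Location Description,Arrest\n"
    "1,01/01/2016 03:00:00 AM,THEFT,STREET,False\n"
    "2,01/02/2016 11:30:00 PM,BATTERY,RESIDENCE,True\n"
)

# parse_dates turns the Date strings into proper timestamps, which we need later
df = pd.read_csv(csv, parse_dates=["Date"])
print(df["Primary Type"].tolist())
```

With `Date` parsed as a timestamp, monthly and daily aggregations later in the analysis become one-liners.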
The very first step may be to check whether the dataset contains any null values, and I used a heatmap to do exactly that.
Viridis Heatmap
Looking at our heatmap, we can safely conclude that not many values are missing, so we can simply drop the incomplete rows.
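The null check and drop can be sketched as below on a toy stand-in for the kernel's `df` (the columns and rows here are illustrative only); `df.isnull().sum()` prints the same per-column information the heatmap shows visually.

```python
import pandas as pd

# Toy stand-in for the crimes DataFrame (columns are illustrative)
df = pd.DataFrame({
    "Primary Type": ["THEFT", "BATTERY", None, "THEFT"],
    "Location Description": ["STREET", "RESIDENCE", "STREET", None],
})

print(df.isnull().sum())  # per-column missing counts (what the heatmap visualizes)
df = df.dropna()          # drop the few incomplete rows
print(len(df))
```

On the real dataset, `sns.heatmap(df.isnull())` renders the same missing-value pattern as the figure above.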
I was curious to see how many crimes were reported in these 5 years, and here is the breakdown by location:
STREET                            325084
RESIDENCE                         223854
APARTMENT                         179444
SIDEWALK                          158478
OTHER                              53474
PARKING LOT/GARAGE(NON.RESID.)     40907
ALLEY                              31239
RESIDENTIAL YARD (FRONT/BACK)      30209
SMALL RETAIL STORE                 28209
SCHOOL, PUBLIC, BUILDING           25474
Name: Location Description, dtype: int64
Pretty high, for a span of 5 years.
One may be wondering where the crimes happened most. Is it the dirty streets, notorious residences or unguarded parking lots? We can check for ourselves using this snippet:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 10))
sns.countplot(y='Location Description', data=df,
              order=df['Location Description'].value_counts().iloc[:10].index)
Location Semantics
Apparently the streets are the least safe of all, with residences and apartments following close behind.
Let’s have a closer look at the unique locations where the crimes have taken place and use Folium to map them. You can use this snippet to recreate my map.
import folium

chicago_map_crime = folium.Map(location=[41.895140898, -87.624255632],
                               zoom_start=13,
                               tiles="CartoDB dark_matter")

for i in range(500):
    lat = CR_index['LocationCoord'].iloc[i][0]
    long = CR_index['LocationCoord'].iloc[i][1]
    radius = CR_index['ValueCount'].iloc[i] / 45

    if CR_index['ValueCount'].iloc[i] > 1000:
        color = "#FF4500"
    else:
        color = "#008080"

    popup_text = """Latitude : {}<br>
Longitude : {}<br>
Criminal Incidents : {}<br>"""
    popup_text = popup_text.format(lat, long, CR_index['ValueCount'].iloc[i])

    folium.CircleMarker(location=[lat, long], popup=popup_text,
                        radius=radius, color=color,
                        fill=True).add_to(chicago_map_crime)
Map of crimes
Here, an orange circle means that more than 1000 crimes took place at that particular location, while the rest are self-explanatory. Clicking on a circle shows the coordinates and the number of crimes committed at that particular (Latitude, Longitude).
An example of details
I have a special interest in thefts and public peace disruptions, but let’s save the latter for later. For now, let’s focus on the types of theft committed around Chicago in these 5 years.
Type of thefts in Chicago from 2012- 2017
Well, the $500 thefts clearly dominate, don’t they?
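The breakdown behind that graph can be sketched as follows: filter the frame to thefts, then count the `Description` sub-categories. The toy data and category labels below are illustrative stand-ins for the real columns.

```python
import pandas as pd

# Toy data; the real kernel filters the full frame on Primary Type == 'THEFT'
df = pd.DataFrame({
    "Primary Type": ["THEFT", "THEFT", "THEFT", "THEFT", "BATTERY"],
    "Description": ["$500 AND UNDER", "$500 AND UNDER", "OVER $500",
                    "POCKET-PICKING", "SIMPLE"],
})

df_theft = df[df["Primary Type"] == "THEFT"]
theft_types = df_theft["Description"].value_counts()
print(theft_types.index[0])  # the dominant theft category in this toy sample
```

Feeding `theft_types` into `sns.countplot` (as with the locations earlier) produces the bar chart above.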
If that’s not enough, let’s look at how these thefts are distributed across the months. Have a look at this graph and allow me to explain the sudden plunge in the statistics.
Thefts Per Month
Well, no Superman or Batman arrived in the city in August to protect justice. It was just a quirk of the aggregation that produced a NaN value for August, which I replaced with 0 (because I am lazy).
Here is the code, if you don’t believe me:
theft_in_months = pd.DataFrame({"thefts": df_theft['Month'].value_counts(),
                                "month": df_theft['Month'].value_counts().index},
                               index=range(12))
theft_in_months.fillna(0, inplace=True)
theft_in_months = theft_in_months.sort_values(['month'], ascending=[1])
theft_in_months.head()
You can try re-sampling the dataset by date, and you will find that it contains 1854 days, to be precise. Want to see how many crimes were committed on each single day? Have a look at this graph then.
Thefts on a daily basis
Here, the noticeable trend is a rise at the start of the year, peaking around the middle, somewhere in June or July. After that, the curve drops just as sharply back to the level at which the year started!
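The daily re-sampling can be sketched like this on a three-row toy frame (an assumed stand-in for the real `df_theft` with its parsed `Date` column). Note that `resample('D')` fills quiet days with zeros rather than leaving gaps, which is exactly the behaviour that produces a continuous daily curve.

```python
import pandas as pd

# Toy stand-in for df_theft with a parsed Date column
df_theft = pd.DataFrame({
    "Date": pd.to_datetime(["2016-01-01 03:00", "2016-01-01 14:00",
                            "2016-01-03 09:00"]),
    "ID": [1, 2, 3],
})

# One bin per calendar day; size() counts the incidents that fall into each bin
thefts_per_day = df_theft.set_index("Date").resample("D").size()
print(len(thefts_per_day))     # 3 calendar days spanned (Jan 1 - Jan 3)
print(thefts_per_day.iloc[1])  # 0 -- quiet days appear as zero, not as gaps
```

Plotting `thefts_per_day` over the full dataset yields the daily curve shown above.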
I promise that this is the last area of research in this article.
Anyway, if one focuses on the types of Public Peace Violations and their numerical distribution, one can easily point out that Reckless Conduct is the leader in this area, and (thankfully) not bomb and arson threats.
Types of Public Peace Violations
While we are at it, let’s have a look at the peace disruption incidents around Chicago. In this map, an orange circle represents a location where peace disruptions exceeded a count of 30 over these 5 years, making it a rather sensitive spot to tread on.
Peace Disruption Locations
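This map reuses the colouring logic from the theft map, with the threshold lowered from 1000 to 30. A minimal sketch of just that rule, on a hypothetical per-location count frame (`peace_index` and its values are illustrative, not the kernel's actual variable):

```python
import pandas as pd

# Hypothetical per-location counts; the real index comes from value_counts over coordinates
peace_index = pd.DataFrame({
    "LocationCoord": [(41.88, -87.63), (41.90, -87.62), (41.85, -87.65)],
    "ValueCount": [45, 12, 31],
})

# Same colour rule as the theft map, with the threshold lowered from 1000 to 30
colors = ["#FF4500" if n > 30 else "#008080" for n in peace_index["ValueCount"]]
print(colors)
```

Each colour then goes into a `folium.CircleMarker`, exactly as in the earlier mapping snippet.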
As you may have already guessed, this is not a coding tutorial but a potential project starter. You can use this EDA in your own notebooks, keeping the Apache 2.0 License in mind, and build your prediction models out of these ideas.
Or any other idea which may strike your mind.
Until next time, peace out.
Uddeshya Singh