Originally Posted Here
Whether you are just breaking into data science or looking to improve your data science skills, books are a great way to get a base-level understanding of specific topics. Now, we personally believe nothing beats experience, but in lieu of that, taking a course or reading a book is a great way to build a foundation that you can draw on later when you are trying to approach data science practically.
In data science, there are many topics to cover, so we wanted to focus on several specific ones. This post will cover books on Python, R programming, big data, SQL and just some generally good reads for data scientists.
Heads Up! — This post contains referral links from Amazon Services LLC Associates Program, an affiliate advertising program designed to provide a means for us to earn fees by linking to Amazon.com and affiliated sites.
As a data scientist, you have a very important role. Your goal is to provide your company insights into improving the company’s bottom or top line. The problem is, we can make data say anything we want. It can be very easy to manipulate data to prove that our feature was effective, and it can be tempting if the company incentivizes that type of behavior.
Thus, a great general read for data scientists (and really anyone in our modern world) is Naked Statistics. It is kind of like the much older book How To Lie With Statistics, which you can read for free.
We do prefer Naked Statistics because it is a little more modern and covers much more complex statistical debauchery than its much older counterpart. It just goes to show you that numbers are at your whim and you have a lot of responsibility to make sure your numbers are right. If something seems amiss with your data, it probably is. Rather than reporting it out right away, think about how you might unknowingly be misrepresenting the facts.
Another similar book is Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are.
Now, this book takes the perspective of the people creating the data (us) rather than the data scientist. It discusses how surveys, research, and reporting have all skewed data because, well, we lie. In contrast, what we tell the internet is often very truthful. We search when we are sick, when we are hungry, when we are in love, etc. We google great ideas for dates or “am I dying?” We are literally telling the internet what we are thinking about. This book covers that, as well as how it impacts our ability as data scientists to create accurate models based on this data.
Another book that we have read and found helpful in our journey is Storytelling with Data. One thing we enjoyed about this book is that it doesn’t just cover what to do, but also what not to do. When you first start developing charts and models, it is tempting to clutter them with every possible feature that Tableau and D3 offer us. But, honestly, those features might drown out the impact you are trying to make. This book takes an entire chapter to discuss avoiding clutter, and it is great for those of us who need to remember to hold back.
Finally, for the general data science and machine learning type of books, we wanted to recommend Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. You can actually read this book for free here (the link will download the book automatically). This book tends to be language agnostic; it abstracts the programming away and focuses on actual data problems.
For instance, one concept that is not always clear is how to analyze events on a personal level over time. What we mean is, how do we analyze the impact of an event when everyone experiences it at a different time? This book will show you how to line up data to analyze this type of problem more accurately, but it won’t give you the specific SQL or Pandas code written out. This is one of the better ways to understand data science without being locked into a specific language, so we really appreciate this book. Plus, there is a free option, so you don’t even have to buy it.
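To make the "lining up" idea a bit more concrete, here is a minimal sketch in Pandas. It is our own illustration, not code from the book, and the data, column names, and the choice to average purchases are all made-up assumptions. The idea is simply to re-index each person's records by days relative to their own event date, so that everyone's "day 0" lands in the same place.

```python
import pandas as pd

# Hypothetical activity log, invented for illustration: one row per user per day.
activity = pd.DataFrame(
    {
        "user": ["a", "a", "a", "b", "b", "b"],
        "date": pd.to_datetime(
            ["2024-03-01", "2024-03-02", "2024-03-03",
             "2024-03-09", "2024-03-10", "2024-03-11"]
        ),
        "purchases": [1, 2, 4, 0, 3, 5],
    }
)

# Hypothetical date on which each user experienced the event
# (for example, the day a new feature was turned on for them).
event_dates = pd.DataFrame(
    {
        "user": ["a", "b"],
        "event_date": pd.to_datetime(["2024-03-02", "2024-03-10"]),
    }
)

# Attach each user's own event date to their rows.
aligned = activity.merge(event_dates, on="user")

# Replace calendar time with time relative to the event: -1, 0, +1, ...
aligned["days_from_event"] = (aligned["date"] - aligned["event_date"]).dt.days

# Now users who hit the event on different calendar days can be compared directly.
impact = aligned.groupby("days_from_event")["purchases"].mean()
print(impact)
```

Once the data is keyed on days from the event instead of the calendar date, a plain group-by is enough to compare behavior before and after the event across users.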
R free book!
The constant question about R vs Python is somewhat irrelevant. At the end of the day, where you work will more likely define what language you use. It is better to have a general data science process that works well than to pigeonhole yourself into one or the other. Nevertheless, here are some great R books, and one is even free!
Our first R programming book is a free book in PDF format. It is a classic and honestly cliche book: An Introduction to Statistical Learning: With Applications in R!
If you ask any data scientist what some of their favorite books are, this is almost guaranteed to be one of them (even if they didn’t read it). Now if you really want to buy the book, then go ahead :). A forewarning: this book tends to be very technical on the math side. This can be very discouraging for some new data scientists. If that is the case for you, then please check out the books below. It is important to understand the math behind the models, but sometimes the models are so abstracted by Greek symbols that unless you understand them, you will not get the explanations in this book.
Another classic is R Cookbook: Proven Recipes for Data Analysis, Statistics, and Graphics. This is a great intro book because it really only covers linear regression, graphics, and time series analysis in depth. There are several other subjects discussed, but they are not as statistically focused or as in-depth as the subjects mentioned above. The value of this book is still very much there! In fact, we would recommend you review your statistics first, then read this book, and then read An Introduction to Statistical Learning: With Applications in R (now we need to create a study guide for our next post).
Another book, one we weren’t as big fans of, is Hands-On Programming with R: Write Your Own Functions and Simulations. The work this book takes you through just wasn’t as practical in our experience. Most of the data science work we have done does not require simulations. This is not to say it might not be valuable for some, but this would be a book we would first recommend you
Python has a lot of different applications. It can be used operationally to automate system admin work, to develop websites, and for analytics. Today we are going to focus on the analytical, data science, and machine learning libraries that Python has to offer.
To start off, there is Pandas, an analytical library in Python that allows you to manipulate data frames and perform basic analytics and transformations on data. Plenty of universities provide free PDFs on Pandas that you can look through. If you prefer hard-copy books, you can first check out O’Reilly’s 10-day free trial before buying any of the O’Reilly books we reference.
The classic Pandas book is Python for Data Analysis.
This book will cover all the basics, like data aggregations and time series, while at the same time taking you through basic Python exercises that will help you learn how to apply Pandas to actual problems. This is one of the traits we look for in books or courses. We tend to need actual problems that we can apply our skills to. It makes it easier to frame and approach the problems. This book, as it states, really will take you through the “nuts and bolts of manipulating, processing, cleaning, and crunching data in Python”. These skills are crucial as a data scientist because most of your time will be focused on cleaning and processing data.
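For a feel of what "data aggregations and time series" look like in Pandas, here is a small sketch of our own (it is not taken from the book, and the sales data is invented for illustration). It shows one group-by aggregation and one weekly resample.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration.
sales = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2024-01-01", "2024-01-02", "2024-01-08", "2024-01-09", "2024-01-09"]
        ),
        "region": ["east", "west", "east", "west", "east"],
        "amount": [100.0, 150.0, 200.0, 50.0, 25.0],
    }
)

# Aggregation: total and average sale amount per region.
by_region = sales.groupby("region")["amount"].agg(["sum", "mean"])

# Time series: total sales per calendar week (weeks ending on Sunday).
weekly = sales.set_index("date")["amount"].resample("W").sum()

print(by_region)
print(weekly)
```

Most day-to-day cleaning and summarizing work ends up being some combination of these two patterns: group rows by a key, or bucket them by time, and then aggregate.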
This is probably one of the few Pandas books we would recommend. There are many other data science and machine learning books we will also add to this list shortly. But there aren’t that many more books that are purely focused on Pandas that we would recommend.
The next important library that is worth learning about is Scikit-learn. This is the library that contains all the typical models you might read about (SVM, KNN, and all the other abbreviations).
Machine Learning with Python Cookbook: Practical Solutions from Preprocessing to Deep Learning.
Yes, another O’Reilly book. They tend to be our personal favorites because the coding examples are often much more practical than some basic tutorials. This book will start light with topics like linear regression and KNN and then go into deep learning concepts like neural networks.
Finally, we have one more book that we have read and actually enjoyed. If you would prefer learning about TensorFlow, then skip the book above and just use this one instead. It is slightly lighter on the machine learning (but still very thorough), and it has a second half 100% dedicated to neural networks. It covers convolutional neural networks, autoencoders, dropout and other very important topics in deep learning with TensorFlow.
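If you have never touched Scikit-learn, a KNN model takes only a few lines. The sketch below is our own minimal example, not one from the book. It assumes the iris dataset that ships with Scikit-learn, and the 75/25 split and `n_neighbors=5` are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small example dataset bundled with Scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# K-nearest neighbors: classify each point by a vote among its 5 closest neighbors.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

The same fit and score pattern applies to most other Scikit-learn models, which is a big part of why the library is so approachable.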
There are some books we were not as big fans of. One is Data Science From Scratch.
This book is not really great for getting an in-depth understanding of using Python for data science or machine learning. It is more of a cheat-sheet-style book, and it won’t be as in-depth when it comes to covering practical examples. We were curious because of its price but found it lackluster in the application.
Managing big data tools like Kafka, Hadoop and Cassandra isn’t often required of a data scientist, especially if they work at a large company. However, we do know that some people would like to at least scratch the surface of these concepts, and if you work for a startup, they would be worth knowing. For this post, we will only be covering Hadoop, but we will be working on a follow-up post that covers books on tools like Spark, Cassandra, Apache Storm and more.
There are two books we like when it comes to Hadoop. The first is Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale. This is really a great book that covers just about every Hadoop topic you need to know. It starts out explaining Hadoop and MapReduce and then goes into very important subjects like file types, administration, Sqoop, and HBase. These are all very important subjects if you plan on setting up a Hadoop cluster. This book is very broad and is more for someone who wants to know about Hadoop so they can talk shop.
If, instead, your goal is to really start designing Hadoop for big data, then try Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. What this book does really well is discuss design principles and trade-offs. What we find is that many people who like the idea of Hadoop but have never used it don’t understand the trade-offs. Why would you prefer using a NoSQL database vs a relational database?
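If MapReduce is a new term for you, the core idea fits in a few lines of plain Python. This is only a toy, single-machine sketch of the map, shuffle, and reduce steps using the usual word-count example. It does not use Hadoop or any Hadoop API, and the function names are our own.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Made-up input; in a real cluster each document could live on a different node.
documents = ["big data tools", "big data books", "data science"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(pairs))
print(word_counts)
```

In Hadoop, the map and reduce steps run in parallel across many machines and the framework handles the shuffle, but the shape of the computation is the same.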
Do you really need the extra cost of having the skilled labor required to manage Hadoop? These thoughts sometimes get missed due to the hype, so we enjoyed the fact that this book took the time to cover this subject.
As far as SQL books go, we believe the more practical examples provided, the better, especially if they go beyond just trying to calculate the average purchase from a customer. So, in that case, we would recommend Practical SQL: A Beginner’s Guide to Storytelling with Data.
One point about this book is that it is for PostgreSQL. If you would prefer a more traditional database, you could check out Learning SQL: Master SQL Fundamentals.
What we like about the Practical SQL book is the fact that it does a great job walking through the basics and then helping you apply those basics to some more complex problems. In addition, learning some Postgres along with watching a few videos on SQL Server or MySQL really will give you a great base. You will be able to start to differentiate the pros and cons of the various systems. This is less valuable as a data scientist but would be very important if you become an engineering manager who needs to make decisions like what database system you will base your entire company/team on.
We do hope this list helps you on your journey to becoming a data scientist or just improving your data science skills. If there are books we missed, then please comment about them below. We are always looking for new books and opportunities to learn new topics.
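As a small taste of going one step past "the average purchase from a customer", here is a sketch of our own, not an example from either book. To keep it self-contained it runs the SQL through Python's built-in sqlite3 module rather than PostgreSQL, and the table, column names, and rows are invented. It keeps only repeat customers and reports when each one first bought.

```python
import sqlite3

# In-memory SQLite database so the example needs no server (an assumption made
# for convenience; the book itself uses PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (customer TEXT, purchased_on TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("ann", "2024-01-05", 20.0),
        ("ann", "2024-02-10", 40.0),
        ("bob", "2024-01-20", 15.0),
        ("cat", "2024-01-07", 30.0),
        ("cat", "2024-01-21", 10.0),
        ("cat", "2024-03-02", 50.0),
    ],
)

# Average purchase per customer, restricted to customers who bought more than
# once, along with their purchase count and first purchase date.
query = """
SELECT customer,
       COUNT(*)          AS n_purchases,
       AVG(amount)       AS avg_amount,
       MIN(purchased_on) AS first_purchase
FROM purchases
GROUP BY customer
HAVING COUNT(*) > 1
ORDER BY customer
"""
repeat_customers = conn.execute(query).fetchall()
conn.close()

for row in repeat_customers:
    print(row)
```

The query itself is plain SQL (GROUP BY plus HAVING), so it should carry over to Postgres, SQL Server, or MySQL with little or no change.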
For further reading and videos on data science, SQL and Python:
How Algorithms Can Become Unethical and Biased
How To Load Multiple Files With SQL
How To Develop Robust Algorithms
Dynamically Bulk Inserting CSV Data Into A SQL Server
4 Must Have Skills For Data Scientists
SQL Best Practices — Designing An ETL Video