In our blog about ‘Is SQL Needed to be a Data Scientist?, we discussed the importance of SQL and basics of how knowledge of SQL helps with a successful data science career. In this blog, let us take that forward and discuss the same in more detail.
if you have some basic knowledge of SQL and are preparing for an interview, read our list of SQL Interview Questions, which will help get you the job. However, before getting into that, it will be useful to know why and how SQL holds such an important stature in the field of data science.
Data science is a field that uses many scientific methods and processes to collect data and get useful insights from them. Doing this creates new opportunities, designs, business cases and even revolutionary innovations.
The future of AI lies in Data science. Data science involves use of algorithms, software programs and statistics to collate and process huge complex chunks of data. The complexity of data comes from its unstructured nature – data today can be in any form – text, image, output from sensors and instruments and what not!
As of today, the benefits of data science can be felt in domains like e-commerce, healthcare, manufacturing, banking, finance and transport.
To learn about SQL, we need to understand how a DBMS works. DBMS or Database Management System is essentially a software to create and manage databases. With DBMS, programmers can create, manage or select data.
Database acts as an interface between end users and application programs to access data in an organized and consistent manner. For example, a user logs in to Facebook.
Based on the user’s registered email address and password, API calls are made from the application to the database and user details are retrieved.
In a relational DBMS or RDBMS, SQL is the API through which users and applications access data. Data can be accessed in the form of tables and its columns.
The relationships between the tables is such that there is no need to reorganize the data or the tables. Using SQL statements, we can get information about multiple tables and their relationships.
SQL or Structured Query Language is a ‘programming language’ that manages data in a relational database through ‘queries’. By using SQL, we can insert, update, delete and select data based on various filters and conditions. SQL also helps run certain scheduled scripts automatically, that need to run regularly.
With SQL, Developers and data analysts can easily code and run queries. For example, select * from customer; is a simple query that gives details of the customer table.
Analysts can access huge amounts of data, process it without having to copy or store data anywhere else.
It is easy to analyze data using SQL queries than using excel sheets or any other methods. There are a lot of aggregation functions in SQL that can work on large datasets and multiple tables.
To understand how SQL is used in data science, we need to know the different phases of data science.
There are 3 important phases in the lifecycle of Data Science –
After discovery of data from various sources, data has to be prepared and structured. That means, all the redundant and duplicate data should be removed and a neat structure should be formed so that it is easy to model the data. This is called DBMS normalization and all relational databases follow this approach.
With SQL, structuring and conditioning of the data can be done easily because SQL has the concept of primary key and constraints, wherein the redundancy and dependency can be removed and proper relationship can be established between different data types.
SQL can also help with data cleaning, which means irrelevant, incorrect or corrupt records can be removed periodically by automatic batch scripts without any manual intervention. You can also use R for data cleaning and transformation, however using SQL in conjunction with R can reduce the amount of code to be written and bring in more flexibility.
Model planning and building totally relies on DBMS (Database Management Systems), particularly Structured query Language or SQL (if you did not know it already!). Big data is a big chunk of unstructured data and only a structured language can save our time trying to establish relationships between the variables of presented data.
A simple example of modelling will be a customer who is trying to purchase a postpaid mobile connection with a plan, subscription and a few value-added services.
Each of these can be a table in the relational DBMS and the relationship between them can be established using a common key like the customer_id, phone_number or subscription_id which can be unique. Thus, we are creating entities (tables), attributes (columns), relationships (customer_id linked to subscription_id or phone_number etc…) and integrities (customer_id should be unique, phone_number and subscription_id together can be a composite key and so on).
We can also create triggers, for example, whenever the customer changes any value-added services, a trigger enables the service to be activated automatically after performing the necessary eligibility checks.
Data mining is a process where huge sets of data is examined for patterns and trends for evaluating data to solve a business purpose. The process becomes easier when large datasets are organized and structured.
As a continuation, let us say we want to identify what are the value-added services that most users are interested in. Since every person nowadays has at least one mobile phone and the number of services that can be added are unlimited, this will be a huge data to consolidate, sort and examine.
SQL makes things easier by giving particular records based on specific user query. For example, if we want to know the users who have activated ‘ringtone’ services during calls, we can write a query as –
select customer_id from value_added_service table where service = ‘ringtone’ and is_service_enabled = ‘Y’ and service_end_date > NOW();
For complex queries indexing can be done to fetch results faster.
We can also use in-built SQL functions to list the data in ascending order based on service_end_date, group by a certain order and so on.
You might want to argue that excel sheet with huge amount of data can also fetch the results and do some filtering operations as well. It is true, but not comparable to the scale at which SQL stores its data. But how much SQL is enough for you to be a data scientist?
There is no definite answer, however we have tried to squeeze in a lot of information that will help you grasp the basic idea. If you already know SQL, you will be familiar with most of it, if you are not, don’t worry, this blog will give you good insights of the basics.
Most of the time, the database and schema will be already present with all the relationships established. As a data scientist, you will need to understand the schema, relationships and how you can query the database to get the best of the results from a business perspective.
Let us create some data and then work our way to fetch this data using the different techniques we will learn below.
Below are the most important statements and commands that you will use on a daily basis.
Selecting rows and columns
The basic query that fetches all the data from a table is –
select * from table_name;
Let us get some records from student table.
Doing
select * from student;
will fetch us all the rows and columns present in the table.
What if we only wanted the first name and the date of birth? Then we should specify the column names in the query.
select first_name, date_of_birth from student;
first name date_of_birth
-------------------------------------------
Maria 17-11-1984
Gopal 04-02-1983
Keshav 14-12-1983
Aaron 05-03-1984
Naman 06-06-1984
We can also use some expressions to display the data in a particular way. For example,
select first_name + ‘ ‘ + last_name as full_name from student;
You will get something like –
full_name
------------
Maria Thomas
Gopal Chandra
Keshav Sharma
Aaron V
Naman K
Filtering rows based on conditions
Most of the times, we need only particular trends and patterns. These patterns can be our filters, based on which we will get only those records from database which are relevant and satisfy our conditions. For example,
If we want to get students of the teacher Brunda only, we can use a where clause.
select first_name, date_of_birth from student where teacher_name = ‘Brunda’;
It is also easy and efficient to use grouping and sorting functions. To start with, if we want to group the records based on the deptt, we can add a group by clause –
select first_name, date_of_birth from student where adm_year = 2001 group by deptt;
‘Having’ is another filter similar to the where clause which we can apply on group functions. For example, we want to have all the records of ECE dept with admission date after 2001, we can write the query as –
select first_name, deptt, date_of_birth from student group by deptt having adm_year > 2001;
This will fetch something like –
first_name deptt date_of_birth
Maria ECE 17-11-1984
Naman ECE 06-06-1984
Keshav CSE 14-12-1983
It can so happen that there may be similar records and we want to get only one of them. For example, the deptt (departments) column. If we want to list all the departments present in the database, we can use the keyword distinct. This will give all the unique values in the column.
Same way, we can get the count of students based on certain conditions, find sum, average, find records that have a particular date of birth and many more things.
Aggregate functions are used with group by, having clause to get a scalar value from a select statement. The most common functions are – SUM, AVG, COUNT, MAX, MIN. Except count function, all others ignore a NULL value. In the above example, if we want to get count of students in ECE deptt, we can form a query as –
select count(*) from student where deptt = ‘ECE’;
Similarly, if we want to get the average percentage obtained by CSE students of all grades, we can use the AVG function and so on.
A relational database, as we know has multiple small tables that can be mapped to each other. To obtain data from multiple tables in a single result, we join the tables and display the required columns. There are different types of joins –
If you would like to get into details of each or any of the above, this interactive course caters to all the SQL that is required for you to be a data scientist and all that we have discussed above.
Microsoft has come up with a nice tool that a lot of organizations use to analyze the huge data spread across multiple databases. Abbreviated as SSAS, this online analytical processing and data mining tool includes a lot of services like modeling, integration, reporting and analysis. For more information, check the information from the official microsoft page here.
There are many sub fields in data science, but irrespective of that, SQL remains an important ingredient in the Data science dish. Without the essence of SQL, your entry into this field would be incomplete. Learning and implementing SQL will go a long way in helping you think of more creative ideas and turn your data into useful business use cases or insights.