A data science interview consists of multiple rounds. One of such rounds involves theoretical questions, which we covered previously in . 160+ Data Science Interview Questions After you successfully pass it, there’s another round: a technical one. It typically involves live coding and the purpose is to check if a candidate can program and knows SQL. In this post, we’ll cover the questions you may receive during this technical interview round. This post is a summary of my interviewing experience — from both interviewing and being interviewed. It includes questions I ask when interviewing candidates as well as questions I was asked when I was looking for a job. This post covers the following topics: SQL Simple coding (to check Python) Algorithmic problems Remember that it’s totally fine if you don’t know how to solve some of these problems. Be transparent about it — tell your interviewer that you don’t know how to solve it. They will give you a hint, or, maybe, a different question. Let’s start! Live Coding Often, technical rounds are done remotely, over Zoom or Hangouts or something similar. The interviewer shares a link to something like , where the actual coding happens. codeshare Sometimes, candidates are asked to prepare their favorite environment and simply share their screens during the interview. Or it could be an offline interview with a whiteboard instead of a computer — or even with a piece of paper and a pencil. The way the interview goes really depends on the company. Often, during one hour, you get a few tasks of increasing complexity and you have to solve them one by one. There could be one round for checking SQL and one for checking Python. Or it could be none for SQL and all with algorithmic problems. Now, let’s start with the actual questions. First, we’ll cover SQL. SQL A data scientist is supposed to be fluent with SQL: the data is stored in databases, so being able to extract this data from there is essential in our job. That’s why data scientists are checked for knowledge of SQL. Suppose we have the following schema with two tables: Ads and Events Ads(ad_id, campaign_id, status) Events(event_id, ad_id, source, event_type, date, hour) Each ad can be active or inactive, and this is reflected in the status field. There are different event types: impression (ad is shown to the user) click (the user clicked on the ad) conversion (the user installed the app from the advertisement) We want to write a couple of queries to extract data from these tables. Write SQL queries to extract the following information: The number of active ads. 1) All active campaigns. A campaign is active if there’s at least one active ad. 2) The number of active campaigns. 3) The number of events per each ad — broken down by event type. 4) The number of events over the last week per each active ad — broken down by event type and date (most recent first). 5) The number of events per campaign — by event type. 6) The number of events over the last week per each campaign — broken down by date (most recent first). 7) CTR (click-through rate) for each ad. CTR = number of impressions / number of clicks. 8) CVR (conversion rate) for each ad. CVR = number of clicks / number of installs. 9) CTR and CVR for each ad broken down by day and hour (most recent first). 10) CTR for each ad broken down by source and day 11) Next, we’ll look at coding challenges. Coding (Python) A data scientist is expected to be able to program. Usually, in Python, but sometimes in R or Java or something else. That’s why it’s quite likely that you’ll get questions that check the ability to program a simple task. These tasks often aim at checking if candidates know the basics of Python, such as loops, simple data structures (lists, sets, dictionaries) and strings. Some of these questions may look simple for experienced developers. That’s on purpose — they are needed to check the basics only. However, you can get multiple questions of increasing difficulty during one round. So, let’s start. We’ll begin with the most famous simple question: . FizzBuzz Print numbers from 1 to 100 1) FizzBuzz. If it’s a multiplier of 3, print “fizz” If it’s a multiplier of 5, print “buzz” If both 3 and 5 — “fizzbuzz" Otherwise, print the number itself Example of output: 1, 2, Fizz, 4, Buzz, Fizz, 7, 8, Fizz, Buzz, 11, Fizz, 13, 14, Fizz Buzz, 16, 17, Fizz, 19, Buzz, Fizz, 22, 23, Fizz, Buzz, 26, Fizz, 28, 29, Fizz Buzz, 31, 32, Fizz, 34, Buzz, Fizz, ... . Calculate a factorial of a number 2) Factorial = 5! = 1 * 2 * 3 * 4 * 5 = 120 factorial(5) = 10! = 1 * 2 * 3 * 4 * 5 * 6 * 7 * 8 * 9 * 10 = 3628800 factorial(10) . Compute the mean of number in a list 3) Mean mean([4, 36, 45, 50, 75]) = 42 (use ) mean([]) = NaN float('NaN') . Calculate the standard deviation of elements in a list. 4) STD std([1, 2, 3, 4]) = 1.29 std([1]) = NaN std([]) = NaN . Calculate the RMSE (root mean squared error) of a model. The function takes in two lists: one with actual values, one with predictions. 5) RMSE rmse([1, 2], [1, 2]) = 0 rmse([1, 2, 3], [3, 2, 1]) = 1.63 . Remove duplicates in list. The list is not sorted and the order of elements from the original list should be preserved. 6) Remove duplicates ⇒ [1, 2, 3, 1] [1, 2, 3] ⇒ [1, 3, 2, 1, 5, 3, 5, 1, 4] [1, 3, 2, 5, 4] . Count how many times each element in a list occurs. 7) Count ⇒ [1, 3, 2, 1, 5, 3, 5, 1, 4] 1: 3 times 2: 1 time 3: 2 times 4: 1 time 5: 2 times . Is string a palindrome? A palindrome is a word which reads the same backward as forwards. 8) Palindrome “ololo” ⇒ Yes “cafe” ⇒ No . We have a list with identifiers of form “ ”. Calculate how many ids we have per site. 9) Counter id-SITE . We have a list with identifiers of form “ ”. Show the top 3 sites. You can break ties in any way you want. 10) Top counter id-SITE . Implement RLE (run-length encoding): encode each character by the number of times it appears consecutively. 11) RLE ⇒ 'aaaabbbcca' [('a', 4), ('b', 3), ('c', 2), ('a', 1)] (note that there are two groups of ) 'a' . Calculate the Jaccard similarity between two sets: the size of intersection divided by the size of union. 12) Jaccard = 1 / 4 jaccard({'a', 'b', 'c'}, {'a', 'd'}) . Given a collection of already tokenized texts, calculate the IDF (inverse document frequency) of each token. 13) IDF input example: [['interview', 'questions'], ['interview', 'answers']] Where: is the token, t is the number of documents that occurs in, n(t) t is the total number of documents N . Given a collection of already tokenized texts, find the PMI (pointwise mutual information) of each pair of tokens. Return top 10 pairs according to PMI. 14) PMI input example: [['interview', 'questions'], ['interview', 'answers']] PMI is used for finding collocations in text — things like “New York” or “Puerto Rico”. For two consecutive words, the PMI between them is: The higher the PMI, the more likely these two tokens form a collection. We can estimate PMI by counting: Where: is the total number of tokens in the text, N is the number of times and appear together, c(t1, t2) t1 t2 and — the number of times they appear separately. c(t1) c(t2) These questions can also be used to check the knowledge of NumPy — some of them may be solved in NumPy with just one or two lines. Next, we’ll look at a slightly different type of coding tasks — algorithmic questions. Algorithmic Questions In the previous section, we looked at coding questions. These questions have quite detailed instructions of what to do — and the candidates are expected to translate these instructions into Python code. There's a different kind of questions, with no detailed instructions. For these questions, the candidates should be able to figure out the solution on their own — of course, with hints. The goal of these problems is to “see how candidates think” and also check if they know algorithms and data structures. Sometimes, these questions are brain teasers, and sometimes they are questions from a textbook on algorithms. I call these types of questions “algorithmic”. I’m not a fun of such coding problems, but there are many companies that ask them. So let’s cover some of them. . Given an array and a number N, return if there are numbers A, B in the array such that A + B = N. Otherwise, return . 1) Two sum True False ⇒ [1, 2, 3, 4], 5 True ⇒ [3, 4, 6], 6 False . Return the n-th Fibonacci number, which is computed using this formula: 2) Fibonacci F(0) = 0 F(1) = 1 F(n) = F(n-1) + F(n-2) The sequence is: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ... . We have two dice of different sizes (D1 and D2). We roll them and sum their face values. What are the most probable outcomes? 3) Most frequent outcome ⇒ 6, 6 [7] ⇒ 2, 4 [3, 4, 5] . Write a function for reversing a linked list. 4) Reverse a linked list The definition of a list node: Node(value, next) Example: ⇒ a -> b -> c c -> b -> a . Write a function for rotating a binary tree. 5) Flip a binary tree The definition of a tree node: Node(value, left, right) . Return the index of a given number in a sorted array or -1 if it’s not there. 6) Binary search ⇒ [1, 4, 6, 10], 4 1 ⇒ [1, 4, 6, 10], 3 -1 . Remove duplicates from a sorted array. 7) Deduplication ⇒ [1, 1, 1, 2, 3, 4, 4, 4, 5, 6, 6] [1, 2, 3, 4, 5, 6] . Return the intersection of two sorted arrays. 8) Intersection ⇒ [1, 2, 4, 6, 10], [2, 4, 5, 7, 10] [2, 4, 10] . Return the union of two sorted arrays. 9) Union ⇒ [1, 2, 4, 6, 10], [2, 4, 5, 7, 10] [1, 2, 4, 5, 6, 7, 10] . Implement the addition algorithm from school. Suppose we represent numbers by a list of integers from 0 to 9: 10) Addition 12 is [1, 2] 1000 is [1, 0, 0, 0] Implement the “+” operation for this representation ⇒ [1, 1] + [1] [1, 2] ⇒ [9, 9] + [2] [1, 0, 1] . You’re given a list of words and an alphabet (e.g. a permutation of Latin alphabet). You need to use this alphabet to order words in the list. 11) Sort by custom alphabet Example (taken from ): here Words: ['home', 'oval', 'cat', 'egg', 'network', 'green'] Dictionary: 'bcdfghijklmnpqrstvwxzaeiouy' Output: ['cat', 'green', 'home', 'network', 'egg', 'oval'] . In BST, the element in the root is: 12) Check if a tree is a binary search tree Greater than or equal to the numbers on the left Less than or equal to the number on the right The definition of a tree node: Node(value, left, right) Most of these are “easy” algorithmic questions, but there are more difficult ones. To prepare, use resources like and practice a lot. LeetCode Once you solve a task, write down your approach — and use it later to come back to it for revisions. For inspiration, you can check and my solutions to some of LeetCode challenges: . my notes https://github.com/alexeygrigorev/leetcode-solutions Note that not many companies use these kinds of questions for data science interviews, only a few. On the other hand, if you interview for software engineer or ML engineer positions, you’re more likely to get them. Check with your recruiter if you need to prepare for it. That’s all! Thank you for reading it. I hope this list is useful for you for your interview preparation. This list is based on . Do you know the answers? Please contribute to and help others who don’t. this Twitter thread this GitHub repository with answers For updates, follow me on Twitter ( ) and on LinkedIn ( ). The cover picture is by from . @Al_Grigor agrigorev Nik MacMillan Unsplash

Alphabet

Twitter

Technical Data Science Interview Questions: SQL and Coding

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

160+ Data Science Interview Questions

104 Stories To Learn About Go

105 Stories To Learn About Functional Programming

100+ Free Pluralsight Courses to learn Python, Java, and Spring Boot

10 Websites to Learn JavaScript for Beginners

104 Stories To Learn About Programming Top Story

160+ Data Science Interview Questions

104 Stories To Learn About Go

105 Stories To Learn About Functional Programming

100+ Free Pluralsight Courses to learn Python, Java, and Spring Boot

10 Websites to Learn JavaScript for Beginners

104 Stories To Learn About Programming Top Story

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps