Nate at StrataScratch
Hi, it’s Nate here with a new video on 5 coding concepts that companies will test you on in data science interviews in 2021. As you guys know, I’m always tracking and collecting interview questions from various companies to better understand what technical concepts you need to know to succeed in interviews and what skills you need to have to succeed on the job. So these are questions that I’ve collected in the first few months of 2021. So they’re as real-time as can be.
The companies that I’m sourcing from today are Facebook, Postmates, Credit Karma, Ring Central, and Twitch. I’ll introduce the coding concepts to know and then show you a real interview question I collected from the company.
So let’s get started.
Categorizations, aggregations, ratios
Let’s start simple -- categorizations, aggregations, ratios -- (aren’t these 3 different concepts?) -- Yes, but they’re almost always tested together in one question. And it’s probably the most commonly tested concept on coding interviews, especially in the beginning rounds.
By categorizations, I mean CASE statements in SQL. I’ll stick with SQL technical concepts because I see most people use SQL on coding interviews.
Aggregations like taking the sum or average are always tested and usually are applied after the categorization takes place.
But they can also take the form of a ratio.
Here’s an example by Facebook (https://platform.stratascratch.com/coding-question?id=2005&python=):
Return share of monthly active users in the United States (US). Active users are ones with “open” status in the table.
Here’s what the solution would look like:
SELECT active_users /total_users: : float AS active_users_share FROM (SELECT count (user_id) total_users, Count (CASE WHEN status = 'open' THEN 1 ELSE NULL END) AS active_users FROM fb_active_users WHERE country = 'USA') subq
If you read this solution and it makes sense to you then congrats, you’re on the right track.
So as you can see, to solve this question, you need to understand how to properly write a case statement, properly aggregate the results of the case statement, and then take a ratio from the result of the aggregation itself. All in one query.
The tricky part isn’t really even those 3 technical concepts. It’s organizing your code into logical steps -- as you can see there’s a subquery to split out the logic.
And there’s also a data type conversion from an integer to a float so that your output can be between 0 and 1.
These are the little nuances you need to know in addition to these 3 concepts. Lots of people forget these small things, especially if you’re out of practice or nervous.
JOINs, subqueries, and CTEs. Again, 3 different concepts but they’re always tested together to solve one problem.
Here’s an example of what I mean. This question is from Credit Karma (https://platform.stratascratch.com/coding-question?id=2003&python=)
Write a query that joins this submissions table to the loans table and returns the total loan balance on each user’s most recent ‘Refinance’ submission.
Balance Per User on Recent Refinance Submission
Write a query that joins this submissions table to the loans table and returns the total loan balance on each user’s most recent “Refinance” submission.
Here’s the SQL code solution:
SELECT l.user_id, SUM (balance) FROM (SELECT DISTINCT user_id, Min(created_at) OVER (PARTTION BY user_id, TYPE)
Looks pretty complicated doesn’t it? It even includes other advanced concepts I’m not even covering. But you can see there’s a JOIN statement and a subquery (which could be re-written as a CTE, if you wanted to do that).
What’s being tested is if you can blend two sources of data together and aggregate the data from one table, “balance” with a variable from another table “user_id”. Remember, a successful interviewee will talk through the entire logic and why they’re doing what they’re doing.
While we’re talking about subqueries, they can be found in multiple places -- not just at the JOIN statement like in the last question. Subqueries in the WHERE clause are also really common concepts tested and actually, it’s a question that most people get wrong.
Here’s a question from Ring Central (https://platform.stratascratch.com/coding-question?id=2017&python=)
How many paid users had any calls in Apr 2020?
Here’s the solution:
SELECT count (DISTINCT user_id) FROM rc_calls WHERE date BETWEEN “2020-04-01” AND “2020-04-30” AND user_id in (SELECT user_id FROM rc_users WHERE status = 'paid')
Subqueries in the WHERE clause allow you to quickly grab information from one table and use it as a filtering condition in the main query, like we are seeing here. So in this example, we’re grabbing the user_id of paid users from the user table and using that as a condition to find how many calls they placed in the calls table.
In my experience, not many people would solve this problem in this way. Let me know if you have another approach to this problem. I’ll create a video in the next few weeks that covers all the approaches to solving this problem.
The 4th coding concept tested in data science interviews so far in 2021 are window functions. Window functions are heavily utilized in data science work. I’ve talked about them extensively in past videos. One example of this is a question from Twitch.
Streamers Sessions By Users Who Had First Session As Viewer
From the users who had their session as a viewer, how many streamer sessions have they had? Return users id and number of sessions in
descending order. In case there are users with the same number of sessions, order them by ascending user id.
Here’s the solution:
SELECT user_id, count (*) n_sessions FROM twitch_sessions WHERE session_type = 'streamer' AND user_id in (SELECT user_id FROM (SELECT user_id, session_type, rank() OVER (PARTITION BY user_id ORDER BY session_start) streams_order FROM twitch_sessions) s1 WHERE streams_order = 1 AND session_type = 'viewer' ) GROUP BY user_id ORDER BY n_sessions DESC, User_id ASC
This one’s insane. And yes it did get asked in an interview. But don’t worry, in the upcoming weeks, I’ll be stepping through the solution for each of the questions I just covered.
These questions take a lot of time to solve and they’re typically found in the later rounds (1) because they’re so much code but (2) because there’s so much involved that it just takes time to talk through it.
Lastly, date manipulations. This one is always asked on interviews because everyone in analytics from an analyst to a data scientist works with dates. It would be weird not to. Because you’re usually given a complete day, year-month-day-timestamp, but asked to aggregate by weeks, months, or years, you’ll need to know how to manipulate dates.
Here’s one from Postmates (https://platform.stratascratch.com/coding-question?id=2014&python=)
Which hour of the day averages the highest order volume?
Here’s the solution:
SELECT date_part ('hour', order_timestamp_utc) AS HOUR, Count (id) n_orders FROM posmates_orders GROUP BY date_part ('hour', order_timestamp_utc) ORDER BY count(id) DESC
This one requires you to extract the hour from the datetime in order to find the highest order volume. So it’s really just a matter of knowing these datetime functions, like date_part().
And as you probably have guessed, I have a few videos on that too. Check them out in the description.
Did you think these questions were difficult to answer? If so, I’m going to step through all the solutions in my next Youtube videos.
I also have a link to all the questions in the video and resources I’ve put together on these technical topics.
Remember, these are interview questions that were asked in the first few weeks/months of 2021, so many companies are definitely requiring you to master these 5 technical concepts.
I hope this was helpful. Let’s start with these 5 questions in the next few videos on my YouTube channel and see how we can solve them together.
Until then guys, keep practicing and learning. Hope this is helpful. Watch the video here:
Create your free account to unlock your custom reading experience.