Over the past year, I’ve conducted around 50 interviews for MLE roles and noticed a clear trend: candidates often make similar mistakes. So, I thought I’d put together a quick guide on the areas to focus on and gain experience in. One of the biggest components of an MLE interview is ML System Design, so that’s where we’ll focus most of our attention.
Asking the right questions about business needs
Understanding them can save months of work in the wrong direction. There’s nothing more pointless than working hard on something useless.
Understanding each step of system design and describing the steps in the correct order
Diving straight into model training and only then realizing you forgot to ask about the available data comes across as messy and out of order.
Justifying the choice of a particular method or technology
It’s especially important nowadays when everyone tries to use LLMs for everything, but if you can’t explain why this approach is the best (cost-effective, reliable, etc.), it might make you sound unprofessional.
The task is usually presented broadly from a business perspective, such as: “A B2B customer wants to automatically build clothing outfits for dogs on their marketplace.”
Instead of diving straight into solving the problem, it’s better to ask as many clarifying questions as possible.
What is the goal of this system? What’s the business metric we’re optimizing?
By agreeing early on the metric and the segment of users we’ll use to measure it, we can avoid disappointing clients who might expect millions of conversions from a feature that only a few people find useful.
What are the functional requirements?
What exactly should the service do for users? Should it recommend outfits for specific dog breeds or suggest options based on a photo of their lovely dog? You might think it’s not your job to ask these questions, but the more you understand the task, the better solution you can build.
Any restrictions on response time or resources (CPU/RAM)?
Real-time systems always require strict response times. High load greatly impacts the tools and models we choose. For instance, using huge generative models in a high-load system could lead to either high costs or cause clients to leave if they have to wait too long for a response.
No data, no model, so make sure to give it the attention it deserves.
What data is available?
Start by identifying what data is available for your project. For instance, an online store might have user behavior logs that track purchases, clicks, and browsing patterns. For example, when people buy clothes for their dogs on Halloween, they often purchase multiple items in one session. This information can help train a model to generate outfit suggestions.
If this information is not available, you can use data labeling or find open-source datasets. While human annotators can build a dataset, it’s very expensive. Thanks to generative models, you can use large models to create a quality dataset to train smaller, faster models.
How should the data be preprocessed?
Before diving into model training, preprocess the data to ensure it’s clean and structured correctly. This step may include handling missing values, normalizing or scaling features, and removing any irrelevant or duplicate data.
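As a minimal sketch of these steps (the column names and toy values are made up), a pandas-based cleanup might look like:

```python
import pandas as pd

# Toy order log with the usual problems: missing values, duplicate rows,
# and an unscaled numeric column (all names here are hypothetical).
df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "price":   [19.99, 19.99, None, 45.0, 45.0],
    "breed":   ["pug", "pug", "corgi", None, None],
})

df = df.drop_duplicates()                                # remove exact duplicate rows
df["price"] = df["price"].fillna(df["price"].median())   # impute missing numerics
df["breed"] = df["breed"].fillna("unknown")              # missing category -> explicit bucket

# Min-max scale the numeric feature into [0, 1]
df["price"] = (df["price"] - df["price"].min()) / (df["price"].max() - df["price"].min())

print(df.isna().sum().sum())  # -> 0, no missing values remain
```

Real pipelines would fit imputation and scaling statistics on the training split only, then apply them to validation data.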
How to split data?
Consider the structure of your data. Avoid random splitting when there’s a time dependency. For example, training a model on this year’s stock prices and validating on last year’s would give misleadingly optimistic results, because the model effectively peeks into the future.
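A time-based split can be as simple as sorting by timestamp and cutting at a date (the records below are hypothetical placeholders):

```python
from datetime import date

# Hypothetical (timestamp, features) records; in practice this would be a DataFrame.
records = [
    (date(2023, 1, 5), "..."),
    (date(2023, 6, 1), "..."),
    (date(2023, 9, 9), "..."),
    (date(2024, 2, 2), "..."),
    (date(2024, 7, 7), "..."),
]

def time_split(rows, cutoff):
    """Train on everything before the cutoff, validate on everything after."""
    rows = sorted(rows, key=lambda r: r[0])
    train = [r for r in rows if r[0] < cutoff]
    valid = [r for r in rows if r[0] >= cutoff]
    return train, valid

train, valid = time_split(records, date(2024, 1, 1))
# Every training example precedes every validation example:
assert max(r[0] for r in train) < min(r[0] for r in valid)
```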
See? Only after all those clarifying steps do we finally get to model training. In the real world, it works the same way: you need to check everything carefully before diving into solutions.
What will be your baseline?
Don’t jump straight to the fanciest models. In the real world, it’s better to iterate quickly to test your hypotheses, so start with a solid baseline.
Which features can we use, and how should we represent them?
It’s essential to understand encoding methods for each type of data, such as categorical, numerical, text, and images, and how to combine them effectively. Many real-world tasks involve different modalities. For instance, a product details page in a marketplace may include text (title, description, reviews), images, categorical data (color, size), and numerical data (rating, price).
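One common way to combine modalities is scikit-learn’s ColumnTransformer, which applies a different encoder per column type and stacks the results (the product rows below are invented):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical product rows mixing text, categorical, and numerical data.
items = pd.DataFrame({
    "title":  ["dog raincoat", "knitted dog sweater", "dog halloween cape"],
    "color":  ["yellow", "red", "black"],
    "price":  [19.99, 24.50, 15.00],
    "rating": [4.5, 4.8, 4.1],
})

encoder = ColumnTransformer([
    ("text", TfidfVectorizer(), "title"),                        # text -> TF-IDF
    ("cat",  OneHotEncoder(handle_unknown="ignore"), ["color"]), # categories -> one-hot
    ("num",  StandardScaler(), ["price", "rating"]),             # numbers -> standardized
])

X = encoder.fit_transform(items)
print(X.shape)  # one combined feature matrix for all modalities
```

For images you would typically swap TF-IDF for embeddings from a pretrained vision model, concatenated the same way.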
Which loss should you use?
Different tasks require different loss functions, so it’s important to understand each one’s weak spots and how to address them.
Which offline metric should we choose to best match business needs?
In experiments, offline metrics can help you find the best approach. They might not fully match business goals but can be a close guide. Pick different metrics to get a complete picture of the model’s quality.
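For example (with made-up labels and scores), looking at precision, recall, and ROC AUC together gives a fuller picture than any single number:

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Hypothetical labels and model scores from an offline experiment.
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3]
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]

# No single number tells the whole story, so track several metrics together.
print("precision:", precision_score(y_true, y_pred))   # -> 0.75
print("recall:   ", recall_score(y_true, y_pred))      # -> 0.75
print("roc_auc:  ", roc_auc_score(y_true, y_score))    # -> 0.9375
```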
How to handle imbalance if expected?
Imbalance is common in classification tasks. You can address it with oversampling, data augmentation, or weighted loss functions.
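The weighted-loss idea boils down to inverse-frequency class weights, the same formula behind scikit-learn’s class_weight="balanced" (the 90/10 split below is synthetic):

```python
import numpy as np

# Hypothetical imbalanced labels: 90% negatives, 10% positives.
y = np.array([0] * 90 + [1] * 10)

# Inverse-frequency weights: n_samples / (n_classes * class_count)
classes, counts = np.unique(y, return_counts=True)
weights = len(y) / (len(classes) * counts)
print(dict(zip(classes.tolist(), weights.tolist())))
# The rare class gets a much larger weight, so each positive mistake
# contributes more to a weighted loss than a negative one.
```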
How to handle outliers and poor labeling quality?
In some cases, leaving outliers in the dataset is fine, as a robust model may learn to ignore them. You can also use anomaly detection or large generative models to check label quality.
How to handle changing distributions due to seasonality, global events, or holidays?
First, avoid training your model only on data from specific times like Black Friday or Christmas if you plan to use it year-round. This alone covers 80% of issues. Later, you can explore more advanced solutions if needed.
Once your model is trained, think about how inference will be handled most effectively.
How will inference be done?
This depends largely on the task: does it need to be real-time, instantly returning results to the customer, or is a wait acceptable? In real-time systems, processing usually happens item by item, whereas in offline settings, batching can speed up processing. In complex systems, you might blend both approaches. For instance, building user vectors and storing them in a feature store can be done offline, but retrieving these vectors and incorporating them into the model would occur in real-time for each customer.
What information should be precomputed and accessed during inference?
Precomputing can minimize latency. Typically, anything reusable should be precomputed. For example, user vectors for personalized search can be generated in advance. Although these vectors aren’t static, since users’ actions are continuously changing, adjusting the frequency of updates allows you to balance responsiveness with performance.
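A bare-bones version of this pattern (the class and method names are illustrative, not any specific feature-store API) separates the expensive offline refresh from the cheap online lookup:

```python
import time

# A minimal in-memory "feature store": user vectors are rebuilt offline
# on a schedule, while inference only does a dictionary lookup.
class UserVectorStore:
    def __init__(self, refresh_seconds=3600):
        self.refresh_seconds = refresh_seconds
        self._vectors = {}
        self._last_refresh = 0.0

    def refresh(self, batch_job_output):
        """Called by the offline batch job, e.g. once an hour."""
        self._vectors = dict(batch_job_output)
        self._last_refresh = time.time()

    def get(self, user_id, default=None):
        """Called in the real-time path: O(1), no model work."""
        return self._vectors.get(user_id, default)

store = UserVectorStore()
store.refresh({"user_42": [0.1, 0.9, 0.3]})  # offline, periodic
vector = store.get("user_42")                # online, per request
print(vector)
```

Tuning refresh_seconds is exactly the responsiveness-versus-performance trade-off described above.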
How to speed up inference?
Latency isn’t just about the model: other system components also play a significant role, and sometimes optimizing them brings greater speed improvements than model tweaks alone. Along with precomputed components, caching entire user requests and responses can reduce the need for repeated execution of any model elements, further boosting efficiency.
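In its simplest in-process form, request-level caching is just memoization; here score_outfit is a hypothetical stand-in for an expensive model call:

```python
from functools import lru_cache

# Caching whole request results: identical requests skip the model entirely.
@lru_cache(maxsize=10_000)
def score_outfit(user_id: str, outfit_id: str) -> float:
    # ... imagine feature lookup + model inference here ...
    return hash((user_id, outfit_id)) % 100 / 100

score_outfit("u1", "o7")          # first call: runs the "model"
score_outfit("u1", "o7")          # repeat call: served from the cache
print(score_outfit.cache_info())  # hits=1, misses=1
```

In a multi-instance deployment, the same idea moves to a shared cache such as Redis, keyed by the serialized request.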
Your model’s all trained up and ready to make customers happy, but how do you get it out of the cozy Jupyter Notebook and into the real world where people can actually use it?
How to serve the model?
This usually involves converting the model to ONNX, building a service with Flask or FastAPI, packaging it in Docker, and deploying on AWS, GCloud, or Kubernetes.
What other components are needed for the model to work?
In real-time systems such as ranking or recommendations, you might need to retrieve user data on the fly. For this, it’s helpful to know about feature stores and external data sources like Redis, DynamoDB, Postgres and S3.
How to test the model before deploying?
While you can’t always fully measure model performance offline, you can still set up tests to ensure it meets basic guardrail metrics. For instance, you could check that precision/recall on a golden dataset doesn’t fall below a certain threshold. And, of course, remember to add unit and integration tests.
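Such a guardrail check can be an ordinary test that CI runs before every deploy; the golden pairs and the threshold below are purely illustrative:

```python
# Fail the build if precision on a fixed "golden" dataset drops below a threshold.
GOLDEN = [  # (model_prediction, true_label) pairs
    (1, 1), (1, 1), (0, 0), (1, 0), (0, 0), (1, 1), (0, 1), (0, 0),
]
PRECISION_THRESHOLD = 0.7

def precision(pairs):
    tp = sum(1 for pred, true in pairs if pred == 1 and true == 1)
    fp = sum(1 for pred, true in pairs if pred == 1 and true == 0)
    return tp / (tp + fp)

def test_precision_guardrail():
    assert precision(GOLDEN) >= PRECISION_THRESHOLD

test_precision_guardrail()  # would normally run under pytest in CI
```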
How can you update your model in production without downtime?
This can be done through deployment strategies like canary or rolling deployments, which allow you to gradually move traffic to the new model once it’s ready.
How to scale for high loads?
The most common approach is to scale horizontally, creating more instances of your service and using a load balancer to distribute the load.
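A toy round-robin balancer shows the core idea: more instances, with requests rotated across them (the instance names are invented):

```python
from itertools import cycle

# Horizontal scaling: several identical instances of the model service,
# with a load balancer rotating requests across them.
instances = ["model-svc-a", "model-svc-b", "model-svc-c"]
next_instance = cycle(instances).__next__

routed = [next_instance() for _ in range(6)]
print(routed)
# -> ['model-svc-a', 'model-svc-b', 'model-svc-c',
#     'model-svc-a', 'model-svc-b', 'model-svc-c']
```

Production balancers (NGINX, Envoy, cloud load balancers) add health checks and smarter strategies, but the routing principle is the same.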
Deploying your service is only halfway there. Unexpected issues will come up, so it’s important to have a way to be notified of any problems and react quickly. You can choose any platform for monitoring, though Prometheus is a popular option. Here are some of the most commonly tracked metrics (though this list isn’t exhaustive):
Request failure rate
The number of user requests that fail is a key metric, so set strict limits to catch problems early and respond quickly to avoid big losses.
Latency
In real-time systems, every millisecond matters because it affects the user experience and can impact revenue. It’s helpful to track p90, p95, p99 metrics to spot slowdowns early.
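With the standard library alone you can pull p90/p95/p99 out of a latency sample (the latencies below are simulated, not real measurements):

```python
import random
from statistics import quantiles

# Simulated request latencies in milliseconds (synthetic data).
random.seed(0)
latencies = [random.gauss(mu=120, sigma=30) for _ in range(10_000)]

# quantiles(n=100) returns the 99 percentile cut points p1..p99.
cuts = quantiles(latencies, n=100)
p90, p95, p99 = cuts[89], cuts[94], cuts[98]
print(f"p90={p90:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
# The tail (p99) is what pages you: averages hide the slow requests.
```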
Resource utilization
If instances run out of memory or CPU usage gets too high, it can cause downtime and revenue loss. Scaling up quickly to handle increased traffic isn’t always easy, so it’s important to ensure there’s enough capacity to handle spikes, giving you time to scale when needed. The challenge is finding a balance between spending extra money on larger instances and risking resource shortages.
Often, before rolling out a new model to all users, it’s essential to conduct A/B tests to ensure the change not only avoids breaking anything but, ideally, boosts key metrics.
Which metrics to calculate?
Define metrics based on business goals and ensure they cover two areas: primary metrics (e.g., click-through rate, conversion rate, revenue) that reflect the feature’s intended impact, and guardrail metrics that ensure no unintended harm to the user experience or business. Guardrail metrics act as safety checks and might trigger alerts or even stop the test if they fall below critical thresholds. For instance, while you are trying to boost purchases in an online store, a huge drop in revenue could signal a critical issue.
Which segment of users to choose for metrics calculation?
In addition to understanding how your system performs across all users (the global segment), focus on the segment most likely to show a measurable impact. For example, if you’ve implemented a responsive design feature that primarily affects mobile users, concentrate your metric calculations on this segment to increase the chances of detecting an effect.
Which statistical criteria to use, and how long to run the test?
Use a t-test when your data is normally distributed with equal variances, or Welch’s t-test when the variances between groups differ. Calculate the required sample size based on your significance level (e.g., 0.05) and desired power (e.g., 80%), then run the test until you reach this sample size.
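The sample-size calculation is a standard formula; as a sketch using only the standard library (the effect size and noise level are example numbers):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(delta, sigma, alpha=0.05, power=0.80):
    """n per group for a two-sided, two-sample test to detect a mean
    difference `delta` given noise `sigma` (standard normal-approximation
    formula: n = 2 * ((z_{alpha/2} + z_{beta}) * sigma / delta)^2)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# Detecting a 0.2-sigma lift at alpha=0.05 with 80% power:
print(sample_size_per_group(delta=0.2, sigma=1.0))  # -> 393 per group
```

Dividing the required sample size by your daily traffic in each variant gives a rough test duration.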
I’ve found that candidates tend to make the same mistakes, and often it’s not just a matter of knowledge.
Lack of structure
Candidates often jump between topics, moving from models to data, then to deployment, without considering business needs. Following a clear structure, like the one I provide in mock and real interviews, can make your answers more organized and make it easier for the interviewer to follow your thought process.
Lack of confidence
Nerves can really impact performance. While I try to help candidates feel comfortable, other interviewers may not be as supportive. Regular mock interviews, ideally once a week with an experienced mentor, can help you build confidence and reduce anxiety.
Lack of broad knowledge
Being an expert in your field is crucial, but when transitioning to new tasks or roles, it’s important to have a basic understanding of the new area. Explore the company’s engineering blogs, courses, and research papers to broaden your knowledge and prepare more effectively.
If you only have a few days, try taking the “Grokking the Machine Learning Interview” course.
For a deeper dive, read “Designing Machine Learning Systems” by Chip Huyen.
Remember, there’s no silver bullet when it comes to acing interviews. Try out real interviews as early as possible to build your confidence and improve your skills.