Over the past year, I’ve conducted around 50 interviews for MLE roles and noticed a clear trend: candidates often make similar mistakes. So I thought I’d put together a quick guide on the areas to focus on and gain experience in. One of the biggest components of an MLE interview is ML System Design, so that’s where we’ll focus most of our attention.

What exactly do I want to hear from a candidate?

Asking the right questions about business needs. Understanding them can save months of work in the wrong direction. There’s nothing more pointless than working hard on something useless.

Understanding each step of system design and describing them in the correct order. Diving straight into model training, then realizing you forgot to ask about the available data, comes across as messy and disorganized.

Justifying the choice of a particular method or technology. This is especially important nowadays, when everyone tries to use LLMs for everything; if you can’t explain why your approach is the best fit (cost-effective, reliable, etc.), you may come across as unprofessional.

What are the steps involved in designing an ML system?

The task is usually presented broadly, from a business perspective, such as: “A B2B customer wants to automatically build clothing outfits for dogs on their marketplace.”

Clarifying the Task

Instead of diving straight into solving the problem, it’s better to ask as many clarifying questions as possible.

What is the goal of this system? What’s the business metric we’re optimizing? By agreeing early on the metric and the segment of users we’ll use to measure it, we can avoid disappointing clients who might expect millions of conversions from a feature that only a few people find useful.

What are the functional requirements? What exactly should the service do for users? Should it recommend outfits for specific dog breeds or suggest options based on a photo of their lovely dog? You might think it’s not your job to ask these questions, but the more you understand the task, the better the solution you can build.

Any restrictions on response time or resources (CPU/RAM)? Real-time systems always require strict response times. High load greatly impacts the tools and models we choose. For instance, using huge generative models in a high-load system could either drive up costs or drive away clients who have to wait too long for a response.

Data

No data, no model, so make sure to give this step the attention it deserves.

What data is available? Start by identifying what data is available for your project. For instance, an online store might have user behavior logs that track purchases, clicks, and browsing patterns. For example, when people buy clothes for their dogs on Halloween, they often purchase multiple items in one session; this information can help train a model to generate outfit suggestions. If such data isn’t available, you can label data or find open-source datasets. Human annotators can build a dataset, but that’s very expensive; thanks to generative models, you can instead use a large model to create a quality dataset for training smaller, faster models.

How should the data be preprocessed? Before diving into model training, preprocess the data to ensure it’s clean and structured correctly. This step may include handling missing values, normalizing or scaling features, and removing irrelevant or duplicate data.
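As a minimal sketch of what that preprocessing might look like in Python with pandas and scikit-learn (the file and column names here are purely illustrative):

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical behavior-log table; the schema is made up for illustration.
df = pd.read_parquet("user_events.parquet")

# Remove exact duplicates and rows missing the target label.
df = df.drop_duplicates()
df = df.dropna(subset=["purchased"])

# Impute remaining missing numeric values, then scale the features.
numeric_cols = ["price", "session_length_sec", "items_viewed"]
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
df[numeric_cols] = numeric_pipeline.fit_transform(df[numeric_cols])
```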
How to split data? Consider the structure of your data, and avoid random splitting when there’s a time dependency. For example, training a model on this year’s stock prices and validating it on last year’s would leak future information into training and give misleadingly optimistic results.
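A minimal sketch of a chronological split, assuming the data carries a timestamp column:

```python
import pandas as pd

def time_split(df: pd.DataFrame, ts_col: str, train_frac: float = 0.8):
    """Split chronologically: everything before the cutoff trains,
    everything after validates. No shuffling, so no future leakage."""
    df = df.sort_values(ts_col)
    cutoff = int(len(df) * train_frac)
    return df.iloc[:cutoff], df.iloc[cutoff:]

# train_df, valid_df = time_split(df, ts_col="event_time")
```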
Model Training

See? Only after all those steps to clarify the details do we finally get to model training. In the real world, it works the same way: you need to check everything carefully before diving into solutions.

What will be your baseline? Don’t jump straight to the fanciest models. In the real world, it’s better to iterate quickly to test your hypotheses, so start with a solid baseline.
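For instance, a baseline can be as trivial as “predict the most frequent class,” followed by a simple linear model; anything fancier should beat both. A sketch, assuming train/validation arrays from the split above:

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Trivial baseline: always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("baseline F1:", f1_score(y_valid, baseline.predict(X_valid)))

# First "real" model: a simple linear classifier that should beat the baseline.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("logreg F1:", f1_score(y_valid, model.predict(X_valid)))
```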
Which features can we use, and how should we represent them? It’s essential to understand encoding methods for each type of data, such as categorical, numerical, text, and images, and how to combine them effectively. Many real-world tasks involve different modalities. For instance, a product details page in a marketplace may include text (title, description, reviews), images, categorical data (color, size), and numerical data (rating, price).
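A sketch of combining such modalities with scikit-learn (the column names are illustrative; image features would come from a separate embedding model):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative columns for a product page; adapt to the real schema.
encoder = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color", "size"]),
    ("num", StandardScaler(), ["price", "rating"]),
    ("txt", TfidfVectorizer(max_features=5000), "title"),
])
# X = encoder.fit_transform(products_df)
# Image features would typically be embeddings from a pretrained CNN/ViT,
# concatenated with the output above.
```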
Which loss should you use? Different tasks require different loss functions, so it’s important to understand each one’s weak spots and how to address them.

Which offline metric should we choose to best match business needs? In experiments, offline metrics help you find the best approach. They might not fully match business goals, but they can be a close guide. Pick several different metrics to get a complete picture of the model’s quality.
How to handle imbalance, if expected? Imbalance is common in classification tasks. You can address it with oversampling, data augmentation, or weighted loss functions.
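For example, class weighting is often the cheapest fix. A sketch of two common variants, assuming `y_train` from earlier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Option 1: let the model reweight its loss per class automatically.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# Option 2: compute explicit weights, e.g. to pass into another framework.
classes = np.unique(y_train)
weights = compute_class_weight("balanced", classes=classes, y=y_train)
# In PyTorch this could become, e.g.:
# torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor(weights[1] / weights[0]))
```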
How to handle outliers and poor labeling quality? In some cases, leaving outliers in the dataset is fine, as a robust model may learn to ignore them. You can also use anomaly detection or large generative models to check label quality.

How to handle changing distributions due to seasonality, global events, or holidays? First, avoid training your model only on data from specific periods like Black Friday or Christmas if you plan to use it year-round. This alone covers 80% of issues; later, you can explore more advanced solutions if needed.
Model Inference

Once your model is trained, think about how inference will be handled most effectively.

How will inference be done? This depends largely on the task: does it need to be real-time, instantly returning results to the customer, or is a wait acceptable? In real-time systems, processing usually happens item by item, whereas in offline settings, batching can speed things up. In complex systems, you might blend both approaches. For instance, building user vectors and storing them in a feature store can be done offline, while retrieving these vectors and feeding them into the model happens in real time for each customer.

What information should be precomputed and accessed during inference? Precomputing can minimize latency; typically, anything reusable should be precomputed. For example, user vectors for personalized search can be generated in advance. These vectors aren’t static, since users’ actions are continuously changing, but adjusting the update frequency lets you balance responsiveness with performance.

How to speed up inference? Latency isn’t just about the model; other system components also play a significant role, and sometimes optimizing them brings greater speed improvements than model tweaks alone. Along with precomputing, caching entire user requests and responses can avoid repeated execution of any model components, further boosting efficiency.
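A minimal sketch of both ideas with Redis, assuming precomputed user vectors are written by an offline job; the key scheme and vector size are made up for illustration:

```python
import json
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379)

def get_user_vector(user_id: str) -> np.ndarray:
    """Fetch a precomputed user embedding; fall back to a cold-start default."""
    raw = r.get(f"user_vec:{user_id}")          # key scheme is illustrative
    if raw is None:
        return np.zeros(128, dtype=np.float32)  # cold-start user
    return np.frombuffer(raw, dtype=np.float32)

def cached_response(request_key: str, compute_fn, ttl_sec: int = 300):
    """Serve a cached response if present; otherwise compute and cache it."""
    hit = r.get(f"resp:{request_key}")
    if hit is not None:
        return json.loads(hit)
    result = compute_fn()
    r.setex(f"resp:{request_key}", ttl_sec, json.dumps(result))
    return result
```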
Model Deployment

Your model’s all trained up and ready to make customers happy, but how do you get it out of the cozy Jupyter Notebook and into the real world where people can actually use it?

How to serve the model? This usually involves converting the model to ONNX, building a service with Flask or FastAPI, packaging it in Docker, and deploying it on AWS, Google Cloud, or Kubernetes.
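A bare-bones sketch of that serving path with FastAPI and onnxruntime; the model path is illustrative, and the code assumes a single 2-D output:

```python
import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
session = ort.InferenceSession("model.onnx")  # exported model; path is illustrative

class Request(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: Request):
    x = np.asarray([req.features], dtype=np.float32)
    input_name = session.get_inputs()[0].name
    (scores,) = session.run(None, {input_name: x})  # assumes one output tensor
    return {"score": float(scores[0][0])}

# Run with: uvicorn service:app --host 0.0.0.0 --port 8000
```

From there, the Dockerfile and deployment manifests are mostly boilerplate around this service.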
What other components are needed for the model to work? In real-time systems such as ranking or recommendations, you might need to retrieve user data on the fly. For this, it’s helpful to know about feature stores and external data stores like Redis, DynamoDB, Postgres, and S3.

How to test the model before deploying? While you can’t always fully measure model performance offline, you can still set up tests to ensure it meets basic guardrail metrics. For instance, you could check that precision/recall on a golden dataset doesn’t fall below a certain threshold. And, of course, remember to add unit and integration tests.
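Such a guardrail check can live in CI as an ordinary test. A sketch, where `load_golden_dataset` and `load_candidate_model` are hypothetical helpers and the thresholds are illustrative:

```python
# test_guardrails.py -- run in CI before every deployment
from sklearn.metrics import precision_score, recall_score

PRECISION_FLOOR = 0.80  # illustrative; set from business needs
RECALL_FLOOR = 0.60

def test_golden_dataset_guardrails():
    X, y = load_golden_dataset()          # assumed helper for the labeled set
    preds = load_candidate_model().predict(X)
    assert precision_score(y, preds) >= PRECISION_FLOOR
    assert recall_score(y, preds) >= RECALL_FLOOR
```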
How can you update your model in production without downtime? Use deployment strategies like canary or rolling deployments, which gradually move traffic to the new model once it’s ready.

How to scale for high loads? The most common approach is to scale horizontally: create more instances of your service and use a load balancer to distribute the load.

Service Monitoring

Deploying your service is only halfway there. Unexpected issues will come up, so it’s important to be notified of problems and react quickly. You can choose any platform for monitoring, though Prometheus is a popular option. Here are some of the most commonly tracked metrics (the list isn’t exhaustive):

Request failure rate. The number of failing user requests is a key metric; set strict limits so you catch problems early and respond quickly to avoid big losses.

Latency. In real-time systems, every millisecond matters because it affects the user experience and can impact revenue. It’s helpful to track p90, p95, and p99 latencies to spot slowdowns early.

Resource utilization. If instances run out of memory or CPU usage gets too high, it can cause downtime and revenue loss. Scaling up quickly to handle increased traffic isn’t always easy, so ensure there’s enough headroom to absorb spikes, giving you time to scale when needed. The challenge is balancing the extra cost of larger instances against the risk of resource shortages.
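A sketch of instrumenting the first two of these with the prometheus_client library; `predict` stands in for your actual serving function:

```python
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("requests_total", "Total prediction requests")
FAILURES = Counter("request_failures_total", "Failed prediction requests")
LATENCY = Histogram("request_latency_seconds", "End-to-end request latency")

@LATENCY.time()  # records each call's duration into the histogram
def handle(request):
    REQUESTS.inc()
    try:
        return predict(request)  # your serving function (assumed)
    except Exception:
        FAILURES.inc()
        raise

start_http_server(9100)  # exposes /metrics for Prometheus to scrape
```

Percentiles like p95 and p99 are then derived on the Prometheus side from the histogram buckets, e.g. with `histogram_quantile`.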
Calculating Online Metrics

Before rolling out a new model to all users, it’s essential to run A/B tests to ensure the change not only avoids breaking anything but, ideally, boosts key metrics.

Which metrics to calculate? Define metrics based on business goals and make sure they cover two areas: primary metrics (e.g., click-through rate, conversion rate, revenue) that reflect the feature’s intended impact, and guardrail metrics that ensure no unintended harm to the user experience or business. Guardrail metrics act as safety checks and might trigger alerts or even stop the test if they cross critical thresholds. For instance, while you are trying to boost purchases in an online store, a huge drop in revenue could signal a critical issue.

Which segment of users to choose for metrics calculation? In addition to understanding how your system performs across all users (the global segment), focus on the segment most likely to show a measurable impact. For example, if you’ve implemented a responsive-design feature that primarily affects mobile users, concentrate your metric calculations on that segment to increase the chances of detecting an effect.

Which statistical criteria to use, and how long to run the test? Use a t-test when your data is normally distributed with equal variances, or Welch’s t-test when the variances between groups differ. Calculate the required sample size from your significance level (e.g., 0.05) and desired power (e.g., 80%), then run the test until you reach that sample size.
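A sketch of both steps with scipy and statsmodels; the effect size is an assumption you must estimate, and `control_metric` / `treatment_metric` are assumed per-user metric arrays:

```python
from scipy.stats import ttest_ind
from statsmodels.stats.power import tt_ind_solve_power

# Required sample size per group for a small effect (Cohen's d = 0.1),
# alpha = 0.05, power = 0.8.
n_per_group = tt_ind_solve_power(effect_size=0.1, alpha=0.05, power=0.8)
print(f"need ~{n_per_group:.0f} users per group")

# Once the test has collected enough data, compare the groups.
# equal_var=False gives Welch's t-test, robust to unequal variances.
stat, p_value = ttest_ind(control_metric, treatment_metric, equal_var=False)
if p_value < 0.05:
    print("statistically significant difference")
```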
Top mistakes candidates make during interviews

I’ve found that candidates tend to make common mistakes, and often it’s not just about their knowledge.

Lack of structure. Candidates often jump between topics, moving from models to data, then to deployment, without considering business needs. Following a clear structure, like the one I provide in mock and real interviews, makes your answers more organized and makes it easier for the interviewer to follow your thought process.
Lack of confidence. Nerves can really impact performance. While I try to help candidates feel comfortable, other interviewers may not be as supportive. Regular mock interviews, ideally once a week with an experienced mentor, can help you build confidence and reduce anxiety.

Lack of broad knowledge. Being an expert in your field is crucial, but when transitioning to new tasks or roles, it’s important to have at least a basic understanding of the new area. Explore the company’s engineering blogs, courses, and research papers to broaden your knowledge and prepare more effectively.

What should you read or go through before an interview?

If you only have a few days, try taking the “Grokking the Machine Learning Interview” course. For a deeper dive, read “Designing Machine Learning Systems” by Chip Huyen.

Additional resources

ML monitoring metrics
How to monitor models in production
Google’s best practices for ML

Conclusion

Remember, there’s no silver bullet when it comes to acing interviews. Try out real interviews as early as possible to build your confidence and improve your skills.