Delight is a free, hosted, cross-platform monitoring dashboard for Apache Spark with memory and CPU metrics that will hopefully delight you! A year ago, we released a widely shared blog post: ! After a lot of engineering work, we’re proud to finally release , our free, hosted, cross-platform monitoring dashboard for . We’re building a better Spark UI Delight Apache Spark It works on top of Databricks, EMR, Dataproc, HDInsight, HDP/CDP, and open-source platforms too (anything spark-submit, or using the spark-on-kubernetes operator, or using Apache Livy). Delight helps you understand and improve the performance of your Spark applications. It gives you access to: CPU & Memory metrics aligned with your Spark jobs & stages The Spark UI itself (so you don't need to run the Spark History Server) Our vision and motivations for Delight It’s difficult to troubleshoot the performance of an application for the Spark UI. There’s useful information in the UI, but it’s buried under a lot of noise, and there’s a lot of tribal knowledge involved in understanding it. Moreover, the Spark UI does not show critical system metrics such as CPU Usage, Memory Usage, and I/O. For this, you have to use a general purpose observability tool (such as Ganglia). The problem with these tools is that they were not built for Spark. You’ll need to jump between metrics for each node, to try to get a rough estimate in your head, and then look at the timestamps to jump back and forth between them and the Spark UI. Our mission at is to make Apache Spark much more developer-friendly and cost-effective. This is why we built our managed platform (available on AWS, GCP, and Azure), which automates the infrastructure parameters and Spark configurations to make our customers pipelines stable and performant. Data Mechanics Spark-on-Kubernetes We also wanted to simplify the developer experience of data scientists and engineers, and the observability layer is a critical piece of this puzzle. We want to make it easy for anyone to understand the performance of their Spark code. This is why we set out to build Delight, an open-source monitoring dashboard for Spark. In June 2020, we shared featuring this . our vision for Delight in a widely shared blog post GIF Encouraged by hundreds of sign-ups and comments on our announcement, we set out to build Delight, which turned out to be a large-scale engineering effort. , we released an MVP consisting of a hosted dashboard giving you access to the Spark UI for your terminated applications (avoiding you the trouble to run and maintain your own Spark History Server). In November 2020 , we released Delight internally to our customers, and iterated on the feedback they gave, fixing many bugs and stability issues as we uncovered them. In Q1 2021 there still remains a lot of work to achieve our final vision. But Delight has already proven valuable to many, and so we’re excited to open it up to the Spark community! Today Architecture: Open-Source Agent + Hosted Backend Delight works for free on top of any Spark platform, including Databricks, Amazon EMR, Google Dataproc, or an open-source Spark-on-Kubernetes or Spark-on-YARN setup. It is compatible with running Spark applications using spark-submit, Apache Livy, or the Spark-on-Kubernetes operator. Delight consists of: An running inside your Spark applications, streaming Spark event metrics to the Delight backend. open-sourced agent A hosted backend responsible for collecting these metrics, parsing them, and hosting the dashboard at https://delight.datamechanics.co The metrics collected by Delight contain low-level metadata about your Spark application execution, like how much data is read/written by each task, how much memory & cpu is used. These metrics are called Spark event logs ( ), and they’re the same information that lets the Spark History Server play back the Spark UI of completed applications. view an example These metrics are protected using a secret access token you can generate in our dashboard, which uniquely identifies you. They are not shared with any third party, and they are automatically deleted from our servers after 30 days. An additional limit of 1,000 applications stored per organization applies (if you run more than 1,000 applications within 30 days, we’ll start deleting the oldest ones). Taking a tour of the main screens 1. The Main Dashboard When you log in to Delight, you see this table of the most recent Spark applications you ran. Note that live applications are not visible yet, they only appear a couple of minutes after their completion. Most of the information in this screen should be self-explanatory, except 3 columns: Spark tasks, CPU, and Efficiency. Before explaining them, here’s a quick reminder of how Spark distributed execution works. process (the “brain” of your application), and many processes. The driver is responsible for understanding your code and splitting it into , and . A stage is a set of tasks which can be executed in parallel on your executors. One executor core can run (at most) one Spark task at a time ; in other words, a Spark executor with 4 CPUs can run up to 4 tasks in parallel. In a Spark application, you have a single driver executor jobs stages tasks Let’s now define three columns: . This is the sum of the duration of all the tasks in your application. Spark Tasks Side-note: This is the metric that the Data Mechanics platform pricing is based on. . This is the sum of the lifetime duration of your Spark executors, multiplied by how many cores are allocated to each executors. In other words, this is your “Executor Cores uptime”. This metric is generally proportional to your infrastructure costs. CPU . This is the ratio Spark Tasks divided by CPU, expressed as a percentage. This represents the fraction of the time that your CPU resources are effectively used to run Spark tasks. 100% is ideal (perfect parallelism), 0% means your compute resources are wasted (you have containers running, but they’re idle). Efficiency This is one of the key metric we look at at Data Mechanics, as our platform has automation features (autoscaling, autotuning, …) to increase our customers efficiency, and therefore reduce their cloud costs (check this EMR to learn more about this). story of a customer who migrated away from 2. Executor Cores Usage Once you click on a specific application, you get to an app overview page with high-level statistics and a link to the Spark UI at the top. The next information you will find is the executor cores usage graph. The executor cores usage graph aggregates low-level CPU metrics measured by Spark during your application execution, into a simple visual overview. You first get a summary of your executor cores usage, broken down by the main categories: — Time spent actually computing the Spark task. CPU — Time spent waiting for data to be read or written (usually from network). I/O Note: Spark currently treats work done in a Python process as I/O (from the JVM point of view, this is I/O waiting for python to provide data). This happens for example when you execute python User-Defined Functions. Check this StackOverflow post to better understand how the JVM and Python work together in PySpark. — Time spent waiting for data blocks to be written or fetched during shuffle (data exchange phases across your Spark executors). Shuffle — Time spent doing by the JVM doing GC. For most (healthy) applications, this shouldn’t be more than a few percentage points. Garbage Collection — Time spent by Spark on overhead operations such as scheduler delay, task serialization and deserialization. Again, this should be small for most apps. Spark Internals — Time during which tasks are executed, but only on a portion of your executor cores. A high value there is the sign of imperfect parallelism. Some Cores Idle — Time during which no Spark tasks are running. This could be full idleness, which happens often e.g. when you use Spark in an interactive way, then take a break. Or this could mean that some non-Spark work is happening, for example you may be executing pure Python or Scala code on the driver, so all the executors are idle. All Cores Idle This information is then visible on a graph where the X-Axis is the timeline of your app, the Y-axis is the number of executor cores you have (which may vary over time if you enable dynamic allocation). Under this graph, you can see a timeline of your Spark jobs and stages. The main benefit of this graph is that you can very quickly focus your attention on the main performance bottleneck of your application. In this example, it’s clear that take up most of the application time. The large gray area indicates that most of the cores are idle during this time. If you click on “stage-1”, you will be taken to the Spark UI, so you can troubleshoot this further (in this example, the problem is that the stage only has 2 tasks). job-1/stage-1 3. Executors Peak Memory Usage This graph is only visible if you use Spark 3.0 or later, as it depends on new capabilities introduced in Spark 3.0 ( and ). SPARK-23429 SPARK-27189 While your app is running, Spark measures the memory usage of each executor. This graph reports the peak memory usage observed for your top 5 executors, broken down between different categories: Memory used by the (this is your maximum heap size) JVM Memory used by processes (if you’re using PySpark) Python Memory used by processes (if you’re using R, then R is currently listed as “Others”) other This graph should be particularly useful to PySpark users, who currently have no simple way of knowing how much memory their python processes consume. This graph could help you decide whether it’s safe to use a different type of instance / size of container than you currently use. It could also warn you that you are flirting with the maximum allowed memory usage. As a reminder, if your Python processes use too much memory, then your resource manager will kill your Spark executor (for example on Kubernetes, you will get the docker exit code 137). Conclusion: Try it out, and let us know your feedback! We hope Delight will be true to its word and help you have a delightful developer experience with Spark. We encourage you to try it out! , follow the installation instructions , and let us know your feedback over email (by replying to the welcome email) or using the live chat window in the product. Sign up on our github page We have ambitious plans for Delight. Our future roadmap includes: A page for each executor, and for the driver, to track their memory usage over time, with more detailed breakdown (Offheap memory usage) Automated performance tuning recommendations (e.g. alerts on data skew / inefficient data partitioning, …) Making Delight accessible in real-time while the app is running And more based on your feedback: let us know what you'd like to see! We will soon be publishing follow-up articles with concrete troubleshooting steps and Spark performance tuning sessions using Delight. We will also be giving meetups and conferences about it, including at the upcoming ). So stay tuned, and thank you for your support! Data + AI Summit Originally published at https://www.datamechanics.co .