Effective metrics and monitoring play a key role in high-quality development, bug fixes, and handling of user requests and incidents. At Social Discovery Group, we employ a diverse range of tools to evaluate the performance of our products.
In this article, we will share a few tricks that helped us build a stable application monitoring and metrics system and reduce bug detection time for our products with the help of Datadog.
SDG products help more than 250 million people connect and build relationships globally, and our user base is constantly growing. A key factor in this success is our ability to respond promptly to user needs. We have extensive experience with various monitoring systems, including Datadog, and here's why we rely on it:
- Datadog currently offers a wide variety of integration options with systems at different levels. You can explore these capabilities here.
- Datadog provides comprehensive documentation for configuring integrations with various services.
- We have built a strong partnership with the Datadog team.
Here, we're sharing our experience of setting up and configuring APM metrics for applications running in our Kubernetes cluster with Datadog. The article won't cover the deployment of projects in AKS, the CI/CD process, or other DevOps details.
Instead, we will focus on the finer points of configuring Datadog monitoring for APM metrics.
The technology stack: Azure services, Azure Kubernetes Service (AKS), ASP.NET Core 7, and Datadog.
For monitoring applications and services, we use the Datadog agent deployed in the cluster via a Helm chart, with additional parameters supplied through the values.yaml file. In the main agent configuration, you need to enable the DogStatsD module and specify its port (8125 by default).
To accept data sent from external hosts, the Datadog agent requires the following settings: dogstatsd_non_local_traffic: true and apm_non_local_traffic: true. Here is the values.yaml file for one of our clusters, with some variables substituted at the deployment stage; the deployment is performed with Azure DevOps, and the Helm step itself is sketched right after the file.
"
datadog:
  apiKey: #{apiKey}#
  appKey: #{appKey}#
  clusterName: #{ClusterName}#
  kubeStateMetricsEnabled: true
  clusterChecks:
    enabled: true
  dogstatsd:
    useSocketVolume: false
    nonLocalTraffic: true
  collectEvents: false
  apm:
    portEnabled: true # important to include
  env:
    - name: "DD_KUBELET_TLS_VERIFY"
      value: "false"
  systemProbe:
    collectDNSStats: false
  orchestratorExplorer:
    enabled: false
clusterAgent:
  image:
    repository: public.ecr.aws/datadog/cluster-agent
    tag: #{tag}#
  admissionController:
    enabled: false
agents:
  image:
    repository: public.ecr.aws/datadog/agent
    tag: #{tag}#
    doNotCheckTag: true
clusterChecksRunner:
  image:
    repository: public.ecr.aws/datadog/agent
    tag: #{tag}#
"
Then, you need to specify the agent's address for transmitting metrics to Datadog in the application settings. The dashboards for application monitoring are based on the metrics we use internally, so we had to build them ourselves.
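In our case these settings live in the service deployment manifests (shown further below), but for a quick experiment the address can be set and checked with a couple of commands; the deployment name, agent address, and metric name here are illustrative assumptions:
"
# Illustrative only: point a service at the agent through an environment variable
# (deployment name and agent address are placeholders)
kubectl set env deployment/service_name DD_AGENT_HOST=datadog-agent.monitoring
# Quick sanity check: push a test metric to the DogStatsD port (UDP 8125)
echo -n "sdg.test.metric:1|c" | nc -u -w1 datadog-agent.monitoring 8125
"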
To set up monitoring of ASP.NET services, we used the official documentation that can be found at the link.
Since the agent was already configured, one approach was to add the necessary lines to the image build and pass the variables DD_ENV, DD_SERVICE, and DD_AGENT_HOST through the CI/CD system to specify the environment, service name, and agent address, respectively. We also needed to add the following to the Dockerfiles of the services:
"
RUN TRACER_VERSION=$(curl -s https://api.github.com/repos/DataDog/dd-trace-dotnet/releases/latest | grep tag_name | cut -d '"' -f 4 | cut -c2-) \
&& curl -Lo /tmp/datadog-dotnet-apm.deb https://github.com/DataDog/dd-trace-dotnet/releases/download/v${TRACER_VERSION}/datadog-dotnet-apm_${TRACER_VERSION}_amd64.deb
# Copy the tracer from build target
COPY --from=build /tmp/datadog-dotnet-apm.deb /tmp/datadog-dotnet-apm.deb
# Install the tracer
RUN mkdir -p /opt/datadog \
&& mkdir -p /var/log/datadog \
&& dpkg -i /tmp/datadog-dotnet-apm.deb \
&& rm /tmp/datadog-dotnet-apm.deb
# Enable the tracer
ENV CORECLR_ENABLE_PROFILING=1
ENV CORECLR_PROFILER={846F5F1C-F9AE-4B07-969E-05C26BC060D8}
ENV CORECLR_PROFILER_PATH=/opt/datadog/Datadog.Trace.ClrProfiler.Native.so
ENV DD_DOTNET_TRACER_HOME=/opt/datadog
ENV DD_INTEGRATIONS=/opt/datadog/integrations.json
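# Illustrative sketch: one way to pass the environment, service name and agent address
# from the CI/CD system is via build arguments (variable values are placeholders)
ARG DD_ENV
ARG DD_SERVICE
ARG DD_AGENT_HOST
ENV DD_ENV=${DD_ENV} DD_SERVICE=${DD_SERVICE} DD_AGENT_HOST=${DD_AGENT_HOST}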
"
This method works, but it didn't seem like the most optimal solution. So we took it a step further and added the following to the deployment manifests of our services:
"
metadata.labels:
  tags.datadoghq.com/env: feature
  tags.datadoghq.com/service: service_name
  tags.datadoghq.com/version: '1552'
spec.template.metadata.labels:
  admission.datadoghq.com/config.mode: service
  admission.datadoghq.com/enabled: 'true'
  tags.datadoghq.com/env: feature
  tags.datadoghq.com/service: service_name
  tags.datadoghq.com/version: '1111'
spec.template.metadata.annotations:
  admission.datadoghq.com/dotnet-lib.version: v2.38.0
spec.template.spec.containers.name.env:
  - name: DD_TRACE_AGENT_URL
    value: datadog-agent.monitoring
  - name: DD_TRACE_STARTUP_LOGS
    value: 'true'
  - name: DD_LOGS_INJECTION
    value: 'true'
  - name: DD_RUNTIME_METRICS_ENABLED
    value: 'true'
  - name: DD_PROFILING_ENABLED
    value: 'true'
  - name: DD_APPSEC_ENABLED
    value: 'true'
"
Here is what was changed in the agent configuration:
"
datadog:
  apm:
    socketEnabled: true
    portEnabled: true
    enabled: true
clusterAgent:
  admissionController:
    enabled: true
    mutateUnlabelled: false
providers:
  aks:
    enabled: true
"
After all these steps, highly detailed metrics for all services started flowing into Datadog under the APM -> Services section, and the graphs were displayed automatically.
We had to tinker with the annotation settings for the second method; not everything worked smoothly right away.
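One way to check whether the admission controller has actually mutated a pod is to look for the injected init container and the Datadog environment variables; a rough sketch (pod name and namespace are placeholders):
"
# List the pod's init containers; library injection adds a Datadog init container
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.initContainers[*].name}'
# List the environment variable names of the first container and filter the Datadog ones
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].env[*].name}' \
  | tr ' ' '\n' | grep -E '^(DD_|CORECLR_)'
"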
As for the notification system, Datadog's alerting is user-friendly and intuitive. Notifications are created in the Monitors -> Manage Monitors section.
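Monitors can also be created through the Datadog API rather than the UI. A minimal sketch of a metric-based APM error monitor; the query, metric name, threshold, service name, and notification handle below are illustrative assumptions and need to be adapted to your own trace metrics:
"
# Illustrative sketch: create an APM error-rate monitor via the Datadog API
# (query, metric name, threshold, service and notification handle are assumptions)
curl -X POST "https://api.datadoghq.com/api/v1/monitor" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: ${DD_API_KEY}" \
  -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
  -d '{
        "name": "High error rate on service_name",
        "type": "query alert",
        "query": "sum(last_5m):sum:trace.aspnet_core.request.errors{service:service_name}.as_count() > 20",
        "message": "Error rate is above the threshold. @slack-alerts",
        "options": {"thresholds": {"critical": 20}}
      }'
"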
The improvements we described above yielded several valuable outcomes. We now have a deeper understanding of how our system operates and adapts to various changes.
Additionally, we've established a stable application monitoring and metrics system that operates independently of service builds, helping to reduce bug detection times.
This, in turn, has allowed us to optimize our services and improve development speed and overall system quality.