“Redis dies at 200k RPM, Prometheus can’t scrape 50 servers in time, and the business demands real-time dashboards. Sound familiar?”

Friday, 6:00 PM. Grafana shows timeouts while scraping metrics. Redis, used by prometheus_client_php, eats 8 GB of RAM and 100% CPU. Prometheus fails to scrape all 50+ servers within the 15-second window. And Black Friday launches on Monday…

This article is about how we switched from a pull to a push model for PHP monitoring in a highload project, why we chose UDP + Telegraf over the classical approach, and how we now collect metrics from 50+ servers without a single timeout.

## Architecture: Pull vs Push for PHP Metrics

### Why the Prometheus PHP Client Doesn’t Always Work for Highload

A typical scenario: you run a PHP Symfony application and need metrics. The first idea is prometheus_client_php. It’s a great library, but it comes with caveats:

```php
// Classic prometheus_client_php usage
$registry = new CollectorRegistry(new Redis());
$counter = $registry->getOrRegisterCounter('app', 'requests_total', 'Total requests');
$counter->inc(['method' => 'GET', 'endpoint' => '/api/users']);
```

**What happens under the hood:**

- Each metric is stored in Redis/APC/in-memory storage
- Prometheus periodically scrapes the /metrics endpoint
- On scrape, all metrics are read from storage

**Where the problems begin:**

- **Scaling:** With 50+ servers, Prometheus must scrape each one. This becomes a bottleneck.
- **Storage:** Redis adds latency; APC works only per server; in-memory storage dies on FPM restarts.
- **Configuration:** You must set up service discovery for all servers.
- **Performance:** At 200k RPM, every Redis call for a counter increment is overhead.

## The Solution: A Push Model with UDP for PHP Highload Monitoring

Instead, we send metrics via UDP to Telegraf, which then forwards them to Prometheus, InfluxDB, or other backends.

**Why UDP?**

- **Fire & forget:** No waiting for responses, no timeouts.
- **Minimal overhead:** Microsecond delivery.
- **Fault tolerance:** If Telegraf crashes, the app keeps running.
- **Simplicity:** No connection pools, retries, or circuit breakers.

**Important:** UDP may lose packets, but losing 0.01% of metrics won’t distort your dashboards.
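To make “fire & forget” concrete before we get to the bundle, here is a minimal sketch of a raw UDP push from plain PHP using the InfluxDB line protocol. The host and port are assumptions matching the Telegraf config shown later; this is an illustration of the idea, not the bundle’s internal code:

```php
<?php
// Minimal fire-and-forget sketch, assuming Telegraf listens on udp://127.0.0.1:8089.

function pushMetric(string $measurement, array $fields, array $tags = []): void
{
    $tagStr = '';
    foreach ($tags as $key => $value) {
        $tagStr .= ",{$key}={$value}";
    }

    $fieldPairs = [];
    foreach ($fields as $key => $value) {
        // Integers get the "i" suffix in line protocol, floats are sent as-is
        $fieldPairs[] = is_int($value) ? "{$key}={$value}i" : "{$key}={$value}";
    }

    // Line protocol: measurement,tag=value field=value
    $line = $measurement . $tagStr . ' ' . implode(',', $fieldPairs);

    // UDP socket: no handshake, no response, nothing to wait for
    $socket = @fsockopen('udp://127.0.0.1', 8089, $errno, $errstr, 0.1);
    if ($socket === false) {
        return; // metrics must never break the request
    }

    @fwrite($socket, $line);
    fclose($socket);
}

pushMetric('my_app_api_request', ['response_time' => 12.5, 'count' => 1], ['endpoint' => '/api/users']);
```

If the collector is down, the write simply goes nowhere; the request path never blocks. That is the whole trade-off in one function.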
## TelegrafMetricsBundle: Implementation

All of this is packaged in a Symfony bundle, TelegrafMetricsBundle, for sending metrics over UDP.

### Installation

```bash
composer require yakovlef/telegraf-metrics-bundle
```

Config (`config/packages/telegraf_metrics.yaml`):

```yaml
telegraf_metrics:
    namespace: 'my_app'
    client:
        url: 'http://localhost:8086'
        udpPort: 8089
```

### Bundle Architecture

Three core components:

```php
// MetricsCollectorInterface - the DI contract
interface MetricsCollectorInterface
{
    public function collect(string $name, array $fields, array $tags = []): void;
}

// Implementation with a UDP writer - built on the InfluxDB UDP writer
class MetricsCollector implements MetricsCollectorInterface
{
    private UdpWriter $writer;

    public function __construct(Client $client, private string $namespace)
    {
        $this->writer = $client->createUdpWriter();
    }

    public function collect(string $name, array $fields, array $tags = []): void
    {
        // Send the metric as an InfluxDB point
        $this->writer->write(
            new Point("{$this->namespace}_$name", $tags, $fields)
        );
    }
}
```

DI integration:

```yaml
services:
    Yakovlef\TelegrafMetricsBundle\Collector\MetricsCollectorInterface: '@telegraf_metrics.collector'
```

## Practical Use Cases

### 1. API Endpoint Monitoring

```php
class ApiController
{
    public function __construct(
        private MetricsCollectorInterface $metrics
    ) {}

    public function getUsers(): JsonResponse
    {
        $startTime = microtime(true);

        try {
            $users = $this->userRepository->findAll();
            $responseTime = (microtime(true) - $startTime) * 1000;

            $this->metrics->collect('api_request', [
                'response_time' => $responseTime,
                'count' => 1
            ], [
                'endpoint' => '/api/users',
                'method' => 'GET',
                'status' => '200'
            ]);

            return new JsonResponse($users);
        } catch (\Exception $e) {
            $this->metrics->collect('api_error', ['count' => 1], [
                'endpoint' => '/api/users',
                'error_type' => get_class($e),
                'status' => '500'
            ]);

            throw $e;
        }
    }
}
```
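Instrumenting every controller by hand gets tedious. If you want request metrics across the board, a small event subscriber can do it once. A sketch under the assumption that you listen on kernel.terminate; the subscriber class itself is hypothetical, only MetricsCollectorInterface comes from the bundle:

```php
use Symfony\Component\EventDispatcher\EventSubscriberInterface;
use Symfony\Component\HttpKernel\Event\TerminateEvent;
use Symfony\Component\HttpKernel\KernelEvents;
use Yakovlef\TelegrafMetricsBundle\Collector\MetricsCollectorInterface;

// Hypothetical subscriber: one metric per HTTP request, recorded after the response is sent.
final class RequestMetricsSubscriber implements EventSubscriberInterface
{
    public function __construct(private MetricsCollectorInterface $metrics) {}

    public static function getSubscribedEvents(): array
    {
        // kernel.terminate fires after the response is flushed,
        // so collecting metrics adds no user-facing latency
        return [KernelEvents::TERMINATE => 'onTerminate'];
    }

    public function onTerminate(TerminateEvent $event): void
    {
        $request = $event->getRequest();
        $duration = (microtime(true) - $request->server->get('REQUEST_TIME_FLOAT')) * 1000;

        $this->metrics->collect('http_request', [
            'response_time' => $duration,
            'count' => 1,
        ], [
            // Use the route name, not the raw URI, to keep tag cardinality low
            'route' => (string) $request->attributes->get('_route', 'unknown'),
            'method' => $request->getMethod(),
            'status' => (string) $event->getResponse()->getStatusCode(),
        ]);
    }
}
```

With autoconfiguration enabled, Symfony registers the subscriber automatically; explicit per-endpoint calls like the one above remain useful for business-specific fields.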
### 2. Business Metrics in E-commerce

```php
class OrderService
{
    public function createOrder(OrderDto $dto): Order
    {
        $order = new Order($dto);
        $this->em->persist($order);
        $this->em->flush();

        $this->metrics->collect('order_created', [
            'amount' => $order->getTotalAmount(),
            'items_count' => $order->getItemsCount(),
            'count' => 1
        ], [
            'payment_method' => $order->getPaymentMethod(),
            'currency' => $order->getCurrency(),
            'user_type' => $order->getUser()->getType()
        ]);

        return $order;
    }

    public function processPayment(Order $order): void
    {
        $startTime = microtime(true);

        try {
            $result = $this->paymentGateway->charge($order);

            $this->metrics->collect('payment_processed', [
                'amount' => $order->getTotalAmount(),
                'processing_time' => (microtime(true) - $startTime) * 1000,
                'count' => 1
            ], [
                'gateway' => $this->paymentGateway->getName(),
                'status' => 'success'
            ]);
        } catch (PaymentException $e) {
            $this->metrics->collect('payment_failed', [
                'amount' => $order->getTotalAmount(),
                'count' => 1
            ], [
                'gateway' => $this->paymentGateway->getName(),
                'error_code' => $e->getCode()
            ]);

            throw $e;
        }
    }
}
```

### 3. Background Job Monitoring

```php
class EmailConsumer implements MessageHandlerInterface
{
    public function __invoke(SendEmailMessage $message): void
    {
        $startTime = microtime(true);

        try {
            $this->mailer->send($message->getEmail());

            $this->metrics->collect('consumer_processed', [
                'processing_time' => (microtime(true) - $startTime) * 1000,
                'count' => 1
            ], [
                'consumer' => 'email',
                'status' => 'success',
                'priority' => $message->getPriority()
            ]);
        } catch (\Exception $e) {
            $this->metrics->collect('consumer_failed', ['count' => 1], [
                'consumer' => 'email',
                'error' => get_class($e)
            ]);

            throw $e;
        }
    }
}
```
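Per-handler instrumentation works, but if you run many consumers, a Symfony Messenger middleware can time every message type in one place. A sketch with a hypothetical middleware class; only MetricsCollectorInterface comes from the bundle:

```php
use Symfony\Component\Messenger\Envelope;
use Symfony\Component\Messenger\Middleware\MiddlewareInterface;
use Symfony\Component\Messenger\Middleware\StackInterface;
use Yakovlef\TelegrafMetricsBundle\Collector\MetricsCollectorInterface;

// Hypothetical middleware: measures every message instead of instrumenting each handler.
final class MetricsMiddleware implements MiddlewareInterface
{
    public function __construct(private MetricsCollectorInterface $metrics) {}

    public function handle(Envelope $envelope, StackInterface $stack): Envelope
    {
        $startTime = microtime(true);

        // Short class name as a tag keeps cardinality bounded by your message types
        $parts = explode('\\', get_class($envelope->getMessage()));
        $messageClass = end($parts);

        try {
            $envelope = $stack->next()->handle($envelope, $stack);

            $this->metrics->collect('message_handled', [
                'processing_time' => (microtime(true) - $startTime) * 1000,
                'count' => 1,
            ], ['message' => $messageClass, 'status' => 'success']);

            return $envelope;
        } catch (\Throwable $e) {
            $this->metrics->collect('message_failed', ['count' => 1], [
                'message' => $messageClass,
                'error' => get_class($e),
            ]);

            throw $e;
        }
    }
}
```

You would then list the middleware under your bus’s `middleware` option in `messenger.yaml`, and keep handler-level calls only where you need business-specific fields.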
### 4. Circuit Breaker Pattern

```php
class ExternalApiClient
{
    private int $failures = 0;
    private bool $isOpen = false;

    public function call(string $endpoint): array
    {
        if ($this->isOpen) {
            $this->metrics->collect('circuit_breaker', ['count' => 1], [
                'service' => 'external_api',
                'state' => 'open',
                'action' => 'rejected'
            ]);

            throw new CircuitBreakerOpenException();
        }

        try {
            $response = $this->httpClient->request('GET', $endpoint);
            $this->failures = 0;

            $this->metrics->collect('circuit_breaker', ['count' => 1], [
                'service' => 'external_api',
                'state' => 'closed',
                'action' => 'success'
            ]);

            return $response->toArray();
        } catch (\Exception $e) {
            $this->failures++;

            if ($this->failures >= 5) {
                $this->isOpen = true;

                $this->metrics->collect('circuit_breaker', ['count' => 1], [
                    'service' => 'external_api',
                    'state' => 'open',
                    'action' => 'opened'
                ]);
            }

            throw $e;
        }
    }
}
```

## Aggregation in Telegraf

Telegraf’s killer feature is built-in aggregation (basicstats). Instead of raw data flooding Prometheus, aggregation happens directly in Telegraf.
| Metric | Description | Use case |
|--------|-------------|----------|
| count | Number of values per period | Requests, errors, registrations |
| sum | Sum of values | Total revenue, processing time |
| mean | Arithmetic mean | Avg response time, avg basket size |
| min | Minimum | Min response time, smallest order |
| max | Maximum | Peak load, max response time |
| stdev | Standard deviation | Response time variability |
| s2 | Variance | A more sensitive variability metric |

Example telegraf.conf:

```toml
[[inputs.socket_listener]]
  service_address = "udp://:8089"
  data_format = "influx"

[[aggregators.basicstats]]
  period = "10s"
  drop_original = false
  stats = ["count", "mean", "sum", "min", "max", "stdev"]
  namepass = ["my_app_api_*"]

[[outputs.prometheus_client]]
  listen = ":9273"
  metric_version = 2
  path = "/metrics"
  metric_batch_size = 1000
  metric_buffer_limit = 10000
```

## Pitfalls and How to Avoid Them

### UDP Packet Loss (and Why It’s Fine)

**Problem:** At high load, packets may be dropped.

**Solution:** Monitor Telegraf’s own metrics. If losses become critical, increase the UDP buffers or add batching in the application.

Remember: losing 0.01% of metrics is better than the app crashing because of Redis.
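What does “monitor Telegraf’s own metrics” look like in practice? One option is Telegraf’s internal input plugin plus a larger OS receive buffer on the UDP listener. A sketch only; exact field names and the buffer-size format vary between Telegraf versions, so check your version’s docs:

```toml
# Sketch: watch Telegraf itself and give the UDP listener more headroom.

[[inputs.internal]]
  # Exposes Telegraf's own counters (e.g. gathered/written/dropped metrics,
  # output buffer size and limit) so you can alert on drops.
  collect_memstats = false

[[inputs.socket_listener]]
  service_address = "udp://:8089"
  data_format = "influx"
  # Ask the kernel for a bigger receive buffer to survive bursts
  # (you may also need to raise net.core.rmem_max on the host).
  read_buffer_size = "8MiB"
```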
### UDP Packet Size: Why Your Metrics Might Not Arrive

**Problem:** A UDP packet is limited to ~65 KB. With too many tags, you can exceed it.

**Solution:** Limit unique tags and use short names:

```php
// Bad: long tags with high cardinality
$this->metrics->collect('api_request', ['time' => 100], [
    'user_email' => $user->getEmail(),   // high cardinality
    'request_id' => uniqid(),            // unique every time
    'full_endpoint_path_with_parameters' => $request->getUri()
]);

// Good: short tags with low cardinality
$this->metrics->collect('api_request', ['time' => 100], [
    'endpoint' => '/api/users',
    'method' => 'GET',
    'status' => '200'
]);
```

Fewer unique tags = smaller packets = more reliable delivery.

## Alternative Scenarios

### VictoriaMetrics Instead of Prometheus

For high-load systems, Prometheus itself can become a bottleneck: high memory consumption, slow queries over large data volumes, and no clustering mode out of the box.

VictoriaMetrics is fully compatible with the Prometheus protocol but:

- is more efficient in storage,
- handles long queries faster,
- supports horizontal scaling.

That makes it a more reliable choice for systems pushing hundreds of thousands of metrics per second.

### Sending Metrics to Multiple Systems Simultaneously

```toml
[[outputs.prometheus_client]]
  listen = ":9273"

[[outputs.influxdb_v2]]
  urls = ["http://influxdb:8086"]

[[outputs.graphite]]
  servers = ["graphite:2003"]
```

## Roadmap and Current Limitations

**Already works:**

- Production-ready
- Symfony 6.4+ and 7.0+
- Prometheus / VictoriaMetrics supported
- Zero-overhead delivery

Note: there is no test suite yet, but the bundle has been running stably in several highload projects for over a year.

## Final Thoughts

Switching to the push model with UDP + Telegraf gave us three key wins:

**Performance as a competitive advantage**

Latency reduced 60× (from 3 ms to 0.05 ms). At 200k RPM, that saves 10 minutes of CPU time per hour, allowing 15% more requests on the same hardware.

**Scaling without headaches**

Linear scaling: adding a new server now takes 30 seconds. Just deploy it with the same UDP endpoint. No Prometheus changes, no service discovery.

**System antifragility**

Complete isolation of failures: the metrics system can collapse entirely, and the app keeps running. This has saved us multiple times during monitoring infrastructure outages.

Metrics in PHP are not a luxury but a necessity for understanding what’s happening in production. The Telegraf UDP approach let us forget about scaling problems and focus on what really matters: business logic and user experience.

Yes, we sacrificed guaranteed delivery of every packet.
But in return, we got a system that withstands any load and never becomes a single point of failure, especially at critical peak moments.

The bundle is available on GitHub and Packagist.

P.S. If this saved you time reinventing the wheel, star the repo. Found a bug? Open an issue, and we’ll fix it.