How to Build a Status Monitoring Service in Go

Monitoring is a critical part of running reliable software, yet many teams only discover outages after user complaints start rolling in. Picture getting a Slack message at 2 AM telling you the API has been down for over an hour and nobody noticed until customers started complaining. In this tutorial, I will walk you through building a status monitoring application from scratch.

By the end of this article, you will have a system that:

- Probes your services on a schedule (HTTP, TCP, DNS, and so on)
- Detects outages and sends alerts to various communication channels (Teams, Slack, and so on)
- Opens and closes incidents automatically
- Exposes metrics for Prometheus and Grafana dashboards
- Runs in Docker

For this application, I will be using Go because it is fast, compiles to a single binary for cross-platform support, and handles concurrency well, which matters for an application that has to monitor many endpoints at the same time.

What we are building

We will build a Go application called "StatusD". It reads a config file with a list of services to monitor, probes them, and creates incidents and fires notifications when something goes wrong.

Tech stack used:

- Golang
- PostgreSQL
- Grafana (with Prometheus for metrics)
- Docker
- Nginx

Here is the high-level architecture:

┌─────────────────────────────────────────────────────────────────┐
│                         Docker Compose                          │
│ ┌──────────┐  ┌──────────┐  ┌───────────┐  ┌──────────────────┐ │
│ │ Postgres │  │Prometheus│  │  Grafana  │  │      Nginx       │ │
│ │    DB    │  │ (metrics)│  │(dashboard)│  │ (reverse proxy)  │ │
│ └────┬─────┘  └────┬─────┘  └─────┬─────┘  └────────┬─────────┘ │
│      │             │              │                 │           │
│      └─────────────┴─────────┬────┴─────────────────┘           │
│                              │                                  │
│                    ┌─────────┴─────────┐                        │
│                    │      StatusD      │                        │
│                    │   (our Go app)    │                        │
│                    └─────────┬─────────┘                        │
│                              │                                  │
└──────────────────────────────┼──────────────────────────────────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
          ┌────────┐       ┌────────┐       ┌────────┐
          │Service │       │Service │       │Service │
          │   A    │       │   B    │       │   C    │
          └────────┘       └────────┘       └────────┘

Project structure

Before we write any code, let's understand how the pieces fit together.

status-monitor/
├── cmd/statusd/
│   └── main.go              # Application entry point
├── internal/
│   ├── models/
│   │   └── models.go        # Data structures (Asset, Incident, etc.)
│   ├── probe/
│   │   ├── probe.go         # Probe registry
│   │   └── http.go          # HTTP probe implementation
│   ├── scheduler/
│   │   └── scheduler.go     # Worker pool and scheduling
│   ├── alert/
│   │   └── engine.go        # State machine and notifications
│   ├── notifier/
│   │   └── teams.go         # Teams/Slack integration
│   ├── store/
│   │   └── postgres.go      # Database layer
│   ├── api/
│   │   └── handlers.go      # REST API
│   └── config/
│       └── manifest.go      # Config loading
├── config/
│   ├── manifest.json        # Services to monitor
│   └── notifiers.json       # Notification channels
├── migrations/
│   └── 001_init_schema.up.sql
├── docker-compose.yml
├── Dockerfile
└── entrypoint.sh

Core data models

Here we define our types, which essentially means defining what a "monitored service" looks like. We will define four types:

- Asset: a service we want to monitor.
- ProbeResult: what happened when we checked an asset (response, latency, and so on).
- Incident: tracks when something went wrong, that is, when a ProbeResult returned an unexpected response, and when the service recovered.
- Notification: an alert or message sent to a defined communication channel, for example Teams, Slack, or email.

Let's define the types in code:

// internal/models/models.go
package models

import "time"

// Asset represents a monitored service
type Asset struct {
    ID                  string            `json:"id"`
    AssetType           string            `json:"assetType"` // http, tcp, dns, etc.
    Name                string            `json:"name"`
    Address             string            `json:"address"`
    IntervalSeconds     int               `json:"intervalSeconds"`
    TimeoutSeconds      int               `json:"timeoutSeconds"`
    ExpectedStatusCodes []int             `json:"expectedStatusCodes,omitempty"`
    Metadata            map[string]string `json:"metadata,omitempty"`
}

// ProbeResult contains the outcome of a single health check
type ProbeResult struct {
    AssetID   string
    Timestamp time.Time
    Success   bool
    LatencyMs int64
    Code      int    // HTTP status code
    Message   string // Error message if failed
}

// Incident tracks a service outage
type Incident struct {
    ID        string
    AssetID   string
    StartedAt time.Time
    EndedAt   *time.Time // nil if still open
    Severity  string
    Summary   string
}

// Notification is what we send to Slack/Teams
type Notification struct {
    AssetID   string
    AssetName string
    Event     string // "DOWN", "RECOVERY", "UP"
    Timestamp time.Time
    Details   string
}

Note the ExpectedStatusCodes field. Not every endpoint returns 200; some return 204 or a redirect. This field lets you define what "healthy" means for each service.

Database schema

We need a place to store probe results and incidents. We use PostgreSQL for this, and here is our schema:
-- migrations/001_init_schema.up.sql
CREATE TABLE IF NOT EXISTS assets (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    address TEXT NOT NULL,
    asset_type TEXT NOT NULL DEFAULT 'http',
    interval_seconds INTEGER DEFAULT 300,
    timeout_seconds INTEGER DEFAULT 5,
    expected_status_codes TEXT,
    metadata JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS probe_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    asset_id TEXT NOT NULL REFERENCES assets(id),
    timestamp TIMESTAMP WITH TIME ZONE NOT NULL,
    success BOOLEAN NOT NULL,
    latency_ms BIGINT NOT NULL,
    code INTEGER,
    message TEXT
);

CREATE TABLE IF NOT EXISTS incidents (
    id SERIAL PRIMARY KEY,
    asset_id TEXT NOT NULL REFERENCES assets(id),
    severity TEXT DEFAULT 'INITIAL',
    summary TEXT,
    started_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    ended_at TIMESTAMP
);

-- Indexes for common queries
CREATE INDEX IF NOT EXISTS idx_probe_events_asset_id_timestamp
    ON probe_events(asset_id, timestamp DESC);
CREATE INDEX IF NOT EXISTS idx_incidents_asset_id ON incidents(asset_id);
CREATE INDEX IF NOT EXISTS idx_incidents_ended_at ON incidents(ended_at);

The key insight is the index on probe_events(asset_id, timestamp DESC). Indexing by asset and then by timestamp allows fast queries for the most recent probe results of a given service.

Building the probe system

We want to support probes for multiple protocol types (HTTP(S), TCP, DNS, and so on) without writing a complex switch statement.
First, we define what a probe looks like:

// internal/probe/probe.go
package probe

import (
    "context"
    "fmt"

    "github.com/yourname/status/internal/models"
)

// Probe defines the interface for checking service health
type Probe interface {
    Probe(ctx context.Context, asset models.Asset) (models.ProbeResult, error)
}

// registry holds all probe types
var registry = make(map[string]func() Probe)

// Register adds a probe type to the registry
func Register(assetType string, factory func() Probe) {
    registry[assetType] = factory
}

// GetProbe returns a probe for the given asset type
func GetProbe(assetType string) (Probe, error) {
    factory, ok := registry[assetType]
    if !ok {
        return nil, fmt.Errorf("unknown asset type: %s", assetType)
    }
    return factory(), nil
}

Now implement the HTTP probe:

// internal/probe/http.go
package probe

import (
    "context"
    "io"
    "net/http"
    "time"

    "github.com/yourname/status/internal/models"
)

func init() {
    Register("http", func() Probe { return &httpProbe{} })
}

type httpProbe struct{}

func (p *httpProbe) Probe(ctx context.Context, asset models.Asset) (models.ProbeResult, error) {
    result := models.ProbeResult{
        AssetID:   asset.ID,
        Timestamp: time.Now(),
    }

    client := &http.Client{
        Timeout: time.Duration(asset.TimeoutSeconds) * time.Second,
        // The default client follows redirects, so a 301 listed in
        // ExpectedStatusCodes would never be observed. Report the first
        // response instead of following it.
        CheckRedirect: func(req *http.Request, via []*http.Request) error {
            return http.ErrUseLastResponse
        },
    }

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, asset.Address, nil)
    if err != nil {
        result.Success = false
        result.Message = err.Error()
        return result, err
    }

    start := time.Now()
    resp, err := client.Do(req)
    result.LatencyMs = time.Since(start).Milliseconds()
    if err != nil {
        result.Success = false
        result.Message = err.Error()
        return result, err
    }
    defer resp.Body.Close()

    // Drain the body (limited to 1MB) so the connection can be reused
    _, _ = io.Copy(io.Discard, io.LimitReader(resp.Body, 1024*1024))

    result.Code = resp.StatusCode

    // Check if status code is expected
    if len(asset.ExpectedStatusCodes) > 0 {
        for _, code := range asset.ExpectedStatusCodes {
            if code == resp.StatusCode {
                result.Success = true
                return result, nil
            }
        }
        result.Success = false
        result.Message = "unexpected status code"
    } else {
        // No explicit expectation: treat anything below 400 as healthy
        result.Success = resp.StatusCode < 400
    }

    return result, nil
}

The init() function runs automatically when the Go application starts. This adds the HTTP probe to the registry without changing any other code. Want to add TCP probes? Create tcp.go, implement the interface, and register it in init().

Scheduling and concurrency

We need to probe every asset on a schedule, and for that we use a worker pool. A worker pool lets us run several probes concurrently without running a probe goroutine for every service.

// internal/scheduler/scheduler.go
package scheduler

import (
    "context"
    "fmt"
    "sync"
    "time"

    "github.com/yourname/status/internal/models"
    "github.com/yourname/status/internal/probe"
)

type JobHandler func(result models.ProbeResult)

type Scheduler struct {
    workers int
    jobs    chan models.Asset
    tickers map[string]*time.Ticker
    handler JobHandler
    mu      sync.Mutex
    done    chan struct{}
    wg      sync.WaitGroup
}

func NewScheduler(workerCount int, handler JobHandler) *Scheduler {
    return &Scheduler{
        workers: workerCount,
        jobs:    make(chan models.Asset, 100),
        tickers: make(map[string]*time.Ticker),
        handler: handler,
        done:    make(chan struct{}),
    }
}

func (s *Scheduler) Start(ctx context.Context) {
    for i := 0; i < s.workers; i++ {
        s.wg.Add(1)
        go s.worker(ctx)
    }
}

func (s *Scheduler) ScheduleAssets(assets []models.Asset) error {
    s.mu.Lock()
    defer s.mu.Unlock()

    // time.NewTicker panics on a non-positive interval, so validate first
    for _, asset := range assets {
        if asset.IntervalSeconds <= 0 {
            return fmt.Errorf("asset %s: intervalSeconds must be positive", asset.ID)
        }
    }

    for _, asset := range assets {
        interval := time.Duration(asset.IntervalSeconds) * time.Second
        ticker := time.NewTicker(interval)
        s.tickers[asset.ID] = ticker

        s.wg.Add(1)
        go s.scheduleAsset(asset, ticker)
    }
    return nil
}

func (s *Scheduler) scheduleAsset(asset models.Asset, ticker *time.Ticker) {
    defer s.wg.Done()
    defer ticker.Stop()
    for {
        select {
        case <-s.done:
            return
        case <-ticker.C:
            // Do not block forever on a full queue during shutdown
            select {
            case s.jobs <- asset:
            case <-s.done:
                return
            }
        }
    }
}

func (s *Scheduler) worker(ctx context.Context) {
    defer s.wg.Done()
    for {
        select {
        case <-s.done:
            return
        case asset := <-s.jobs:
            p, err := probe.GetProbe(asset.AssetType)
            if err != nil {
                continue // unknown asset type: skip the job
            }
            result, _ := p.Probe(ctx, asset)
            s.handler(result)
        }
    }
}

// Stop signals every goroutine to exit and waits for them. The jobs channel
// is deliberately left open: a ticker goroutine could still be sending on it,
// and sending on a closed channel panics.
func (s *Scheduler) Stop() {
    close(s.done)
    s.wg.Wait()
}

Each asset gets its own ticker goroutine that only handles scheduling.
When it is time to check an asset, the ticker sends a probe job to a channel. A fixed number of worker goroutines listen on that channel and do the actual probing.

We don't run probes directly in the ticker goroutines because a probe can block while waiting for a network response or a timeout. With workers, we can control concurrency.

For example, with 4 workers and 100 assets, at most 4 probes run at any given time, even if all tickers fire simultaneously. The channel acts as a buffer for pending jobs, and a sync.WaitGroup ensures all workers shut down cleanly.

Incident detection: the state machine

When a probe fails, we don't automatically assume an outage; it could be a network blip. If it fails again, however, we create an incident. When the service recovers, we close the incident and send a notification.

This is a state machine: UP → DOWN → UP. The engine below is the simplest form of it: it acts on the first probe whose result differs from the last known state, so a retry-before-alerting rule would be an extension on top of this code.

Let's build the engine:

// internal/alert/engine.go
package alert

import (
    "context"
    "fmt"
    "sync"
    "time"

    "github.com/yourname/status/internal/models"
    "github.com/yourname/status/internal/store"
)

type NotifierFunc func(ctx context.Context, notification models.Notification) error

type AssetState struct {
    IsUp           bool
    LastProbeTime  time.Time
    OpenIncidentID string
}

type Engine struct {
    store      store.Store
    notifiers  map[string]NotifierFunc
    mu         sync.RWMutex
    assetState map[string]AssetState
}

func NewEngine(s store.Store) *Engine {
    return &Engine{
        store:      s,
        notifiers:  make(map[string]NotifierFunc),
        assetState: make(map[string]AssetState),
    }
}

func (e *Engine) RegisterNotifier(name string, fn NotifierFunc) {
    e.mu.Lock()
    defer e.mu.Unlock()
    e.notifiers[name] = fn
}

func (e *Engine) Process(ctx context.Context, result models.ProbeResult, asset models.Asset) error {
    e.mu.Lock()
    defer e.mu.Unlock()

    state, known := e.assetState[result.AssetID]
    if !known {
        // Assume a never-seen asset is up. Otherwise the zero value
        // (IsUp == false) would hide an outage at startup and report a
        // bogus recovery on the first successful probe.
        state.IsUp = true
    }
    state.LastProbeTime = result.Timestamp

    // State hasn't changed? Nothing to do.
    if state.IsUp == result.Success {
        e.assetState[result.AssetID] = state
        return nil
    }

    // Save the probe event that caused the transition
    if err := e.store.SaveProbeEvent(ctx, result); err != nil {
        return err
    }

    if result.Success && !state.IsUp {
        // Recovery!
        return e.handleRecovery(ctx, asset, state)
    } else if !result.Success && state.IsUp {
        // Outage!
        return e.handleOutage(ctx, asset, state, result)
    }
    return nil
}

func (e *Engine) handleOutage(ctx context.Context, asset models.Asset, state AssetState, result models.ProbeResult) error {
    incidentID, err := e.store.CreateIncident(ctx, asset.ID, fmt.Sprintf("Service %s is down", asset.Name))
    if err != nil {
        return err
    }

    state.IsUp = false
    state.OpenIncidentID = incidentID
    e.assetState[asset.ID] = state

    notification := models.Notification{
        AssetID:   asset.ID,
        AssetName: asset.Name,
        Event:     "DOWN",
        Timestamp: result.Timestamp,
        Details:   result.Message,
    }
    return e.sendNotifications(ctx, notification)
}

func (e *Engine) handleRecovery(ctx context.Context, asset models.Asset, state AssetState) error {
    if state.OpenIncidentID != "" {
        // Log rather than abort: the recovery notification should still go out
        if err := e.store.CloseIncident(ctx, state.OpenIncidentID); err != nil {
            fmt.Printf("closing incident %s failed: %v\n", state.OpenIncidentID, err)
        }
    }

    state.IsUp = true
    state.OpenIncidentID = ""
    e.assetState[asset.ID] = state

    notification := models.Notification{
        AssetID:   asset.ID,
        AssetName: asset.Name,
        Event:     "RECOVERY",
        Timestamp: time.Now(),
        Details:   "Service has recovered",
    }
    return e.sendNotifications(ctx, notification)
}

func (e *Engine) sendNotifications(ctx context.Context, notification models.Notification) error {
    // A failing channel is logged and must not block the other channels
    for name, notifier := range e.notifiers {
        if err := notifier(ctx, notification); err != nil {
            fmt.Printf("notifier %s failed: %v\n", name, err)
        }
    }
    return nil
}

Key insight: we track state in memory (assetState) for fast lookups, but persist incidents to the database for durability. If the process restarts, we can rebuild the state from the open incidents.

Sending notifications

When something breaks, people need to know. We need to send notifications to various communication channels.
Let's define our Teams notifier:

// internal/notifier/teams.go
package notifier

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"
    "time"

    "github.com/yourname/status/internal/models"
)

type TeamsNotifier struct {
    webhookURL string
    client     *http.Client
}

func NewTeamsNotifier(webhookURL string) *TeamsNotifier {
    return &TeamsNotifier{
        webhookURL: webhookURL,
        client:     &http.Client{Timeout: 10 * time.Second},
    }
}

func (t *TeamsNotifier) Notify(ctx context.Context, n models.Notification) error {
    emoji := "🟢"
    if n.Event == "DOWN" {
        emoji = "🔴"
    }

    card := map[string]interface{}{
        "type": "message",
        "attachments": []map[string]interface{}{
            {
                "contentType": "application/vnd.microsoft.card.adaptive",
                "content": map[string]interface{}{
                    "$schema": "http://adaptivecards.io/schemas/adaptive-card.json",
                    "type":    "AdaptiveCard",
                    "version": "1.4",
                    "body": []map[string]interface{}{
                        {
                            "type":   "TextBlock",
                            "text":   fmt.Sprintf("%s %s - %s", emoji, n.AssetName, n.Event),
                            "weight": "Bolder",
                            "size":   "Large",
                        },
                        {
                            "type": "FactSet",
                            "facts": []map[string]interface{}{
                                {"title": "Service", "value": n.AssetName},
                                {"title": "Status", "value": n.Event},
                                {"title": "Time", "value": n.Timestamp.Format(time.RFC1123)},
                                {"title": "Details", "value": n.Details},
                            },
                        },
                    },
                },
            },
        },
    }

    body, err := json.Marshal(card)
    if err != nil {
        return fmt.Errorf("encode card: %w", err)
    }
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, t.webhookURL, bytes.NewReader(body))
    if err != nil {
        return fmt.Errorf("build request: %w", err)
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := t.client.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    if resp.StatusCode >= 300 {
        return fmt.Errorf("Teams webhook returned %d", resp.StatusCode)
    }
    return nil
}

Teams uses Adaptive Cards for rich formatting. You can define similar notifiers for other communication channels such as Slack or Discord.

The REST API

We need endpoints to query the status of the services we monitor. For this we use Chi, a lightweight router that supports route parameters such as /assets/{id}.
Let's define the API:

// internal/api/handlers.go
package api

import (
    "encoding/json"
    "net/http"

    "github.com/go-chi/chi/v5"
    "github.com/go-chi/chi/v5/middleware"
    "github.com/yourname/status/internal/store"
)

type Server struct {
    store store.Store
    mux   *chi.Mux
}

func NewServer(s store.Store) *Server {
    srv := &Server{store: s, mux: chi.NewRouter()}
    srv.mux.Use(middleware.Logger)
    srv.mux.Use(middleware.Recoverer)

    srv.mux.Route("/api", func(r chi.Router) {
        r.Get("/health", srv.health)
        r.Get("/assets", srv.listAssets)
        r.Get("/assets/{id}/events", srv.getAssetEvents)
        r.Get("/incidents", srv.listIncidents)
    })
    return srv
}

func (s *Server) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    s.mux.ServeHTTP(w, r)
}

func (s *Server) health(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(map[string]string{"status": "healthy"})
}

func (s *Server) listAssets(w http.ResponseWriter, r *http.Request) {
    assets, err := s.store.GetAssets(r.Context())
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(assets)
}

func (s *Server) getAssetEvents(w http.ResponseWriter, r *http.Request) {
    id := chi.URLParam(r, "id")
    // Return the 100 most recent probe events for this asset
    events, err := s.store.GetProbeEvents(r.Context(), id, 100)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(events)
}

func (s *Server) listIncidents(w http.ResponseWriter, r *http.Request) {
    incidents, err := s.store.GetOpenIncidents(r.Context())
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    w.Header().Set("Content-Type", "application/json")
    json.NewEncoder(w).Encode(incidents)
}

The code above defines a small HTTP API server that exposes four read-only endpoints:

- GET /api/health - health check (is the service running?)
- GET /api/assets - list all monitored services
- GET /api/assets/{id}/events - get the probe history for a specific service
- GET /api/incidents - list open incidents

Dockerizing the application

Because Go compiles to a single binary, dockerizing the application is fairly straightforward. We are going to use a multi-stage build to keep the final image small:

# Dockerfile
FROM golang:1.24-alpine AS builder
WORKDIR /app
RUN apk add --no-cache git
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o statusd ./cmd/statusd/

FROM alpine:latest
WORKDIR /app
RUN apk --no-cache add ca-certificates
COPY --from=builder /app/statusd .
COPY entrypoint.sh .
RUN chmod +x /app/entrypoint.sh
EXPOSE 8080
ENTRYPOINT ["/app/entrypoint.sh"]

The final stage is just Alpine plus our binary, typically under 20MB.

The entrypoint script builds the database connection string from environment variables:

#!/bin/sh
# entrypoint.sh

DB_HOST=${DB_HOST:-localhost}
DB_PORT=${DB_PORT:-5432}
DB_USER=${DB_USER:-status}
DB_PASSWORD=${DB_PASSWORD:-status}
DB_NAME=${DB_NAME:-status_db}

DB_CONN_STRING="postgres://${DB_USER}:${DB_PASSWORD}@${DB_HOST}:${DB_PORT}/${DB_NAME}"

exec ./statusd \
  -manifest /app/config/manifest.json \
  -notifiers /app/config/notifiers.json \
  -db "$DB_CONN_STRING" \
  -workers 4 \
  -api-port 8080

Docker Compose: putting it all together

One file to rule them all:

# docker-compose.yml
version: "3.8"

services:
  postgres:
    image: postgres:15-alpine
    container_name: status_postgres
    environment:
      POSTGRES_USER: status
      POSTGRES_PASSWORD: changeme
      POSTGRES_DB: status_db
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./migrations:/docker-entrypoint-initdb.d
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U status"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - status_network

  statusd:
    build: .
    container_name: status_app
    environment:
      - DB_HOST=postgres
      - DB_PORT=5432
      - DB_USER=status
      - DB_PASSWORD=changeme
      - DB_NAME=status_db
    volumes:
      - ./config:/app/config:ro
    depends_on:
      postgres:
        condition: service_healthy
    networks:
      - status_network

  prometheus:
    image: prom/prometheus:latest
    container_name: status_prometheus
    volumes:
      - ./docker/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    networks:
      - status_network
    depends_on:
      - statusd

  grafana:
    image: grafana/grafana:latest
    container_name: status_grafana
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - grafana_data:/var/lib/grafana
    networks:
      - status_network
    depends_on:
      - prometheus

  nginx:
    image: nginx:alpine
    container_name: status_nginx
    volumes:
      - ./docker/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./docker/nginx/conf.d:/etc/nginx/conf.d:ro
    ports:
      - "80:80"
    depends_on:
      - statusd
      - grafana
      - prometheus
    networks:
      - status_network

networks:
  status_network:
    driver: bridge

volumes:
  postgres_data:
  prometheus_data:
  grafana_data:

A few things to note:

- PostgreSQL healthcheck: the statusd service waits until Postgres is actually ready. This prevents "connection refused" errors on first boot.
- Config mount: we mount ./config as read-only. Edit your manifest locally, and the running container sees the updated file the next time StatusD loads it.
- Nginx: routes external traffic to the Grafana and Prometheus dashboards.

Configuration files

The application reads two files: manifest.json and notifiers.json.

The manifest.json file lists the assets we want to monitor. Each asset needs an ID, a probe type, and an address. The intervalSeconds field controls how often we check (60 = once per minute), and expectedStatusCodes lets you define what "healthy" means. Some endpoints return 301 redirects or 204 No Content, and that's fine.
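The project layout lists internal/config/manifest.go for config loading, but the article does not show that file. As a rough illustration of how the fields described above could be parsed and validated, here is a minimal sketch. The Manifest type, the parseManifest function, and the validation rules are assumptions for illustration, not the author's implementation; the fallback values of 300 and 5 seconds are borrowed from the column defaults in the SQL schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Asset mirrors models.Asset (re-declared so the sketch compiles on its own).
type Asset struct {
	ID                  string            `json:"id"`
	AssetType           string            `json:"assetType"`
	Name                string            `json:"name"`
	Address             string            `json:"address"`
	IntervalSeconds     int               `json:"intervalSeconds"`
	TimeoutSeconds      int               `json:"timeoutSeconds"`
	ExpectedStatusCodes []int             `json:"expectedStatusCodes,omitempty"`
	Metadata            map[string]string `json:"metadata,omitempty"`
}

// Manifest is a hypothetical wrapper for the top-level "assets" array.
type Manifest struct {
	Assets []Asset `json:"assets"`
}

// parseManifest decodes the JSON, rejects incomplete or duplicate assets,
// and fills in defaults for missing intervals and timeouts.
func parseManifest(data []byte) (Manifest, error) {
	var m Manifest
	if err := json.Unmarshal(data, &m); err != nil {
		return m, fmt.Errorf("parse manifest: %w", err)
	}
	seen := make(map[string]bool)
	for i := range m.Assets {
		a := &m.Assets[i]
		if a.ID == "" || a.AssetType == "" || a.Address == "" {
			return m, fmt.Errorf("asset %d: id, assetType and address are required", i)
		}
		if seen[a.ID] {
			return m, fmt.Errorf("duplicate asset id %q", a.ID)
		}
		seen[a.ID] = true
		if a.IntervalSeconds <= 0 {
			a.IntervalSeconds = 300 // assumed default, same as the schema
		}
		if a.TimeoutSeconds <= 0 {
			a.TimeoutSeconds = 5 // assumed default, same as the schema
		}
	}
	return m, nil
}

func main() {
	raw := `{"assets":[{"id":"api-prod","assetType":"http",
		"address":"https://api.example.com/health","expectedStatusCodes":[200]}]}`
	m, err := parseManifest([]byte(raw))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	a := m.Assets[0]
	fmt.Printf("%s: every %ds, timeout %ds\n", a.ID, a.IntervalSeconds, a.TimeoutSeconds)
}
```

Filling in a positive interval at load time matters because time.NewTicker panics when given a non-positive duration, so an asset with a missing intervalSeconds would otherwise crash a ticker-based scheduler.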
// config/manifest.json
{
  "assets": [
    {
      "id": "api-prod",
      "assetType": "http",
      "name": "Production API",
      "address": "https://api.example.com/health",
      "intervalSeconds": 60,
      "timeoutSeconds": 5,
      "expectedStatusCodes": [200],
      "metadata": {
        "env": "prod",
        "owner": "platform-team"
      }
    },
    {
      "id": "web-prod",
      "assetType": "http",
      "name": "Production Website",
      "address": "https://www.example.com",
      "intervalSeconds": 120,
      "timeoutSeconds": 10,
      "expectedStatusCodes": [200, 301]
    }
  ]
}

The notifiers.json file controls where to send alerts. You define notification channels (Teams, Slack), then set policies for which channels fire on which events. Setting throttleSeconds: 300 means you won't get spammed more than once every 5 minutes for the same issue.

// config/notifiers.json
{
  "notifiers": {
    "teams": {
      "type": "teams",
      "webhookUrl": "https://outlook.office.com/webhook/your-webhook-url"
    }
  },
  "notificationPolicy": {
    "onDown": ["teams"],
    "onRecovery": ["teams"],
    "throttleSeconds": 300,
    "repeatAlerts": false
  }
}

Running it

docker-compose up -d

That's it. Five services spin up:

- PostgreSQL stores your data
- StatusD probes your services
- Prometheus collects metrics
- Grafana displays dashboards (http://localhost:80)
- Nginx routes everything

Check the logs:

docker logs -f status_app

You should see:

Loading assets manifest...
Loaded 2 assets
Loading notifiers config...
Loaded 1 notifiers
Connecting to database...
Starting scheduler...
[✓] Production API (api-prod): 45ms
[✓] Production Website (web-prod): 120ms

Summary

You now have a monitoring system that:

- Reads services from a JSON config
- Probes them on a schedule using a worker pool
- Detects outages and creates incidents
- Sends notifications to Teams/Slack
- Exposes metrics for Prometheus
- Runs in Docker with a single command

This tutorial should help you deploy a working monitoring system, but there is more under the hood than we covered here. In part two, we will talk about:

- Circuit breakers that prevent cascading failures when a service is flapping
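- Multi-tier escalation that alerts managers when the on-call engineer doesn't respond
- Alert deduplication that prevents notification storms
- Adaptive probe intervals that check more frequently during incidents
- Hot-reloading the configuration without restarting the service
- SLA calculation and compliance tracking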