
Setting Up Prometheus Alertmanager on GPUs for an Improved ML Lifecycle

by Daniel | 2024/10/12

TL;DR: The quantity and variety of data fuel the rise of complex and sophisticated ML algorithms for handling AI workloads.

The quantity and variety of data fuel the rise of complex and sophisticated ML algorithms to handle AI workloads. These algorithms require GPUs to work efficiently and planet-scale datasets to learn to recognize patterns. GPUs are very efficient at delivering value and performance by handling complex computations, and their demand is skyrocketing in the ML/AI domain.


Despite their usefulness, GPUs are very expensive. Advanced AI/ML workloads need strong observability and management strategies, rather than ad hoc implementations, to improve performance and durability while optimizing costs. Aggregating GPU-level metrics can help avoid unnecessary overhead and improve the AI/ML lifecycle.

Three Aggregatable GPU Metrics

Software workloads consist of several stages, such as data loading, initialization, transformation, and data writing. The same stages apply to machine learning workloads, with the addition of complex computations. Different parts of the GPU are used to fulfill the demand at each stage. It is important for engineering teams to know the allocation and utilization metrics for continuous training and deployment. This helps make informed decisions and leverage 100% of the resources to extract maximum value.


From the machine learning standpoint, the following GPU components are exercised by the workflow stages once model training starts. We will walk through these GPU components and the metrics they expose. Finally, we will learn how to capture and leverage them with Prometheus Alertmanager to improve the overall ML lifecycle.




  1. GPU Memory: Data and ML pipelines depend heavily on memory. For big data processing, data is computed in memory for faster results. Model weights, gradients, and other hyperparameters/variables are loaded into GPU memory. Keeping track of memory utilization can help scale and boost a model's training speed.
  2. GPU Cores: When models perform matrix-intensive operations and apply forward/backward passes, the GPU cores handle these operations. Tensor core activity metrics help determine how well the hardware units are utilized and how much room there is for improvement.
  3. SM Clock Frequencies: The operations running on the GPU cores require streaming multiprocessors (SMs) to perform the desired mathematical computations. Clock frequencies can help determine the speed of computation (a combined snapshot of all three metrics is sketched right after this list).
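
As a quick orientation, here is a minimal sketch (assuming an NVIDIA GPU and the pynvml package, which the scripts later in this article also rely on) that reads a single sample of each of the three metric families:


import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU on the host

# 1. Memory: total/used framebuffer memory, reported in MiB
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"Memory used: {mem.used / 1024**2:.0f} / {mem.total / 1024**2:.0f} MiB")

# 2. Cores: overall GPU (core) utilization percentage
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"GPU core utilization: {util.gpu}%")

# 3. SM clock: current streaming multiprocessor clock in MHz
sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
print(f"SM clock: {sm_clock} MHz")

pynvml.nvmlShutdown()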

Capturing GPU Metrics

Running raw bash scripts directly on the GPU doesn't offer the flexibility of Python. We can query and analyze GPU metrics throughout the training process to understand its behavior. By leveraging these metrics, we can set up alerts and remediations to handle critical scenarios.


Considering flexibility and extensibility, we will set up Prometheus to scrape and fire alerts based on thresholds, and use Python to scrape and log the GPU metrics.

1. Memory Metrics

This assumes the GPU is an NVIDIA GPU and that the NVIDIA DCGM Exporter setup on it is complete. We will define a Prometheus config to monitor and scrape the metrics against a threshold and trigger a Slack notification if the threshold is exceeded.


We target the GPU provisioned in a VPC under subnet 172.28.61.90, which exposes its metrics on port 9400, through the Prometheus configs.


scrape_configs:
  - job_name: 'nvidia_gpu_metrics_scrapper'
    static_configs:
      - targets: ['172.28.61.90:9400']


Once the target is defined, we can derive an expression to check memory utilization every two minutes. When the threshold crosses 80%, a critical alert is triggered.


# Metric names depend on what your exporter actually exposes; adjust the expression accordingly.
groups:
  - name: GPU_Memory_Alerts
    rules:
      - alert: HighGPUMemoryUsage
        expr: (dcgm_gpu_mem_used / dcgm_gpu_mem_total) * 100 > 80
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GPU Memory usage is high on {{ $labels.instance }}"
          description: "GPU memory utilization is over 80% for more than 2 minutes on {{ $labels.instance }}."


The metrics can then be dispatched as alerts. Slack offers the easiest integration options for setting up alerts. So, using the YAML config below, we can enable alerting to groups or individual usernames in Slack.


global:
  resolve_timeout: 2m

route:
  receiver: 'slack_memory_notifications'
  group_wait: 5s
  group_interval: 2m
  repeat_interval: 1h

receivers:
  - name: 'slack_memory_notifications'
    slack_configs:
      - api_url: 'https://databracket.slack.com/services/shrad/webhook/url'
        channel: 'databracket'
        username: 'Prometheus_Alertmanager'
        send_resolved: true
        title: 'GPU Memory Utilization >80% Alert'
        text: "A high memory utilization was observed triggering alert on GPU."


Monitoring and alerting are useful but offer limited benefits on their own, such as notifying us when something is wrong. We need to capture the metrics for analysis and make informed decisions.


For this scenario, we will use DuckDB for data management and boto3 for AWS S3 manipulation. Using the NVIDIA management library (pynvml), we can access and manage the GPU through code. We will write the metrics to S3 as parquet files. For persistence, we will write the logs to an in-memory database using DuckDB and write a snapshot of the data to S3 for ad hoc or real-time analysis.


import time
import os
from datetime import datetime

import boto3
import duckdb
import pynvml

pynvml.nvmlInit()
s3 = boto3.client('s3')

con = duckdb.connect(database=':memory:')
con.execute('''
    CREATE TABLE gpu_memory_usage (
        Timestamp VARCHAR,
        Overall_memory DOUBLE,
        Memory_in_use DOUBLE,
        Available_memory DOUBLE
    )
''')

def get_gpu_memory_info(gpu_id=0):
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id)
    memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
    return {
        "Overall_memory": memory_info.total / (1024 ** 2),
        "Memory_in_use": memory_info.used / (1024 ** 2),
        "Available_memory": memory_info.free / (1024 ** 2)
    }

def upload_parquet_to_s3(bucket_name, object_name, file_path):
    try:
        s3.upload_file(file_path, bucket_name, object_name)
        print(f"Parquet file uploaded to S3: {object_name}")
    except Exception as e:
        print(f"Failed to upload Parquet to S3: {e}")

def log_memory_utilization_to_parquet(bucket_name, filename, gpu_id=0, interval=1.0,
                                      local_file_path='gpu_memory_stats.parquet'):
    try:
        while True:
            gpu_memory_info = get_gpu_memory_info(gpu_id)
            timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            con.execute('''
                INSERT INTO gpu_memory_usage VALUES (?, ?, ?, ?)
            ''', (timestamp,
                  gpu_memory_info['Overall_memory'],
                  gpu_memory_info['Memory_in_use'],
                  gpu_memory_info['Available_memory']))
            print(f"Logged at {timestamp}: {gpu_memory_info}")

            if int(datetime.now().strftime('%S')) % 60 == 0:  # Upload every minute
                # Snapshot the in-memory table to a local Parquet file before uploading.
                con.execute(f"COPY gpu_memory_usage TO '{local_file_path}' (FORMAT PARQUET)")
                object_name = f"{filename}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.parquet"
                upload_parquet_to_s3(bucket_name, object_name, local_file_path)

            time.sleep(interval)
    except KeyboardInterrupt:
        print("Logging stopped by user.")

bucket_name = 'prometheus-alerts-cof'
filename = 'gpu_memory_stats'
log_memory_utilization_to_parquet(bucket_name, filename, gpu_id=0, interval=2.0)
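
To close the loop on the "ad hoc analysis" mentioned above, here is a minimal sketch of querying the uploaded snapshots straight from S3 with DuckDB's httpfs extension. It assumes the bucket and file prefix from the script above, and that S3 credentials for DuckDB are already configured (for example via the s3_access_key_id/s3_secret_access_key settings or your environment); the region below is an assumption.


import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_region = 'us-east-1'")  # assumption: adjust to your bucket's region

# Per-minute peak and average GPU memory usage across all uploaded snapshots.
rows = con.execute("""
    SELECT substr(Timestamp, 1, 16) AS minute,
           max(Memory_in_use)       AS peak_mib,
           avg(Memory_in_use)       AS avg_mib
    FROM read_parquet('s3://prometheus-alerts-cof/gpu_memory_stats_*.parquet')
    GROUP BY 1
    ORDER BY 1
""").fetchall()

for minute, peak_mib, avg_mib in rows:
    print(f"{minute}  peak={peak_mib:.0f} MiB  avg={avg_mib:.0f} MiB")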


2. GPU Core Metrics

GPUs are well known for their tensor cores. These are hardware units that can process multi-dimensional data. It is important to understand how the cores distribute or process the load and when they hit their threshold. We can implement auto-scaling rules through these alerts to handle workloads and avoid overheating or crashes.
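
As a sketch of what such an auto-scaling hook could look like: Alertmanager can forward firing alerts to an HTTP endpoint through a webhook receiver, and a small service can react to them. The handler below is illustrative only; scale_out() is a hypothetical placeholder for whatever your platform exposes (an autoscaling group, a Kubernetes job queue, etc.), and it keys off the HighGPUCoreUtilization alert defined in the rules further below.


import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def scale_out():
    # Hypothetical placeholder: request an additional GPU worker from your orchestrator.
    print("Scale-out requested")


class AlertWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        # Alertmanager posts grouped alerts; each entry carries its labels and status.
        for alert in payload.get("alerts", []):
            if (alert.get("status") == "firing"
                    and alert.get("labels", {}).get("alertname") == "HighGPUCoreUtilization"):
                scale_out()
        self.send_response(200)
        self.end_headers()


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), AlertWebhookHandler).serve_forever()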


Similar to memory monitoring, we will set up configurations to monitor and capture the GPU core utilization metrics. When the core utilization exceeds 80% for a minute, a critical alert is sent, and for moderate utilization, a status update is sent every five minutes.


groups:
  - name: gpu_alerts
    rules:
      - alert: HighGPUCoreUtilization
        expr: gpu_core_utilization > 80
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU Core Utilization >80% Alert"
          description: "GPU core utilization is above 80% for over 1 minute."

      - alert: MediumGPUCoreUtilization
        expr: gpu_core_utilization > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU Core Utilization = Moderate"
          description: "GPU core utilization is above 50% for over 5 minutes."


Here, we get the device handle by index and call a method that returns the utilization rates. The response is then logged into the in-memory DuckDB database and pushed to the S3 bucket with a timestamp.


con.execute('''
    CREATE TABLE gpu_core_usage (
        Timestamp VARCHAR,
        GPU_Utilization_Percentage DOUBLE
    )
''')

def get_gpu_utilization(gpu_id=0):
    """Returns the GPU utilization percentage."""
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id)
    utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)
    return utilization.gpu

def log_gpu_utilization_to_parquet(bucket_name, filename, gpu_id=0, interval=1.0,
                                   local_file_path='gpu_core_stats.parquet'):
    """Logs GPU utilization to a Parquet file and uploads it to S3 periodically."""
    try:
        while True:
            gpu_utilization = get_gpu_utilization(gpu_id)
            timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            con.execute('''
                INSERT INTO gpu_core_usage VALUES (?, ?)
            ''', (timestamp, gpu_utilization))
            print(f"Logged at {timestamp}: GPU Utilization = {gpu_utilization}%")

            if int(datetime.now().strftime('%S')) % 60 == 0:
                con.execute(f"COPY gpu_core_usage TO '{local_file_path}' (FORMAT PARQUET)")
                object_name = f"{filename}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.parquet"
                upload_parquet_to_s3(bucket_name, object_name, local_file_path)
                con.execute("DELETE FROM gpu_core_usage")

            time.sleep(interval)
    except KeyboardInterrupt:
        print("Logging stopped by user.")

# Example usage:
bucket_name = 'prometheus-alerts-cof'
filename = 'gpu_core_stats'
log_gpu_utilization_to_parquet(bucket_name, filename, gpu_id=0, interval=2.0)

3. SM Clock Frequency Metrics

The speed at which computations are performed is directly proportional to the streaming multiprocessor (SM) clock frequency. The SM clock frequency metric helps determine how fast the tensor or ML computations are being launched and completed.
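
Before wiring up alerts, it helps to know what "fast" means for a given card. Here is a minimal sketch (again assuming pynvml, as used elsewhere in this article) that compares the current SM clock with the device's maximum SM clock, which gives a quick read on down-clocking or throttling during training:


import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Current vs. maximum supported SM clock, both in MHz.
current_sm = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
max_sm = pynvml.nvmlDeviceGetMaxClockInfo(handle, pynvml.NVML_CLOCK_SM)

ratio = current_sm / max_sm * 100
print(f"SM clock: {current_sm} MHz ({ratio:.0f}% of max {max_sm} MHz)")

pynvml.nvmlShutdown()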


We can enable Prometheus to trigger alerts when the SM clock frequency exceeds 2000 MHz. We can also set up warning alerts to get notified when the frequency is nearing that limit.


groups:
  - name: gpu_sm_clock_alerts
    rules:
      - alert: ElevatedSMClockFrequency
        expr: gpu_sm_clock_frequency >= 1000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "SM Clock Frequency Nearing Limit"
          description: "The SM clock frequency has been at or above 1000 MHz for over 1 minute, approaching the 2000 MHz limit."

      - alert: HighSMClockFrequency
        expr: gpu_sm_clock_frequency > 2000
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High SM Clock Frequency"
          description: "The SM clock frequency is above 2000 MHz for over 1 minute."


Collecting SM clock metrics is straightforward: a simple script can log the metrics to the in-memory database and write them to S3.


con.execute('''
    CREATE TABLE sm_clock_usage (
        Timestamp VARCHAR,
        SM_Clock_Frequency_MHz INT
    )
''')

def get_sm_clock_frequency(gpu_id=0):
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id)
    sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
    return sm_clock

def log_sm_clock_to_parquet(bucket_name, filename, gpu_id=0, interval=1.0,
                            local_file_path='sm_clock_stats.parquet'):
    try:
        while True:
            sm_clock_frequency = get_sm_clock_frequency(gpu_id)
            timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            con.execute('''
                INSERT INTO sm_clock_usage VALUES (?, ?)
            ''', (timestamp, sm_clock_frequency))
            print(f"Logged at {timestamp}: SM Clock Frequency = {sm_clock_frequency} MHz")

            if int(datetime.now().strftime('%S')) % 10 == 0:
                con.execute(f"COPY sm_clock_usage TO '{local_file_path}' (FORMAT PARQUET)")
                object_name = f"{filename}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.parquet"
                upload_parquet_to_s3(bucket_name, object_name, local_file_path)
                con.execute("DELETE FROM sm_clock_usage")

            time.sleep(interval)
    except KeyboardInterrupt:
        print("Logging stopped by user.")

bucket_name = 'prometheus-alerts-cof'
filename = 'sm_clock_stats'
log_sm_clock_to_parquet(bucket_name, filename, gpu_id=0, interval=2.0)


By leveraging these metrics, ML engineers can know what is happening under the hood. They can improve the training process by analyzing the metrics and setting up remediations for high-criticality and high-priority alerts.

Final Thoughts

Machine learning model training is a complex and convoluted process. Without sound evidence and statistics, it is hard to conclude which model variants deliver high-quality predictions at inference. We need GPU metrics to understand how the compute instance responsible for running the ML workloads and operations is performing. With sufficient metrics and real-time alerts, ML teams can set up and streamline robust, durable ML pipelines that improve the overall ML lifecycle.