Setting Up Prometheus Alertmanager on GPUs for an Improved ML Lifecycle

by Daniel (@danielcrouch), 11m, 2024/10/12


The quantity and variety of data are fueling the rise of complex, sophisticated ML algorithms for handling AI workloads. These algorithms need GPUs to run efficiently and to process planet-scale datasets in order to recognize patterns. GPUs deliver value and performance by handling complex computations, which has made demand for them skyrocket in the ML/AI domain.

Despite their usefulness, GPUs are very expensive. Advanced AI/ML workloads need robust observability and management strategies, rather than direct deployments alone, to boost performance and durability while optimizing cost. Aggregating GPU-level metrics can help produce better results and improve the AI/ML lifecycle.

Three Aggregatable GPU Metrics

Software workflows consist of multiple stages such as data loading, initialization, transformation, and data writing. The same stages apply to machine learning workloads with complex computations. Different parts of the GPU are used to meet the demand at each stage. Engineering teams need to know the allocation and utilization metrics for continuous training and deployment. This helps them make informed decisions and use 100% of the resources to extract maximum value.

From a machine learning standpoint, the following GPU components are used across the workflow stages once model training starts. We will look at these GPU components and the metrics they expose. Finally, we will learn how to capture them and use them with Prometheus Alertmanager to improve the overall ML lifecycle.




  1. GPU Memory: Data and ML pipelines rely heavily on memory. In big data processing, data is computed in memory for faster results. Model weights, gradients, and other hyperparameters/variables are loaded into GPU memory. Keeping track of memory utilization can help scale the model's training and increase its speed.
  2. GPU Cores: When models perform matrix-intensive operations and run forward/backward passes, the GPU cores handle these operations. Tensor core activity metrics help determine how well the hardware units are being used and whether there is room for improvement.
  3. SM Clock Frequencies: The operations running on the GPU cores rely on streaming multiprocessors (SMs) to perform the required mathematical computations. The clock frequencies help determine the computational speed.
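The three metric families above can be pictured as one sample record. The following is a minimal sketch and not part of the original workflow: the `GpuSample` class and its field names are hypothetical, and in practice the values would come from a GPU management library rather than being typed in by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSample:
    """One reading of the three metric families (hypothetical container)."""
    memory_used_mib: float
    memory_total_mib: float
    core_utilization_pct: float
    sm_clock_mhz: int

    @property
    def memory_used_pct(self) -> float:
        # Share of GPU memory currently in use, as a percentage.
        if self.memory_total_mib <= 0:
            raise ValueError("memory_total_mib must be positive")
        return 100.0 * self.memory_used_mib / self.memory_total_mib


# Made-up values, for illustration only.
sample = GpuSample(memory_used_mib=12288, memory_total_mib=16384,
                   core_utilization_pct=65.0, sm_clock_mhz=1410)
print(f"{sample.memory_used_pct:.1f}% of GPU memory in use")
```

Keeping the three readings together makes it easier to compare them at the same point in time during training.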

Capturing GPU Metrics

Running bash scripts directly on the GPU does not offer the flexibility of Python. With Python, we can query and analyze GPU metrics throughout the training process to understand its behavior. Using these metrics, we can set up alerts and remediations to handle critical scenarios.

With flexibility and extensibility in mind, we will set up Prometheus to scrape metrics and trigger threshold-based alerts, and use Python to scrape and log GPU metrics.

1. Memory Metrics

We assume the GPU is from NVIDIA and that the NVIDIA DCGM Exporter is already set up on your GPU host. We will define a Prometheus config that monitors and scrapes the metrics and triggers a Slack notification when a threshold is exceeded.

Through the Prometheus config, we target the GPU provisioned in a VPC at the address 172.28.61.90, with metrics exposed on port 9400.


    scrape_configs:
      - job_name: 'nvidia_gpu_metrics_scrapper'
        static_configs:
          - targets: ['172.28.61.90:9400']
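Before relying on the target above, it can help to check what the exporter actually returns. Prometheus exporters serve metrics as plain text, one sample per line, and the sketch below parses that text into tuples. It is a simplified parser (it does not handle escaped quotes in label values or trailing timestamps), the `parse_metrics` helper is hypothetical, and the sample text is made up for illustration.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = _LINE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical exporter output, for illustration only.
EXAMPLE = '''
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-1"} 87
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="node-1"} 12288
'''

for name, labels, value in parse_metrics(EXAMPLE):
    print(name, labels["gpu"], value)
```

Against a live exporter, the same function could be fed the body of an HTTP GET to the target, assuming it serves the conventional `/metrics` path (for example `http://172.28.61.90:9400/metrics`).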


With the target defined, we can write an expression that checks memory usage. If usage stays above 80% for two minutes, a critical alert is triggered.


    groups:
      - name: GPU_Memory_Alerts
        rules:
          - alert: HighGPUMemoryUsage
            # Metric names depend on the exporter configuration; adjust them
            # to match what your DCGM exporter actually exposes.
            expr: (dcgm_gpu_mem_used / dcgm_gpu_mem_total) * 100 > 80
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "GPU Memory usage is high on {{ $labels.instance }}"
              description: "GPU memory utilization has been over 80% for more than 2 minutes on {{ $labels.instance }}."
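The `for: 2m` clause means the expression must stay true for the whole period before the alert fires; a single dip below the threshold restarts the wait. The sketch below mimics that behavior on a list of samples. It is a simplified model rather than Prometheus's actual evaluation loop, and the `firing_times` helper and the readings are made up for illustration.

```python
def firing_times(samples, threshold, hold_seconds):
    """Return the timestamps at which a threshold rule with a hold period fires.

    samples: iterable of (timestamp_seconds, value) in ascending time order.
    """
    pending_since = None
    firing = []
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts      # condition just became true: pending
            if ts - pending_since >= hold_seconds:
                firing.append(ts)       # held long enough: firing
        else:
            pending_since = None        # any dip resets the hold period
    return firing


# Memory usage in percent, sampled every 30 seconds (made-up values).
readings = [(0, 70), (30, 85), (60, 90), (90, 88), (120, 75),
            (150, 82), (180, 84), (210, 86), (240, 91), (270, 93)]
print(firing_times(readings, threshold=80, hold_seconds=120))  # prints [270]
```

The first run above 80% (from 30 s to 90 s) never fires because the dip at 120 s resets it; only the second run lasts the full two minutes.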


The metrics can then be sent on as alerts. Slack offers some of the simplest integration options for setting up alerts. Using the YAML config below, we can send alerts to channels or to individual usernames in Slack.


    global:
      resolve_timeout: 2m

    route:
      receiver: 'slack_memory_notifications'
      group_wait: 5s
      group_interval: 2m
      repeat_interval: 1h

    receivers:
      - name: 'slack_memory_notifications'
        slack_configs:
          - api_url: 'https://databracket.slack.com/services/shrad/webhook/url'
            channel: 'databracket'
            username: 'Prometheus_Alertmanager'
            send_resolved: true
            title: 'GPU Memory Utilization >80% Alert'
            text: "A high memory utilization was observed triggering alert on GPU."
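It can be useful to confirm that the Slack webhook works on its own before wiring it into Alertmanager. The sketch below builds a message body and shows how it could be posted with the standard library. The helper names are hypothetical, the message wording is made up, and the actual POST is left commented out because it needs a real webhook URL.

```python
import json
import urllib.request


def build_slack_payload(instance, used_pct, threshold=80):
    """Build the JSON body for a Slack incoming-webhook message."""
    text = (f"GPU memory utilization on {instance} is {used_pct:.1f}% "
            f"(threshold {threshold}%).")
    return {"text": text}


def post_to_slack(webhook_url, payload, timeout=5):
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


payload = build_slack_payload("172.28.61.90:9400", 86.3)
print(json.dumps(payload))
# post_to_slack("<your webhook URL>", payload)  # needs a real webhook URL
```

If the test message arrives in the expected channel, any later delivery problem is more likely to sit in the Alertmanager routing than in the Slack integration.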


Monitoring and alerting are useful, but on their own they offer limited benefits, such as notifying us when something is wrong. To analyze behavior and make informed decisions, we also need to capture the metrics.

For this scenario, we will use DuckDB for data management and boto3 for working with AWS S3. Using the NVIDIA management library (pynvml), we can access and manage the GPU through code. We will buffer the logs in an in-memory DuckDB database and, for persistence, write snapshots of the data to S3 as Parquet files for ad hoc or real-time analysis.


    import time
    from datetime import datetime

    import boto3
    import duckdb
    import pynvml

    pynvml.nvmlInit()
    s3 = boto3.client('s3')

    # In-memory DuckDB database that buffers readings between uploads.
    con = duckdb.connect(database=':memory:')
    con.execute('''
        CREATE TABLE gpu_memory_usage (
            Timestamp VARCHAR,
            Overall_memory DOUBLE,
            Memory_in_use DOUBLE,
            Available_memory DOUBLE
        )
    ''')


    def get_gpu_memory_info(gpu_id=0):
        """Return total, used, and free GPU memory in MiB."""
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id)
        memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return {
            "Overall_memory": memory_info.total / (1024 ** 2),
            "Memory_in_use": memory_info.used / (1024 ** 2),
            "Available_memory": memory_info.free / (1024 ** 2)
        }


    def upload_parquet_to_s3(bucket_name, object_name, file_path):
        """Upload a local Parquet file to S3, reporting failures without raising."""
        try:
            s3.upload_file(file_path, bucket_name, object_name)
            print(f"Parquet file uploaded to S3: {object_name}")
        except Exception as e:
            print(f"Failed to upload Parquet to S3: {e}")


    def log_memory_utilization_to_parquet(bucket_name, filename, gpu_id=0, interval=1.0,
                                          local_file_path='gpu_memory_stats.parquet'):
        """Log GPU memory usage and upload a Parquet snapshot to S3 every minute."""
        last_upload = time.monotonic()
        try:
            while True:
                gpu_memory_info = get_gpu_memory_info(gpu_id)
                timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                con.execute('''
                    INSERT INTO gpu_memory_usage VALUES (?, ?, ?, ?)
                ''', (timestamp, gpu_memory_info['Overall_memory'],
                      gpu_memory_info['Memory_in_use'],
                      gpu_memory_info['Available_memory']))
                print(f"Logged at {timestamp}: {gpu_memory_info}")

                # Upload once a minute, measured by elapsed time so that a
                # sampling interval that skips second 0 cannot miss the upload.
                if time.monotonic() - last_upload >= 60:
                    # Write the buffered rows to the local file before uploading it.
                    con.execute(f"COPY gpu_memory_usage TO '{local_file_path}' (FORMAT PARQUET)")
                    object_name = f"{filename}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.parquet"
                    upload_parquet_to_s3(bucket_name, object_name, local_file_path)
                    con.execute("DELETE FROM gpu_memory_usage")
                    last_upload = time.monotonic()

                time.sleep(interval)
        except KeyboardInterrupt:
            print("Logging stopped by user.")


    bucket_name = 'prometheus-alerts-cof'
    filename = 'gpu_memory_stats'
    log_memory_utilization_to_parquet(bucket_name, filename, gpu_id=0, interval=2.0)
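Once readings are logged, a first analysis step might be a simple summary of memory usage over a window. The sketch below works on rows that use the same column names as the `gpu_memory_usage` table (`Overall_memory`, `Memory_in_use`). The `summarize_memory` helper and the readings are made up for illustration, and it assumes at least one row is present.

```python
from statistics import mean


def summarize_memory(rows, threshold_pct=80.0):
    """Summarize logged rows shaped like the gpu_memory_usage table."""
    usage = [100.0 * r["Memory_in_use"] / r["Overall_memory"] for r in rows]
    return {
        "samples": len(usage),
        "mean_pct": round(mean(usage), 2),
        "peak_pct": round(max(usage), 2),
        # Number of readings above the alerting threshold.
        "over_threshold": sum(1 for pct in usage if pct > threshold_pct),
    }


# Made-up readings in MiB, using the same column names as the logger.
rows = [
    {"Overall_memory": 16384.0, "Memory_in_use": 8192.0},
    {"Overall_memory": 16384.0, "Memory_in_use": 12288.0},
    {"Overall_memory": 16384.0, "Memory_in_use": 14745.6},
]
print(summarize_memory(rows))
```

A summary like this shows whether a training run sat near the 80% threshold most of the time or only spiked briefly, which matters when deciding whether to resize the workload.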


2. GPU Core Metrics

GPUs are known for their tensor cores. These are hardware units that can process multi-dimensional data. It is crucial to understand how the cores distribute or process the load and when they hit the threshold. We can implement auto-scaling rules on top of these alerts to handle workloads and avoid overheating or crashes.

As with memory monitoring, we will set up configurations to monitor and capture GPU core utilization metrics. When core utilization stays above 80% for one minute, a critical alert is sent. For moderate utilization, a status update is sent once it has lasted five minutes.


    groups:
      - name: gpu_alerts
        rules:
          - alert: HighGPUCoreUtilization
            expr: gpu_core_utilization > 80
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "GPU Core Utilization >80% Alert"
              description: "GPU core utilization is above 80% for over 1 minute."
          - alert: MediumGPUCoreUtilization
            expr: gpu_core_utilization > 50
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "GPU Core Utilization = Moderate"
              description: "GPU core utilization is above 50% for over 5 minutes."
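The two thresholds above define three bands of utilization. The sketch below maps a reading to the severity a person would expect to see, as a quick way to sanity-check the thresholds; the `classify_core_utilization` helper is hypothetical and it ignores the `for` hold periods.

```python
def classify_core_utilization(pct, warning_above=50, critical_above=80):
    """Map a utilization percentage to the severity band used by the rules."""
    if pct > critical_above:
        return "critical"
    if pct > warning_above:
        return "warning"
    return "ok"


for pct in (35, 50, 65, 80, 95):
    print(pct, classify_core_utilization(pct))
```

One difference from this sketch is worth noting: in Prometheus, a reading above 80% satisfies both expressions, so both alerts can be active at once. Alertmanager's `inhibit_rules` setting is one way to mute the warning while the critical alert is firing.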


Here, we get the device handle by index and call a method that returns the utilization rates. The response is then logged into the in-memory DuckDB database and written to the S3 bucket with the processing timestamp.


    # Reuses the imports, pynvml initialization, con, and upload_parquet_to_s3
    # from the memory example.
    con.execute('''
        CREATE TABLE gpu_core_usage (
            Timestamp VARCHAR,
            GPU_Utilization_Percentage DOUBLE
        )
    ''')


    def get_gpu_utilization(gpu_id=0):
        """Returns the GPU utilization percentage."""
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id)
        utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)
        return utilization.gpu


    def log_gpu_utilization_to_parquet(bucket_name, filename, gpu_id=0, interval=1.0,
                                       local_file_path='gpu_core_stats.parquet'):
        """Logs GPU utilization to a Parquet file and uploads it to S3 periodically."""
        last_upload = time.monotonic()
        try:
            while True:
                gpu_utilization = get_gpu_utilization(gpu_id)
                timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                con.execute('''
                    INSERT INTO gpu_core_usage VALUES (?, ?)
                ''', (timestamp, gpu_utilization))
                print(f"Logged at {timestamp}: GPU Utilization = {gpu_utilization}%")

                # Upload once a minute, measured by elapsed time so that a
                # sampling interval that skips second 0 cannot miss the upload.
                if time.monotonic() - last_upload >= 60:
                    con.execute(f"COPY gpu_core_usage TO '{local_file_path}' (FORMAT PARQUET)")
                    object_name = f"{filename}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.parquet"
                    upload_parquet_to_s3(bucket_name, object_name, local_file_path)
                    con.execute("DELETE FROM gpu_core_usage")
                    last_upload = time.monotonic()

                time.sleep(interval)
        except KeyboardInterrupt:
            print("Logging stopped by user.")


    # Example usage:
    bucket_name = 'prometheus-alerts-cof'
    filename = 'gpu_core_stats'
    log_gpu_utilization_to_parquet(bucket_name, filename, gpu_id=0, interval=2.0)

3. SM Clock Frequency Metrics

The speed at which computations happen is directly proportional to the streaming multiprocessor clock frequency. The SM clock frequency metric helps determine how quickly tensor or ML computations are triggered and completed.
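Under the proportionality described above, a drop in SM clock translates into a comparable drop in compute speed. The sketch below works through that arithmetic. It treats clock frequency as the only factor, which is a simplifying assumption (memory bandwidth and other limits are ignored), and both helper functions and the numbers are hypothetical.

```python
def relative_compute_speed(current_mhz, reference_mhz):
    """Estimate speed relative to a reference clock, assuming linear scaling."""
    if current_mhz <= 0 or reference_mhz <= 0:
        raise ValueError("clock frequencies must be positive")
    return current_mhz / reference_mhz


def estimated_duration(baseline_seconds, current_mhz, reference_mhz):
    """Scale a baseline duration by the inverse of the relative speed."""
    return baseline_seconds / relative_compute_speed(current_mhz, reference_mhz)


# A step that takes 10 s at 1500 MHz, with the clock throttled to 1200 MHz.
print(relative_compute_speed(1200, 1500))
print(estimated_duration(10.0, 1200, 1500))
```

In this example the throttled clock runs at 80% of the reference, so a 10-second step would be expected to take about 12.5 seconds.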


We can configure Prometheus to trigger alerts when the SM clock frequency exceeds 2000 MHz. We can also set up warning alerts so that we are notified when the frequency is close to a configured limit.


    groups:
      - name: gpu_sm_clock_alerts
        rules:
          - alert: LowSMClockFrequency
            expr: gpu_sm_clock_frequency < 500
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: "Low SM Clock Frequency"
              description: "The SM clock frequency is below 500 MHz for over 1 minute."
          - alert: HighSMClockFrequency
            expr: gpu_sm_clock_frequency > 2000
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "High SM Clock Frequency"
              description: "The SM clock frequency is above 2000 MHz for over 1 minute."


Collecting the SM clock metrics can be scripted just as easily, logging the readings to the in-memory database and to S3.


    # Reuses the imports, pynvml initialization, con, and upload_parquet_to_s3
    # from the memory example.
    con.execute('''
        CREATE TABLE sm_clock_usage (
            Timestamp VARCHAR,
            SM_Clock_Frequency_MHz INT
        )
    ''')


    def get_sm_clock_frequency(gpu_id=0):
        """Return the current SM clock frequency in MHz."""
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_id)
        sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)
        return sm_clock


    def log_sm_clock_to_parquet(bucket_name, filename, gpu_id=0, interval=1.0,
                                local_file_path='sm_clock_stats.parquet'):
        """Log the SM clock frequency and upload a Parquet snapshot to S3 every 10 seconds."""
        last_upload = time.monotonic()
        try:
            while True:
                sm_clock_frequency = get_sm_clock_frequency(gpu_id)
                timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                con.execute('''
                    INSERT INTO sm_clock_usage VALUES (?, ?)
                ''', (timestamp, sm_clock_frequency))
                print(f"Logged at {timestamp}: SM Clock Frequency = {sm_clock_frequency} MHz")

                # Upload every 10 seconds, measured by elapsed time rather than
                # by the wall-clock second, which the sampling loop can skip.
                if time.monotonic() - last_upload >= 10:
                    con.execute(f"COPY sm_clock_usage TO '{local_file_path}' (FORMAT PARQUET)")
                    object_name = f"{filename}_{datetime.now().strftime('%Y%m%d_%H%M%S')}.parquet"
                    upload_parquet_to_s3(bucket_name, object_name, local_file_path)
                    con.execute("DELETE FROM sm_clock_usage")
                    last_upload = time.monotonic()

                time.sleep(interval)
        except KeyboardInterrupt:
            print("Logging stopped by user.")


    bucket_name = 'prometheus-alerts-cof'
    filename = 'sm_clock_stats'
    log_sm_clock_to_parquet(bucket_name, filename, gpu_id=0, interval=2.0)


With these metrics, ML engineers can see what is happening under the hood. They can improve the training process by analyzing the metrics and setting up remediations for high-criticality and high-priority alerts.

Final Thoughts

Training a machine learning model is a complex and convoluted process. Without sound evidence and stats, it is difficult to conclude which model variants produce the best predictions at inference. We need GPU metrics to understand how the compute instance responsible for handling and processing the ML workloads is operating. With enough metrics and real-time alerts, ML teams can set up and streamline robust, durable ML pipelines that improve the overall ML lifecycle.