
Web Scraping Optimization: Tips for Faster, Smarter Scrapers

by Bright Data · 8 min read · 2024/11/15

Too Long; Didn't Read

Optimizing web scrapers means applying advanced techniques that make data extraction more efficient. Key tips include implementing error handling to deal with connection issues and page failures, using retries with exponential backoff to avoid overloading servers, and writing flexible CSS or XPath selectors that don't break when the page structure changes. For faster scraping, use a fast HTML parser like lxml and leverage HTTP/2 to cut latency by sending multiple requests over a single connection. These techniques help scrapers scale, improving both speed and resilience in large-scale operations.

Disclaimer: This is Part 3 of our six-part series on Advanced Web Scraping. New to the series? Catch up from the beginning by reading Part 1!

In Part 2 of our Advanced Web Scraping series, you learned how to scrape data from SPAs, PWAs, and AI-powered sites. By now, you should have all the knowledge you need to build a scraper that works against most modern websites.

What's next? Time to optimize your scraper with some pro scraping tips and tricks!

Web Scraping Made Simple... Or Is It?

Building a web scraper? It's all about scripting. 👨‍💻

Let's be honest: if you've ever written code, you know that scripting isn't that hard most of the time. A few lines here, a for loop there, and boom, you're scraping data like a pro. Sounds simple, right? 😄

But here's the problem: the ease of writing a small scraper can lull you into a false sense of security. Why bother with proper comments, error handling, logs, or even neat indentation when it's just a dozen lines of code anyone can read?

You don't need comments… or do you?

We get it: why overengineer something that doesn't need it? Overengineering is the enemy of progress. But what happens when you need to scale your scrapers to multiple pages or even entire sites? 🤔

That's when your quick-and-dirty, spaghetti-code scraper falls apart! 🍝

Developers may get angry when you touch their code

That's why you need some advanced web scraping tips.

Optimizing Web Scraping: Top 7 Tips and Tricks

You must have already heard the usual web scraping tips: prioritize pages with critical data first, randomize your requests, and so on. Great advice, but let's be honest, those tricks are old news. 📰

When you're dealing with more advanced scenarios, those basics might not cut it. If you really want to level up your scraping game, you'll need to explore the techniques below.

Ready? Buckle up: it's time to take your web scraping skills to the next level! 💪

Warning: Don't worry if some of the tips feel familiar. Keep going! There are plenty of interesting insights as you dive deeper! 🤿

Implement Error Handling

One of the most common mistakes in web scraping is forgetting that the Internet isn't some magical, infallible technology. When you send a request to a site, a whole range of things can (and will, at some point) go wrong. ❌

Let's look at some common scenarios:

  • Your Wi-Fi or connection might drop out momentarily

  • The server hosting the website may be unavailable

  • The page you're looking for may not exist

  • The target site might be experiencing a temporary slowdown, leading to a timeout error

Now, mix in data parsing, preprocessing, and exporting to a database, and you've got a perfect recipe for chaos. 💥

Adding all the scraping elements to the mix

So, what's the solution? Error handling! 🛡️

Error handling is your best friend in web scraping. Your script will likely process dozens (or thousands) of pages, and a single error shouldn't bring your whole operation down.

Remember that the try ... catch block is your friend. Use it to wrap your requests and your processing logic. Also, keep in mind that most HTTP libraries don't raise exceptions for bad HTTP responses (like 404 or 500). 😲

If you're not familiar with HTTP status codes, see the video below:


For example, in Python's requests library you need to manually check the response status code, as follows:

    import requests

    response = requests.get("https://example.com")
    if response.status_code == 200:
        # handle the successful response...
        pass
    else:
        # handle the error response...
        pass

Or, equivalently, use the raise_for_status() method:

    import requests

    try:
        response = requests.get("https://example.com")
        # raises an HTTPError for bad responses (4xx or 5xx)
        response.raise_for_status()
        # handle the successful response...
    except requests.exceptions.HTTPError as http_err:
        # handle an HTTP error...
        pass
    except requests.exceptions.RequestException as req_err:
        # handle a request error...
        pass
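To make the "one error shouldn't bring your whole operation down" idea concrete, here is a minimal sketch of a multi-page loop that isolates failures per URL. It uses only the standard library rather than requests; the names fetch_html and scrape_pages are hypothetical helpers invented for this illustration, not part of any library.

```python
import logging
import urllib.request

logger = logging.getLogger("scraper")


def fetch_html(url, timeout=10):
    """Download a page with the standard library (assumed helper).

    urlopen raises HTTPError for 4xx/5xx responses and URLError for
    connection problems; both are subclasses of OSError.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_pages(urls, fetch=fetch_html):
    """Fetch every URL, recording failures instead of crashing."""
    pages = {}
    failed = []
    for url in urls:
        try:
            pages[url] = fetch(url)
        except OSError as exc:
            # Covers HTTP errors, dropped connections, and timeouts.
            logger.warning("Skipping %s: %s", url, exc)
            failed.append(url)
    return pages, failed
```

Calling scrape_pages with a list of URLs returns the pages that loaded plus the URLs that failed, so the failed ones can be logged, inspected, or retried later.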

Failure Recovery with Request Retries

Your advanced web scraping script shouldn't just handle errors, it should also recover from them. Since most errors in web scraping are tied to making web requests, you can significantly improve your scraper's effectiveness by implementing retryable requests.

The idea is simple: if a request fails, you try it again, once, twice, three times, or more, until it succeeds. 🔄

But here's the catch: since one of the most common reasons for a failed request is that the target server is down or slow, you don't want to overwhelm it by sending the same request over and over in a short period of time.

Luckily, it's not that complex…

If a request fails right now, it's likely to fail again immediately. That's where exponential backoff comes into play!

Instead of retrying instantly, this technique gradually increases the time between retries, improving your chances of success by giving the target server time to recover. ⏳
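As a rough illustration of exponential backoff, the sketch below retries a callable and doubles the wait after each failure, with a little random jitter added on top. The function name retry_with_backoff and all of its parameters are assumptions made for this example; a real scraper would typically catch only transient errors (timeouts, 5xx responses) rather than every exception.

```python
import random
import time


def retry_with_backoff(operation, max_retries=3, base_delay=1.0,
                       max_delay=30.0, jitter=0.1, sleep=time.sleep):
    """Call operation(), retrying with exponentially growing delays.

    The wait before retry n is base_delay * 2**n (capped at max_delay),
    plus up to `jitter` times that value of random extra time.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                # Out of retries: let the caller see the last error.
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            delay += random.uniform(0, delay * jitter)
            sleep(delay)
```

The sleep parameter is injectable mainly so the delays can be observed or skipped in tests; in normal use the default time.sleep applies.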


While you can manually implement simple retry strategies with custom code, many HTTP clients come with built-in utilities or companion libraries to handle retries automatically. For example, Axios has the axios-retry library, which you can use like this:

    const axios = require("axios");
    // Depending on the axios-retry version, the CommonJS import
    // may need to be require("axios-retry").default
    const axiosRetry = require("axios-retry");

    // retry failed requests up to 3 times, with exponentially increasing delays
    axiosRetry(axios, { retries: 3, retryDelay: axiosRetry.exponentialDelay });

    axios.get("https://example.com")
      .then(response => console.log(response.data))
      .catch(error => console.log("Request failed:", error));

Similarly, Python's urllib3 package comes with a Retry class that integrates seamlessly with most Python HTTP clients.
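One common way to wire the Retry class into requests is through an HTTPAdapter mounted on a Session. The sketch below assumes the requests package is installed; the helper name build_retrying_session, the retry count, and the list of status codes are illustrative choices, not recommendations from the article.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_retrying_session(total_retries=3, backoff_factor=1.0):
    """Return a requests.Session that retries failed requests.

    backoff_factor controls the exponentially growing pause between
    attempts; status_forcelist lists the HTTP status codes that should
    trigger a retry (an assumed selection for this example).
    """
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Apply the retry policy to both plain and TLS connections.
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

A session built this way is used like any other, for example session.get("https://example.com"), and the retries happen transparently inside the adapter.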

Write Generic Selectors

When inspecting elements in DevTools, you might be tempted to right-click and choose the "Copy selector" option:

Copying the selector from DevTools

But be warned, the result might look something like this:

 #__next > div > main > div.sc-d7dc08c8-0.fGqCtJ > div.sc-93e186d7-0.eROqxA > h1

That's definitely not ideal for web scraping….

Oh, noooo!

The problem? Overly specific selectors like this one can break easily when the page structure changes. The more detailed your selector, the more fragile it becomes.

To make your web scraping more resilient, you must keep your selectors flexible. Instead of relying on style-related classes (which change all the time), focus on attributes that are less likely to change, like id, data-, or aria-. Most of those attributes are meant for testing and accessibility, so they tend to stay stable over time. 💡
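To see the difference in practice, the sketch below runs a brittle, copy-pasted path and a generic attribute-based one against two versions of the same page. It uses the standard library's ElementTree (which only handles well-formed markup and a subset of XPath) purely to stay dependency-free; a real scraper would use a proper HTML parser. The data-testid attribute, the class names of the redesigned page, and the helper names are all made up for the example.

```python
import xml.etree.ElementTree as ET


def render_page(outer_class, inner_class):
    """Build a tiny well-formed page with the given generated classes."""
    return f"""
<div id="__next"><div><main>
  <div class="{outer_class}">
    <div class="{inner_class}">
      <h1 data-testid="product-title">Example Product</h1>
    </div>
  </div>
</main></div></div>
"""


# Brittle: mirrors the full DOM path and the auto-generated class names.
BRITTLE_PATH = (
    "./div/main/div[@class='sc-d7dc08c8-0 fGqCtJ']"
    "/div[@class='sc-93e186d7-0 eROqxA']/h1"
)
# Generic: relies on one stable, purpose-built attribute.
ROBUST_PATH = ".//h1[@data-testid='product-title']"


def find_title(html, path):
    """Return the text of the first match for path, or None."""
    root = ET.fromstring(html.strip())
    node = root.find(path)
    return node.text if node is not None else None


original = render_page("sc-d7dc08c8-0 fGqCtJ", "sc-93e186d7-0 eROqxA")
# Same page after a redeploy regenerated the style classes.
redesigned = render_page("sc-a1b2c3d4-0 hXkLmN", "sc-e5f6a7b8-0 pQrStU")
```

Both paths find the title in original, but only ROBUST_PATH still matches in redesigned, where find_title returns None for BRITTLE_PATH.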


While CSS selectors are easier to read and understand, XPath offers more power. But don't worry: you can often achieve the same results with simple CSS selectors, saving you from needing complex XPath code. 😌

For more information on that, take a look at our guide on XPath vs CSS selectors!

Prefer Fast HTML Parsers

Parsing HTML pages takes time and resources, especially if you're dealing with a large DOM. If your scraper only parses a few pages, that's not a big deal.

Now, what happens when your scraping operation scales up and you have to extract data from millions of pages? That small overhead can quickly drain server resources and add hours to your total scraping time. ⏳

To get a deeper understanding, refer to these resources:

Looking for a full comparison? Read our article on the best HTML parsers.

The good news? Switching from one parser to another isn't that hard. For example, in BeautifulSoup, it's just a simple parameter change:

    from bs4 import BeautifulSoup

    # html_content is the HTML string you downloaded

    # using the built-in html.parser
    soup = BeautifulSoup(html_content, "html.parser")

    # or using the lxml parser (requires the lxml package)
    soup = BeautifulSoup(html_content, "lxml")

What about the HTML parsers built into browsers like Chrome? 🤔

Find out more in the video below:

Harness HTTP/2 for Faster Requests

HTTP/2 is an updated version of HTTP that allows multiple requests over a single connection. This reduces latency and can improve the overall performance of your scraping task.

To check whether a site supports HTTP/2, simply open DevTools in your browser, go to the "Network" tab, and look at the "Protocol" column. If it says h2, the site is using HTTP/2:

google.com uses HTTP/2

Unfortunately, not all HTTP clients and scraping libraries support HTTP/2. However, tools like HTTPX for Python offer full support for HTTP/2.

Task Parallelization

Web scraping is mostly an I/O-bound task: you send requests to the server, wait for the response, process the data, and repeat. During the wait time, your scraper is basically idle, which is inefficient.

The speed your scraper processes requests at

The solution? Parallelism or concurrency!

By sending multiple requests at once, you can minimize those dead times and optimize network usage.

🚨 But be careful! 🚨

Bombarding a server with too many simultaneous requests can lead to rate limiting or getting your IP banned, two popular anti-scraping measures. 😬
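One way to get concurrency without flooding the target is a thread pool with a small, fixed number of workers. The sketch below is a minimal version of that idea using the standard library; scrape_concurrently and its parameters are hypothetical names, fetch stands for whatever download function your scraper already has, and the default of 5 workers is an arbitrary example value, not a recommended limit.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_concurrently(urls, fetch, max_workers=5):
    """Fetch URLs in parallel with at most max_workers in flight.

    Returns (results, failures): results maps each URL to whatever
    fetch returned, failures maps each URL to the exception it raised.
    """
    results = {}
    failures = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which future belongs to which URL.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                failures[url] = exc
    return results, failures
```

Threads suit this case because the work is I/O-bound: while one worker waits on the network, the others keep going. Lowering max_workers is the simplest knob for being gentler on the target server.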


Pro tip: You can also parallelize parsing tasks, especially if you have multiple CPUs available, which will speed up the data extraction process. ⚡

Embrace AI-based Adaptive Algorithms

AI-based adaptive algorithms learn from patterns in the data and in the HTML page structure, adjusting their behavior in real time to stay on top of changes. 😮

That's a game-changer for web scraping! 🤯

When websites update their layout or deploy anti-bot measures, these algorithms can quickly adapt, so your scraper keeps running smoothly. 🧠

In short, they make scrapers smarter, helping you extract data efficiently even when the site throws unexpected curveballs. ⚾ With adaptive algorithms, it's like having a scraper that evolves over time!

Learn more in Chapter 4 of this video by Forrest Knight:

The Best Tool for Web Scraping Optimization

Sure, all the tips and tricks we've mentioned so far can make your scraper faster, more reliable, more robust, and more effective. But let's be realistic: they also bring a lot of complexity. 😅

The good news is that most of these lessons apply to the vast majority of scraping projects. So, instead of coding everything from scratch, you could use pre-built functions to tackle specific tasks. That's exactly what Bright Data's Scraping Functions offer!

With 73+ ready-made JavaScript functions, users have built over 38K scrapers operating in 195+ countries. That's a ton of scraping power! 📈

Speed up your development with a runtime environment designed to scrape, unlock, and scale web data collection effortlessly:

Final Thoughts

Now you know how to level up your scraper with insights from experienced scraping developers!

Remember that this is only Part 3, so we're just halfway through our six-part journey into advanced web scraping! Keep your seatbelt fastened, because we're about to dive into even more cutting-edge techniques, clever solutions, and insider tips.

Next stop? Harnessing the power of AI-driven proxy management! 🌐
