❗ Disclaimer: This is Part 3 of our six-part series on Advanced Web Scraping. New to the series? Start from the beginning by reading Part 1!





In Part 2 of our Advanced Web Scraping series, you learned how to scrape data from SPAs, PWAs, and AI-powered sites. By now, you should have all the knowledge you need to build a scraper that works against most modern websites.





What's next? Time to optimize your scraper with some pro scraping tips and tricks!

Web Scraping Made Simple, or Is It?

Here's why you need advanced web scraping tips.

Optimizing Web Scraping: Top 7 Tips and Tricks

⚠️ Warning: Don't worry if some of these tips feel familiar. Keep going! There are plenty of interesting insights as you dive deeper! 🤿

Implement Error Handling

One of the most common mistakes in web scraping is forgetting that the Internet is not a magical, infallible technology. When you send a request to a site, a whole range of things can (and, at some point, will) go wrong. ❌





Let's look at some common scenarios:

Your Wi-Fi or network connection might drop momentarily

The server hosting the website may be unavailable

The page you are looking for may no longer exist

The target site might be experiencing a temporary slowdown, leading to a timeout error



So, what's the solution? Error handling! 🛡️





Error handling is your best friend in web scraping. Your script will likely process dozens (or thousands) of pages, and a single error should not bring down your whole operation.





Remember that the try ... catch block is your friend. Use it to wrap your requests and your processing logic. Also, keep in mind that most HTTP libraries do not raise exceptions for bad HTTP responses (like 404 or 500). 😲





If you are not familiar with HTTP status codes, see the video below:





For example, in Python's requests library, you need to manually check the response status code as follows:

    import requests

    response = requests.get("https://example.com")
    if response.status_code == 200:
        # handle the successful response...
        print("Success:", len(response.text), "characters received")
    else:
        # handle the error response...
        print("Request failed with status code:", response.status_code)









Or, equivalently, use the raise_for_status() method:

    import requests

    try:
        response = requests.get("https://example.com")
        # raises an HTTPError for bad responses (4xx or 5xx)
        response.raise_for_status()
        # handle the successful response...
        print("Success:", response.status_code)
    except requests.exceptions.HTTPError as http_err:
        # handle an HTTP error...
        print("HTTP error:", http_err)
    except requests.exceptions.RequestException as req_err:
        # handle a request error...
        print("Request error:", req_err)
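To keep one failed page from stopping a whole run, the same idea can be applied per URL. The following is a minimal sketch using only the standard library; the names fetch_html and scrape_all are hypothetical helpers introduced for illustration, not part of any library.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download a page with the standard library and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_all(urls, fetch=fetch_html):
    """Fetch every URL, recording failures instead of letting them stop the run."""
    pages, failures = {}, {}
    for url in urls:
        try:
            pages[url] = fetch(url)
        except (OSError, ValueError) as err:
            # URLError, HTTPError (404, 500, ...) and timeouts are all OSError
            # subclasses; ValueError covers malformed URLs
            failures[url] = repr(err)
    return pages, failures
```

Calling scrape_all(list_of_urls) returns two dictionaries, so the failed URLs can be logged or retried later while the successful pages move on to parsing.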

Failure Recovery With Request Retries

Your advanced web scraping script should not only handle errors but also recover from them. Since most scraping errors are tied to making web requests, you can significantly improve your scraper's reliability by implementing retryable requests.





The concept is simple: if a request fails, you try it again (once, twice, three times, or more) until it succeeds. 🔄





If a request fails now, it is likely to fail again if you retry immediately. That is where exponential backoff comes into play!





Instead of retrying instantly, this technique gradually increases the wait between retries, improving your chances of success by giving the target server time to recover. ⏳
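As a rough illustration of exponential backoff, here is a minimal hand-rolled sketch. The function name retry_with_backoff and its parameters are assumptions made for this example; the sleep argument is injectable only so the waiting can be swapped out in tests.

```python
import time


def retry_with_backoff(task, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call task(); on failure wait base_delay * 2**attempt seconds, then retry."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # out of retries: let the last error propagate
            # wait 1s, 2s, 4s, ... with the default base_delay
            sleep(base_delay * 2 ** attempt)
```

Any callable that performs the request can be passed in as task, for instance a lambda wrapping an HTTP call that raises on failure. In real scrapers you would usually catch only network-related exceptions rather than every Exception.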





While you can implement simple retry strategies manually with custom code, many HTTP clients come with built-in utilities or companion libraries that handle retries automatically. For example, Axios can be paired with the axios-retry library, which you can use like this:

    const axios = require("axios");
    // depending on the installed axios-retry version, CommonJS may need
    // require("axios-retry").default instead
    const axiosRetry = require("axios-retry");

    // retry failed requests up to 3 times, waiting exponentially longer each time
    axiosRetry(axios, { retries: 3, retryDelay: axiosRetry.exponentialDelay });

    axios.get("https://example.com")
      .then(response => console.log(response.data))
      .catch(error => console.log("Request failed:", error));





Similarly, Python's urllib3 package comes with a Retry class that integrates seamlessly with most Python HTTP clients.
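One common way to plug urllib3's Retry class into requests is through an HTTPAdapter mounted on a session. The sketch below assumes that requests and urllib3 are installed; the helper name build_retrying_session, the retry count, the backoff factor, and the list of status codes are illustrative choices, not recommendations from the original article.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_retrying_session():
    """Return a requests.Session that retries failed requests with backoff."""
    retry = Retry(
        total=3,  # give up after 3 retries
        backoff_factor=1,  # wait progressively longer between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # also retry these statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # apply the retry policy to both plain and TLS connections
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

After that, session.get(url) behaves like requests.get(url) but retries transparently according to the policy above.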

Write Generic Selectors

When inspecting elements in DevTools, you might be tempted to right-click and select the "Copy selector" option:









But be warned, the result might look something like this:

#__next > div > main > div.sc-d7dc08c8-0.fGqCtJ > div.sc-93e186d7-0.eROqxA > h1





The problem? Overly specific selectors like this one break easily when the page structure changes. The more detailed your selector, the more fragile it becomes.





To make your web scraping more resilient, keep your selectors flexible. Instead of relying on style-related classes (which change all the time), focus on attributes that are less likely to change, like id, data-, or aria-. Most of those attributes are meant for testing and accessibility, so they tend to remain stable over time. 💡
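The difference can be seen in a small experiment. This sketch assumes BeautifulSoup is installed and uses made-up markup: the data-testid attribute, the product title, and the renamed class are all hypothetical, chosen only to simulate a redesign that regenerates styling classes.

```python
from bs4 import BeautifulSoup

# Hypothetical page: hashed styling classes plus an assumed stable data-testid hook
ORIGINAL_HTML = """
<div id="__next"><div><main>
  <div class="sc-d7dc08c8-0 fGqCtJ">
    <div class="sc-93e186d7-0 eROqxA">
      <h1 data-testid="product-title">Wireless Mouse</h1>
    </div>
  </div>
</main></div></div>
"""
# Simulated redesign: one auto-generated class name changes
REDESIGNED_HTML = ORIGINAL_HTML.replace("fGqCtJ", "kLmNoP")

BRITTLE = "#__next > div > main > div.sc-d7dc08c8-0.fGqCtJ > div.sc-93e186d7-0.eROqxA > h1"
ROBUST = 'h1[data-testid="product-title"]'


def title_text(html, selector):
    """Return the text of the first element matching selector, or None."""
    node = BeautifulSoup(html, "html.parser").select_one(selector)
    return node.get_text(strip=True) if node else None


print(title_text(ORIGINAL_HTML, BRITTLE))    # both selectors work on the original page
print(title_text(REDESIGNED_HTML, BRITTLE))  # the copied selector no longer matches
print(title_text(REDESIGNED_HTML, ROBUST))   # the attribute selector still matches
```

Both selectors find the heading on the original markup, but only the attribute-based one survives the class rename.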





And while CSS selectors are easier to read and understand, XPath offers more power. But don't worry: you can often achieve the same results with simple CSS selectors, which saves you from writing complex XPath code. 😌





For more information on that, check out our guide on XPath vs CSS selectors!

Prefer Fast HTML Parsers

Parsing HTML pages takes time and resources, particularly if you are dealing with a large, deeply nested DOM. If your scraper only parses a few pages, it is not such a big deal.





For a deeper understanding, check out these resources:

Want a full comparison? Read our article on the best HTML parsers.





The good news? Switching from one parser to another is not that difficult. For example, in BeautifulSoup, it is just a simple parameter change:

    from bs4 import BeautifulSoup

    html_content = "<html><body><h1>Hello</h1></body></html>"  # sample input

    # using Python's built-in html.parser
    soup = BeautifulSoup(html_content, "html.parser")

    # or using the lxml parser (requires the lxml package)
    soup = BeautifulSoup(html_content, "lxml")





And what about the HTML parsers built into browsers like Chrome? 🤔





Find out more in the video below:

Harness HTTP/2 for Faster Requests

HTTP/2 is an updated version of HTTP that allows multiple requests over a single connection. This reduces latency and can improve the overall performance of your scraping task.





To check whether a site supports HTTP/2, simply open DevTools in your browser, go to the "Network" tab, and look for the "Protocol" column. If it says h2, the site is using HTTP/2:









Unfortunately, not all HTTP clients and scraping libraries support HTTP/2. However, tools like HTTPX for Python offer full support for HTTP/2.

Task Parallelization

The solution? Parallelism or concurrency!





By sending multiple requests at the same time, you can reduce those dead times and make better use of the network.
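One simple way to send several requests at once in Python is a thread pool from the standard library. The sketch below is only an illustration under assumptions: fetch_html and fetch_concurrently are hypothetical helper names, and max_workers=5 is an arbitrary value that should be tuned so the target site is not overloaded.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_html(url, timeout=10):
    """Download a page with the standard library and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_concurrently(urls, fetch=fetch_html, max_workers=5):
    """Fetch URLs in a thread pool; return ({url: html}, {url: error})."""
    pages, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # submit every download up front so they overlap while waiting on the network
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                pages[url] = future.result()
            except Exception as err:
                # one failed download must not cancel the others
                failures[url] = repr(err)
    return pages, failures
```

Threads suit this job because the work is mostly waiting for responses, so the downloads overlap instead of running one after another.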





Pro tip: You can also parallelize parsing tasks, especially if you have multiple CPUs available, which will speed up the data extraction process. ⚡

Embrace AI-based Adaptive Algorithms

Learn more in Chapter 4 of this video by Forrest Knight:

The Best Tool for Web Scraping Optimization

Final Thoughts

Next stop? Harnessing the power of AI-driven proxy management! 🌐