❗ Disclaimer: This is Part 3 of our six-part series on Advanced Web Scraping. New to the series? Start from the beginning by reading Part 1!
In Part 2 of our Advanced Web Scraping series, you learned how to scrape data from SPAs, PWAs, and AI-powered sites. By now, you should have all the knowledge you need to build a scraper that works against most modern websites.
What's next? Time to optimize your scraper with some pro scraping tips and tricks!
Building a web scraper? It's all about scripting. 👨‍💻
And let's be honest: if you've ever written code, you know scripting isn't that hard most of the time. A few lines here, a for there, and boom, you're scraping data like a pro. Sounds simple, right? 😄
But here's the problem: the simplicity of writing a small scraper can lull you into a false sense of security. Why bother with proper comments, error handling, logs, or even neat indentation when it's just a dozen lines of code anyone can read?
We get it. Why overengineer something that doesn't need it? Overengineering is the enemy of progress. But what happens when you need to scale your scrapers to multiple pages or even entire sites? 🤔
That's when your quick-and-dirty, spaghetti-coded scraper falls apart! 🍝
Here's why you need some advanced web scraping tips.
You must have already heard the usual web scraping tips: prioritize pages with critical data first, randomize your requests, and so on. Great advice, but let's be honest, those tricks are old news. 📰
When you're dealing with more advanced scenarios, those basics might not cut it. If you really want to level up your scraping game, you'll need to explore some next-level techniques.
Ready? Buckle up, because it's time to take your web scraping skills to the next level! 💪
⚠️ Warning: Don't worry if some of the tips feel familiar. Keep going! There are plenty of interesting insights as you dive deeper! 🤿
One of the most common mistakes in web scraping is forgetting that the Internet isn't some magical, infallible technology. When you send a request to a site, a whole range of things can (and will, at some point) go wrong. ❌
Let's look at some common scenarios:
Your Wi-Fi or connection might drop momentarily
The server hosting the website may be unavailable
The page you're looking for may no longer exist
The target site might be experiencing a temporary slowdown, leading to a timeout error
Now, mix in data parsing, preprocessing, and exporting to a database, and you've got a perfect recipe for chaos. 💥
So, what's the solution? Error handling! 🛡️
Error handling is your best friend in web scraping. Your script will likely process dozens (or thousands) of pages, and a single error shouldn't bring your whole operation crashing down.
Remember that the try ... catch block (try ... except in Python) is your friend. Use it to wrap your requests and processing logic. Also, keep in mind that many HTTP libraries don't raise exceptions for bad HTTP responses (like 404 or 500) by default. 😲
If you're not familiar with HTTP status codes, see the video below:
For example, in Python's requests library you need to manually check the response status code as follows:
    import requests

    response = requests.get("https://example.com", timeout=10)
    if response.status_code == 200:
        # handle the successful response...
        pass
    else:
        # handle the error response...
        pass
Or, equivalently, use the raise_for_status() method:
    import requests

    try:
        response = requests.get("https://example.com", timeout=10)
        # raises an HTTPError for bad responses (4xx or 5xx)
        response.raise_for_status()
        # handle the successful response...
    except requests.exceptions.HTTPError as http_err:
        # handle an HTTP error...
        print(f"HTTP error: {http_err}")
    except requests.exceptions.RequestException as req_err:
        # handle a request error (connection problem, timeout, ...)
        print(f"Request error: {req_err}")
Your advanced web scraping script should not only be able to handle errors but also recover from them. Since most errors related to web scraping are tied to making web requests, you can significantly improve your scraper's effectiveness by implementing retryable requests.
The concept is simple: if a request fails, you try it again (one, two, three, or more times) until it succeeds. 🔄
But here's the catch: since one of the most common reasons for a failed request is that the target server is temporarily down or slow, you don't want to overwhelm it by sending the same request repeatedly in a short period of time.
If a request fails right now, it's likely to fail again immediately after. That's where exponential backoff comes into play!
Instead of retrying instantly, this technique gradually increases the interval between retries, improving your chances of success by giving the target server time to recover. ⏳
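To make the idea concrete, here's a minimal sketch of exponential backoff using only the Python standard library. The helper name retry_with_backoff, its parameters, and the simulated flaky_request function are all hypothetical and exist purely for illustration. The delays (1s, 2s, 4s, and so on, plus a bit of random jitter) are assumptions you'd tune for your target site.

```python
import random
import time


def retry_with_backoff(operation, max_retries=3, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call operation() and retry on failure with exponentially growing delays.

    Hypothetical helper: the wait before retry number `attempt` is
    base_delay * 2**attempt, capped at max_delay, plus up to 10% jitter.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:  # in a real scraper, catch only network-related errors
            if attempt == max_retries:
                raise  # out of retries: let the caller see the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            delay += random.uniform(0, delay * 0.1)  # jitter avoids synchronized retries
            sleep(delay)


# Simulated request that fails twice before succeeding (no network involved)
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky_request, sleep=delays.append)
print(result, delays)
```

Passing sleep=delays.append keeps the demo instant. In a real scraper you'd leave the default time.sleep in place and wrap your actual request call.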
While you can implement simple retry strategies manually with custom code, many HTTP clients come with built-in utilities or companion libraries to handle retries automatically. For instance, Axios can be paired with the axios-retry library, which you can use like this:
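    const axios = require("axios");
    // depending on the axios-retry version, CommonJS may need require("axios-retry").default
    const axiosRetry = require("axios-retry");

    // retry failed requests up to 3 times, with exponentially growing delays
    axiosRetry(axios, { retries: 3, retryDelay: axiosRetry.exponentialDelay });

    axios.get("https://example.com")
      .then(response => console.log(response.data))
      .catch(error => console.error("Request failed:", error));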
Similarly, Python's urllib3 package comes with a Retry class that integrates seamlessly with most Python HTTP clients.
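As a rough sketch of that integration, the snippet below plugs urllib3's Retry into a requests session through an HTTPAdapter. The build_session helper is hypothetical, and the retry count, backoff factor, and status list are illustrative assumptions rather than recommended values.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total_retries: int = 3, backoff_factor: float = 1.0) -> requests.Session:
    """Return a session that retries failed requests (hypothetical helper)."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,  # waits grow exponentially between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # also retry on these status codes
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


session = build_session()
# session.get("https://example.com", timeout=10) would now be retried automatically
```

Every request made through this session inherits the retry policy, so the rest of your scraping code doesn't need its own retry loops.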
When inspecting elements in DevTools, you might be tempted to right-click and select the "Copy selector" option:
But be warned, the result might look something like this:
#__next > div > main > div.sc-d7dc08c8-0.fGqCtJ > div.sc-93e186d7-0.eROqxA > h1
That's definitely not ideal for web scraping.
The problem? Overly specific selectors like this one can break easily when the page structure changes. The more detailed your selector, the more fragile it becomes.
To make your web scraping more resilient, you must keep your selectors flexible. Instead of relying on style-related classes (which change all the time), focus on attributes that are less likely to change, like id, data-, or aria-. Most of those attributes are meant for testing and accessibility, so they tend to remain consistent over time. 💡
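Here's a small sketch of the difference, using BeautifulSoup with its built-in html.parser. The two HTML snippets, the data-testid value, and the extract_title helper are made up for illustration. They simulate a redesign where the auto-generated class names change but the test attribute stays put.

```python
from bs4 import BeautifulSoup

# Hypothetical page before and after a redesign: only the generated classes differ
OLD_HTML = """
<main>
  <div class="sc-93e186d7-0 eROqxA">
    <h1 data-testid="product-title">Wireless Mouse</h1>
  </div>
</main>
"""
NEW_HTML = """
<main>
  <div class="sc-a1b2c3d4-0 kLmNoP">
    <h1 data-testid="product-title">Wireless Mouse</h1>
  </div>
</main>
"""

BRITTLE_SELECTOR = "div.sc-93e186d7-0.eROqxA > h1"      # tied to styling classes
ROBUST_SELECTOR = 'h1[data-testid="product-title"]'     # tied to a stable attribute


def extract_title(html, selector):
    """Return the text of the first match, or None if nothing matches."""
    node = BeautifulSoup(html, "html.parser").select_one(selector)
    return node.get_text(strip=True) if node else None


brittle_old = extract_title(OLD_HTML, BRITTLE_SELECTOR)
brittle_new = extract_title(NEW_HTML, BRITTLE_SELECTOR)
robust_old = extract_title(OLD_HTML, ROBUST_SELECTOR)
robust_new = extract_title(NEW_HTML, ROBUST_SELECTOR)
print(brittle_old, brittle_new, robust_old, robust_new)
```

The class-based selector silently stops matching after the redesign, while the attribute-based one keeps working on both versions of the page.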
And while CSS selectors are easy to read and understand, XPath offers more power. But don't worry: you can often achieve the same results with simple CSS selectors, saving you from needing complex XPath code. 😌
For more information on that, take a look at our guide on XPath vs CSS selectors!
Parsing HTML pages takes time and resources, especially when you're dealing with a large, deeply nested DOM. If your scraper parses only a few pages, it's not such a big deal.
Now, what happens when your scraping operation scales up and you have to retrieve data from millions of pages? That small overhead can quickly drain server resources and add hours to your total scraping time. ⏳
To get a deeper understanding, check out these resources:
Looking for a complete comparison? Read our article on the best HTML parsers.
The good news? Switching from one parser to another isn't that difficult. For example, in BeautifulSoup, it's just a simple parameter change:
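    from bs4 import BeautifulSoup

    html_content = "<html><body><h1>Hello</h1></body></html>"  # sample input

    # using Python's built-in html.parser
    soup = BeautifulSoup(html_content, "html.parser")

    # or using the lxml parser (requires: pip install lxml)
    soup = BeautifulSoup(html_content, "lxml")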
And what about the HTML parsers built into browsers like Chrome? 🤔
Find out more in the video below:
HTTP/2 is an updated version of HTTP that multiplexes multiple requests over a single connection. This reduces latency and can improve the overall performance of your scraping task.
To check whether a site supports HTTP/2, simply open DevTools in your browser, go to the "Network" tab, and look at the "Protocol" column. If it says h2, the site is using HTTP/2:
Unfortunately, not all HTTP clients and scraping libraries support HTTP/2. However, tools like Python's HTTPX offer full support for HTTP/2.
Web scraping is mostly an I/O-bound task: you send requests to the server, wait for the response, process the data, and repeat. During the wait time, your scraper is basically idle, which is inefficient.
The solution? Parallelism or concurrency!
By sending multiple requests at once, you can minimize those dead times and optimize network usage.
🚨 But be careful! 🚨
Bombarding a server with too many simultaneous requests can lead to rate limiting or getting your IP banned, two popular anti-scraping measures. 😬
Pro tip: You can also parallelize parsing tasks, especially if you're using multiple CPUs, which will speed up the data extraction process. ⚡
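One way to get concurrency without hammering the target server is to cap the number of in-flight requests. The sketch below does that with asyncio and a semaphore. The fetch function is a hypothetical stand-in that only simulates latency (swap in a real async HTTP call), and the limit of 5 is an assumption you'd tune to what the site tolerates.

```python
import asyncio

MAX_CONCURRENCY = 5  # assumed limit; tune it to what the target site tolerates


async def fetch(url: str) -> str:
    """Hypothetical stand-in for a real async HTTP request."""
    await asyncio.sleep(0.01)  # simulates network latency
    return f"<html>content of {url}</html>"


async def scrape_all(urls: list[str], limit: int = MAX_CONCURRENCY) -> list[str]:
    """Fetch all URLs concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded_fetch(url: str) -> str:
        async with semaphore:  # waits here while `limit` requests are running
            return await fetch(url)

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded_fetch(url) for url in urls))


urls = [f"https://example.com/page/{i}" for i in range(20)]
pages = asyncio.run(scrape_all(urls))
print(len(pages))
```

The semaphore keeps the scraper busy during wait times while still bounding the load on the server, which lowers the odds of tripping rate limits.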
AI-based adaptive algorithms learn from data patterns and HTML page structures, adjusting their behavior in real time to stay on top of changes. 😮
That's a game-changer for web scraping! 🤯
When websites update their layout or deploy anti-bot measures, these algorithms can quickly adapt, ensuring your scraper keeps running smoothly. 🧠
In short, they make scrapers smarter, helping you extract data efficiently even when the site throws unexpected curveballs. ⚾ With adaptive algorithms, it's like having a scraper that evolves over time!
Learn more in Chapter 4 of this video by Forrest Knight:
Sure, all the tips and tricks we've covered so far can make your scraper faster, more reliable, more robust, and more effective. But let's be real: they also bring a lot of complexity. 😅
The good news is that most of these lessons apply to the vast majority of scraping projects. So, instead of coding everything from scratch, you can use pre-built functions to tackle specific tasks. That's exactly what Bright Data's Scraping Functions offer!
With 73+ ready-made JavaScript functions, users have built over 38K scrapers operating across 195+ countries. That's a ton of scraping power! 📈
Speed up your development with a runtime environment designed to scrape, unlock, and scale web data collection with ease:
Now you know how to level up your scraper with insights from experienced scraping developers!
Remember that this is only Part 3, so we're just halfway through our six-part journey into advanced web scraping! Keep that seatbelt fastened, because we're about to dive into even more cutting-edge tech, clever solutions, and insider tips.
Next stop? Harnessing the power of AI-driven proxy management! 🌐