❗ Disclaimer: This is Part 3 of our six-part series on Advanced Web Scraping. New to the series? Catch up from the beginning by reading Part 1!

In Part 2 of our Advanced Web Scraping series, you learned how to scrape data from SPAs, PWAs, and AI-powered sites. By now, you should have all the knowledge you need to build a scraper that works against most modern websites.

What's next? Time to optimize your scraper with some pro scraping tips and tricks!

Web Scraping Made Simple: Or Is It?

Building a web scraper? It's all about scripting. 👨‍💻

And let's be honest: if you've ever written code, you know scripting isn't that hard most of the time. A few lines here, a for loop there, and boom, you're extracting data like a pro. Sounds simple, right? 😄

But here's the problem: the simplicity of writing a small scraper can lull you into a false sense of security. Why bother with proper comments, error handling, logs, or even neat indentation when it's just a dozen lines of code that anyone can read?

We get it. Why overengineer something that doesn't need it? Overengineering is the enemy of progress. But what happens when you need to scale your scrapers to multiple pages or even entire sites? 🤔

That's when your quick-and-dirty, spaghetti-coded scraper falls apart! 🍝 Here's why you need some advanced web scraping tips.

Optimizing Web Scraping: Top 7 Tips and Tricks

You must have already heard the usual web scraping tips: prioritize pages with critical data first, randomize your requests, and so on. Great advice, but let's be honest, those tricks are old news. 📰 When you're dealing with more advanced scenarios, those basics might not cut it.
If you really want to level up your scraping game, you'll need to explore some next-level techniques. Ready? Buckle up, because it's time to take your web scraping skills to the next level! 💪

⚠️ Warning: Don't worry if some of the tips feel familiar. Keep going! There are plenty of interesting insights as you dive deeper! 🤿

Implement Error Handling

One of the most common mistakes in web scraping is forgetting that the Internet is not a magical, infallible technology. When you send a request to a site, a whole range of things can (and, at some point, will) go wrong. ❌

Let's look at some common scenarios:

- Your Wi-Fi or connection might hiccup momentarily
- The server hosting the website may be unavailable
- The page you're looking for may no longer exist
- The target site might be experiencing a temporary slowdown, leading to a timeout error

Now, mix in data parsing, preprocessing, and exporting to a database, and you've got a perfect recipe for chaos. 💥

So, what's the solution? Error handling! 🛡️

Error handling is your best friend in web scraping. Your script will likely process dozens (or thousands) of pages, and a single error shouldn't bring your whole operation crashing down.

Remember that the try ... catch block (try ... except in Python) is your friend. Use it to wrap your requests and processing logic. Also, keep in mind that many HTTP libraries don't raise exceptions for bad HTTP responses (like 404 or 500). 😲

If you're not familiar with HTTP status codes, see the video below:

https://www.youtube.com/watch?v=wJa5CTIFj7U&embedable=true

For instance, in Python's requests library you need to check the response status code yourself, as follows:

    import requests

    # a timeout keeps the call from hanging forever on a stalled server
    response = requests.get("https://example.com", timeout=10)
    if response.status_code == 200:
        # handle the successful response...
        print(f"Fetched {len(response.text)} characters")
    else:
        # handle the error response...
        print(f"Request failed with status {response.status_code}")
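To make the "one error shouldn't bring everything down" idea concrete, here is a minimal sketch of a scraping loop that records failures and keeps going. The names scrape_all and fake_fetch, and the example URLs, are hypothetical and not part of any library; fake_fetch merely stands in for a real HTTP call so the snippet runs offline.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def scrape_all(
    urls: Iterable[str],
    fetch_page: Callable[[str], str],
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Fetch every URL, collecting failures instead of stopping at the first one."""
    results: Dict[str, str] = {}
    failures: List[Tuple[str, str]] = []
    for url in urls:
        try:
            results[url] = fetch_page(url)
        except Exception as err:  # broad on purpose: record the error and move on
            failures.append((url, f"{type(err).__name__}: {err}"))
    return results, failures


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for a real HTTP request (assumption for this sketch)."""
    if "missing" in url:
        raise LookupError("404 Not Found")
    return f"<html>{url}</html>"


results, failures = scrape_all(
    [
        "https://example.com/a",
        "https://example.com/missing",
        "https://example.com/b",
    ],
    fake_fetch,
)
print(f"{len(results)} pages scraped, {len(failures)} failed")
for url, reason in failures:
    print(f"  {url} -> {reason}")
```

In a real scraper you would probably catch narrower exception types and write the failures to a log, so the failed URLs can be revisited later.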
Instead of checking the status code by hand, you can call the raise_for_status() method:

    import requests

    try:
        response = requests.get("https://example.com", timeout=10)
        # raises an HTTPError for bad responses (4xx or 5xx)
        response.raise_for_status()
        # handle the successful response...
        print(f"Fetched {len(response.text)} characters")
    except requests.exceptions.HTTPError as http_err:
        # handle an HTTP error...
        print(f"HTTP error: {http_err}")
    except requests.exceptions.RequestException as req_err:
        # handle a request error (connection problem, timeout, ...)
        print(f"Request error: {req_err}")

Failure Recovery with Request Retries

Your advanced web scraping script should not only be able to handle errors but also recover from them. Since most errors in web scraping are tied to making web requests, you can significantly improve your scraper's effectiveness by implementing retryable requests.

The concept is simple: if a request fails, you try it again (once, twice, three times, or more) until it succeeds or you hit a retry limit. 🔄

But here's the catch: since one of the most common reasons for a failed request is that the target server is temporarily down or slow, you don't want to overwhelm it by sending the same request over and over in a short period of time. If a request fails now, it's likely to fail again right away.

That's where exponential backoff comes into play! Instead of retrying instantly, this technique gradually increases the interval between retries, improving your chances of success by giving the target server time to recover. ⏳

While you can implement simple retry strategies manually with custom code, many HTTP clients come with built-in utilities or companion libraries to handle retries automatically.
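For the manual route, a hand-rolled retry with exponential backoff could look roughly like the sketch below. The helper retry_with_backoff, its parameters, and the flaky_request demo are all hypothetical names made up for illustration; the injectable sleep argument is only there so the demo runs instantly.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying failures with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # wait 1x, 2x, 4x, ... the base delay, plus a little random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay * 0.1)
            sleep(delay)
    # only reached if max_retries is negative
    raise RuntimeError("max_retries must be >= 0")


# Demo with a hypothetical request that fails twice before succeeding.
attempts = {"count": 0}
delays: List[float] = []


def flaky_request() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("server temporarily unavailable")
    return "<html>ok</html>"


# delays.append replaces time.sleep, so the waits are recorded rather than slept
html = retry_with_backoff(flaky_request, base_delay=0.01, sleep=delays.append)
print(html, attempts["count"], len(delays))
```

In a real scraper, operation would wrap the actual HTTP call, and you would likely retry only on errors that are worth retrying (timeouts, connection errors, 5xx responses) rather than on every exception.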
Among the ready-made options, Axios has the axios-retry library, which you can use as follows:

    const axios = require("axios");
    // note: with axios-retry v4+, CommonJS callers may need
    // require("axios-retry").default instead
    const axiosRetry = require("axios-retry");

    // retry failed requests up to 3 times, with exponentially growing delays
    axiosRetry(axios, { retries: 3, retryDelay: axiosRetry.exponentialDelay });

    axios.get('https://example.com')
      .then(response => console.log(response.data))
      .catch(error => console.log("Request failed:", error));

Similarly, Python's urllib3 package comes with a Retry class that plugs into HTTP clients built on top of it, such as requests.

Write Generic Selectors

When inspecting elements in DevTools, you might be tempted to right-click and select the "Copy selector" option.

But be warned, the result might look something like this:

    #__next > div > main > div.sc-d7dc08c8-0.fGqCtJ > div.sc-93e186d7-0.eROqxA > h1

That's definitely not ideal for web scraping.

The problem? Overly specific selectors like these can break easily when the page structure changes. The more detailed your selector, the more fragile it becomes.

To make your web scraping more resilient, you must keep your selectors flexible. Instead of relying on style-related classes (which change all the time), focus on attributes that are less likely to change, like id, data-, or aria-. Most of those attributes are meant for testing and accessibility, so they tend to remain consistent over time. 💡

And while CSS selectors are easier to read and understand, XPath offers more power. But don't worry: you can often achieve the same results with simple CSS selectors, saving you from needing complex XPath code. 😌

For more information on that, take a look at our guide on XPath vs CSS selectors!

Prefer Fast HTML Parsers

Parsing HTML pages takes time and resources, especially if you're dealing with a large, deeply nested DOM.
If your scraper only parses a few pages, that's not such a big deal. Now, what happens when your scraping operation scales up and you have to retrieve data from millions of pages? That small overhead can quickly drain server resources and add hours to your total scraping time. ⏳

To get a deeper understanding, refer to these resources:

- Python HTML parser performance comparison
- Benchmark of JavaScript libraries for HTML parsing
- HTML Parsers Benchmark

Looking for a full comparison? Read our article on the best HTML parsers.

The good news? Switching from one parser to another isn't that difficult. For example, in BeautifulSoup, it's just a simple parameter change:

    from bs4 import BeautifulSoup

    html_content = "<html><body><h1>Hello</h1></body></html>"  # sample markup

    # using Python's built-in html.parser
    soup = BeautifulSoup(html_content, "html.parser")

    # or using the lxml parser (requires the lxml package to be installed)
    soup = BeautifulSoup(html_content, "lxml")

And what about the HTML parsers built into browsers like Chrome? 🤔

Find out more in the video below:

https://www.youtube.com/watch?v=LLRig4s1_yA&embedable=true

Harness HTTP/2 for Faster Requests

HTTP/2 is an updated version of HTTP that allows multiple requests over a single connection. This reduces latency and can improve the overall performance of your scraping task.

To check whether a site supports HTTP/2, simply open DevTools in your browser, go to the "Network" tab, and look for the "Protocol" column. If it says h2, the site is using HTTP/2.

Unfortunately, not all HTTP clients and scraping libraries support HTTP/2. However, tools like Python's HTTPX offer full HTTP/2 support (in HTTPX it is opt-in: install the http2 extra and create the client with http2=True).

Task Parallelization

Web scraping is mostly an I/O-bound task: you send requests to the server, wait for the response, process the data, and repeat. During the wait time, your scraper is basically idle, which is inefficient.

The solution?
Parallelism or concurrency! By sending multiple requests at once, you can minimize those dead times and make better use of the network.

🚨 But be careful! 🚨 Bombarding a server with too many simultaneous requests can lead to rate limiting or to your IP getting banned, two popular anti-scraping measures. 😬

Pro tip: You can also parallelize parsing tasks, especially if you have multiple CPUs available, which will speed up the data extraction process. ⚡

Embrace AI-based Adaptive Algorithms

AI-based adaptive algorithms learn from patterns in data and HTML page structures, adjusting their behavior in real time to stay on top of changes. 😮

That's a game-changer for web scraping! 🤯

When websites update their layout or deploy anti-bot measures, these algorithms can quickly adapt, ensuring your scraper keeps running smoothly. 🧠

In short, they make scrapers smarter, helping you extract data efficiently, even when the site throws unexpected curveballs. ⚾ With adaptive algorithms, it's like having a scraper that evolves over time!

Learn more in Chapter 4 of this video by Forrest Knight:

https://www.youtube.com/watch?v=vxk6YPRVg_o&embedable=true

The Best Tool for Web Scraping Optimization

Sure, all the tips and tricks we've mentioned so far can make your scraper faster, more reliable, more robust, and more effective. But let's be real: they also bring a lot of complexity. 😅

The good news is that most of these lessons apply to the vast majority of scraping projects. So, instead of coding everything from scratch, you could use pre-built functions to tackle specific tasks. That is exactly what you get with
Bright Data's Scraping Functions!

With 73+ ready-made JavaScript functions, users have built over 38K scrapers operating across 195+ countries. That's a ton of scraping power! 📈

Speed up your development with a runtime environment designed to scrape, unlock, and scale web data collection with ease:

https://www.youtube.com/watch?v=Ve04_6gDKvU&embedable=true

Final Thoughts

Now you know how to level up your scraper with insights from experienced scraping developers!

Remember that this is only Part 3, so we're just halfway through our six-part journey into advanced web scraping! Keep that seatbelt fastened, because we're about to dive into even more cutting-edge tech, clever solutions, and insider tips.

Next stop? Harnessing the power of AI-driven proxy management! 🌐