While some websites are easy to scrape with Selenium, Puppeteer, and similar tools, others that implement advanced security measures such as CAPTCHAs and IP bans can prove difficult. To overcome these restrictions and make sure you can scrape 99% of websites for free with the scraper you'll build in this article, you'll integrate a scraping browser into your code.
However, collecting data is only one step; what you do with that data matters just as much, if not more. Often, that means manually sifting through huge volumes of information. But what if you could automate this process? By leveraging a large language model (LLM), you can not only collect data but also query it to extract meaningful insights, saving time and effort.
In this guide, you'll learn how to combine web scraping with AI to build a powerful tool for collecting and analyzing data at scale, for free. Let's dive in!
Before you begin, make sure you have the prerequisites for this tutorial in place. Then follow the steps below to set up your environment and get ready to build the AI-powered scraper.
First, set up a project environment to manage your work. This ensures you have a dedicated space for everything the project needs.
Create a new project directory:
Open your terminal (or Command Prompt/PowerShell on Windows) and create a new directory for your project:
mkdir ai-website-scraper
cd ai-website-scraper
Create a virtual environment:
Run the following command to create the virtual environment:
On Windows:
python -m venv venv
On macOS/Linux:
python3 -m venv venv
This creates a venv directory that stores the virtual environment.
Activate the virtual environment to start working inside it:
On Windows:
.\venv\Scripts\activate
On macOS/Linux:
source venv/bin/activate
Your terminal prompt changes to show (venv), confirming that you're inside the virtual environment.
Next, install the libraries your project needs. Create a requirements.txt file in your project directory and add the following:
streamlit
selenium
beautifulsoup4
langchain
langchain-ollama
lxml
html5lib
These packages are essential for scraping, processing the data, and building the UI:
streamlit: Used to create the interactive user interface.
selenium: For scraping the website content.
beautifulsoup4: For parsing and cleaning the HTML.
langchain and langchain-ollama: For connecting to the Ollama LLM and processing text.
lxml and html5lib: For advanced HTML parsing.
Install the dependencies by running the following command (make sure you're in the directory where the file is located before running it):
pip install -r requirements.txt
Create a file named ui.py in your project directory. This script defines the UI for your scraper. Use the code below to structure your app:
import streamlit as st
import pathlib
from main import scrape_website


# Function to load CSS from the assets folder
def load_css(file_path):
    with open(file_path) as f:
        st.html(f"<style>{f.read()}</style>")


# Load the external CSS
css_path = pathlib.Path("assets/style.css")
if css_path.exists():
    load_css(css_path)

st.title("AI Scraper")
st.markdown(
    "Enter a website URL to scrape, clean the text content, and display the result in smaller chunks."
)

url = st.text_input(label="", placeholder="Enter the URL of the website you want to scrape")

if st.button("Scrape", key="scrape_button"):
    st.write("Scraping the website...")
    result = scrape_website(url)
    st.write("Scraping complete.")
    st.write(result)
You can learn more about Streamlit components in the Streamlit documentation.
To style your app, create an assets directory inside your project directory and add a style.css file to it. Customize the Streamlit interface with CSS:
.stAppViewContainer {
    background-image: url("https://images.unsplash.com/photo-1732979887702-40baea1c1ff6?q=80&w=2832&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D");
    background-size: cover;
    color: black;
}

.stAppHeader {
    background-color: rgba(0, 0, 0, 0);
}

.st-ae {
    background-color: rgba(233, 235, 234, 0.895);
}

.st-emotion-cache-ysk9xe {
    color: black;
}

.st.info, .stAlert {
    background-color: black;
}

.st-key-scrape_button button {
    display: inline-block;
    padding: 10px 20px;
    font-size: 16px;
    color: #fff;
    background-color: #007bff;
    border: none;
    border-radius: 5px;
    cursor: pointer;
    animation: pulse 2s infinite;
}

.st-key-scrape_button button:hover {
    background-color: #0056b3;
    color: #fff;
}
From your project directory, run the following command:
streamlit run ui.py
This starts a local server, and you should see a URL in the terminal, usually http://localhost:8501. Open this URL in your browser to interact with the web app.
Next, write the code to extract the HTML content of any webpage using Selenium. For the code to work, though, you need a Chrome WebDriver.
Selenium requires a WebDriver to interact with web pages. Here's how to set it up:
After downloading ChromeDriver, extract the archive, locate the application file named "chromedriver", and place it in your project directory.
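If you'd rather not manage the driver binary by hand, the webdriver-manager package (an extra dependency, not part of this tutorial's requirements.txt) can download a matching driver at runtime. A minimal sketch:

pip install webdriver-manager

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Downloads (and caches) a ChromeDriver build that matches your installed Chrome
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

Recent Selenium versions (4.6+) can also resolve a driver automatically via Selenium Manager if you don't pass a path at all; the manual setup above simply keeps everything explicit.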
Once that's done, create a new file called main.py and add the code below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


# Function to scrape HTML from a website
def scrape_website(website_url):
    # Path to WebDriver
    webdriver_path = "./chromedriver"  # Replace with your WebDriver path
    service = Service(webdriver_path)
    driver = webdriver.Chrome(service=service)

    try:
        # Open the website
        driver.get(website_url)

        # Wait for the page to fully load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )

        # Extract the HTML source
        html_content = driver.page_source
        return html_content
    finally:
        # Ensure the browser is closed after scraping
        driver.quit()
Save and run the code; you should see all the HTML of the page you scraped displayed in your Streamlit app, like this:
While you can now retrieve a page's HTML, the code above won't work on sites with advanced anti-scraping mechanisms such as CAPTCHA challenges or IP bans. For example, scraping a site like Indeed or Amazon with Selenium may just land you on a CAPTCHA page that blocks access. This happens because the website detects that a bot is trying to reach its content. If the behavior persists, the site may block your IP address, preventing further access.
To get around this, set up the Bright Data Scraping Browser:
Sign up: go to the Bright Data website and create an account.
After signing in, click on "Get Proxy Products".
Click the "Add" button and select "Scraping Browser".
Next, you'll be taken to the "Add zone" page, where you'll be asked to choose a name for your new scraping browser proxy zone. After that, click "Add".
Once that's done, your proxy zone credentials will be created. You'll need these details in your script to bypass whatever anti-scraping mechanisms are used on a given website.
You can also check Bright Data's developer documentation for more details about the Scraping Browser.
In your main.py file, replace the code with the following. You'll notice that this code is cleaner and shorter than the previous version.
from selenium.webdriver import Remote, ChromeOptions
from selenium.webdriver.chromium.remote_connection import ChromiumRemoteConnection

AUTH = '<username>:<password>'
SBR_WEBDRIVER = f'https://{AUTH}@brd.superproxy.io:9515'


# Function to scrape HTML from a website
def scrape_website(website_url):
    print("Connecting to Scraping Browser...")
    sbr_connection = ChromiumRemoteConnection(SBR_WEBDRIVER, "goog", "chrome")

    with Remote(sbr_connection, options=ChromeOptions()) as driver:
        driver.get(website_url)

        print("Waiting for captcha to solve...")
        solve_res = driver.execute(
            "executeCdpCommand",
            {
                "cmd": "Captcha.waitForSolve",
                "params": {"detectTimeout": 10000},
            },
        )
        print("Captcha solve status:", solve_res["value"]["status"])

        print("Navigated! Scraping page content...")
        html = driver.page_source
        return html
Replace <username> and <password> with your scraping browser username and password.
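Hard-coding credentials in source is easy to leak. One common alternative (my suggestion, not something this tutorial requires; the variable name BRIGHTDATA_AUTH is hypothetical) is to read them from an environment variable instead:

import os

# Hypothetical env var holding "<username>:<password>"; set it in your shell first
AUTH = os.environ["BRIGHTDATA_AUTH"]
SBR_WEBDRIVER = f'https://{AUTH}@brd.superproxy.io:9515'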
After scraping it, a website's HTML content is often cluttered with elements such as JavaScript, CSS styles, or unwanted tags that contribute nothing to the core information you're extracting. To make the data more structured and useful for further processing, you need to clean the DOM content by removing irrelevant elements and organizing the text.
This section explains how to clean the HTML content, extract meaningful text, and split it into smaller chunks for downstream processing. This cleaning step is essential when preparing data for tasks like natural language processing or content analysis.
Here's the code to add to main.py to handle cleaning the DOM content:
from bs4 import BeautifulSoup


# Extract the body content from the HTML
def extract_body_content(html_content):
    soup = BeautifulSoup(html_content, "html.parser")
    body_content = soup.body
    if body_content:
        return str(body_content)
    return ""


# Clean the body content by removing scripts, styles, and other unwanted elements
def clean_body_content(body_content):
    soup = BeautifulSoup(body_content, "html.parser")

    # Remove <script> and <style> tags
    for script_or_style in soup(["script", "style"]):
        script_or_style.extract()

    # Extract cleaned text with each line separated by a newline
    cleaned_content = soup.get_text(separator="\n")
    cleaned_content = "\n".join(
        line.strip() for line in cleaned_content.splitlines() if line.strip()
    )

    return cleaned_content


# Split the cleaned content into smaller chunks for processing
def split_dom_content(dom_content, max_length=5000):
    return [
        dom_content[i : i + max_length] for i in range(0, len(dom_content), max_length)
    ]
What the Code Does
extract_body_content pulls the <body> out of the raw HTML (returning an empty string if there isn't one), clean_body_content strips <script> and <style> tags and collapses the remaining text into clean, newline-separated lines, and split_dom_content slices that text into chunks of at most 5,000 characters so each piece stays small enough to process downstream.
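For the Streamlit app to show the cleaned chunks, and for the parsing step later, which reads the content from st.session_state, the "Scrape" handler in ui.py needs to run these functions and store the result. A minimal sketch of how that handler might look (the exact presentation is up to you):

# In ui.py: a sketch of an updated "Scrape" handler that keeps the cleaned
# text in session state so the parsing step below can reach it
from main import scrape_website, extract_body_content, clean_body_content

if st.button("Scrape", key="scrape_button"):
    st.write("Scraping the website...")
    html_content = scrape_website(url)
    body_content = extract_body_content(html_content)
    cleaned_content = clean_body_content(body_content)

    # Stash the cleaned text for the LLM parsing step
    st.session_state.dom_content = cleaned_content

    with st.expander("View DOM Content"):
        st.text_area("DOM Content", cleaned_content, height=300)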
Save your changes and test a request. You should see output like this after scraping a website:
With the DOM content cleaned and prepared, the next step is to parse the data and extract specific information using an LLM.
If you haven't already, download and install Ollama from the official Ollama website; on macOS, you can also install it with Homebrew:
brew install ollama
Next, pull a model of your choice; this guide uses phi3:
ollama pull phi3
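Optionally, you can verify that the model works before wiring it into the app by chatting with it interactively from the terminal (type /bye to exit):

ollama run phi3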
After the installation, you can call the model from your script with LangChain to extract insights from the DOM data you pass to it.
Here's how to set up the functionality for parsing the DOM content with phi3. The following code, which goes in a new file called llm.py (the UI imports from it later), implements the logic for parsing the DOM chunks with Ollama and extracting relevant details:
from langchain_ollama import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate

# Template to instruct Ollama for parsing
template = (
    "You are tasked with extracting specific information from the following text content: {dom_content}. "
    "Please follow these instructions carefully: \n\n"
    "1. **Extract Information:** Only extract the information that directly matches the provided description: {parse_description}. "
    "2. **No Extra Content:** Do not include any additional text, comments, or explanations in your response. "
    "3. **Empty Response:** If no information matches the description, return an empty string ('')."
    "4. **Direct Data Only:** Your output should contain only the data that is explicitly requested, with no other text."
)

# Initialize the Ollama model
model = OllamaLLM(model="phi3")


# Function to parse DOM chunks with Ollama
def parse_with_ollama(dom_chunks, parse_description):
    prompt = ChatPromptTemplate.from_template(template)
    chain = prompt | model
    parsed_results = []

    for i, chunk in enumerate(dom_chunks, start=1):
        if not chunk.strip():  # Skip empty chunks
            print(f"Skipping empty chunk at batch {i}")
            continue

        try:
            print(f"Processing chunk {i}: {chunk[:100]}...")  # Print a preview
            print(f"Parse description: {parse_description}")

            response = chain.invoke(
                {
                    "dom_content": chunk,
                    "parse_description": parse_description,
                }
            )
            print(f"Response for batch {i}: {response}")
            parsed_results.append(response)
        except Exception as e:
            print(f"Error parsing chunk {i}: {repr(e)}")
            parsed_results.append(f"Error: {repr(e)}")

    return "\n".join(parsed_results)
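To get a feel for the function before touching the UI, you can call it directly from a scratch script. The URL and parse description below are hypothetical; this assumes Ollama is running locally and that main.py contains the scraping and cleaning helpers from earlier:

# Hypothetical standalone test of the parsing pipeline
from main import scrape_website, extract_body_content, clean_body_content, split_dom_content
from llm import parse_with_ollama

html = scrape_website("https://example.com")
cleaned = clean_body_content(extract_body_content(html))
chunks = split_dom_content(cleaned)
print(parse_with_ollama(chunks, "Extract all headings on the page"))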
Add the following code to ui.py to let users enter parsing instructions for the LLM and see the results:
from main import (
    scrape_website,
    extract_body_content,
    clean_body_content,
    split_dom_content,
)
from llm import parse_with_ollama

if "dom_content" in st.session_state:
    parse_description = st.text_area(
        "Enter a description to extract specific insights from your scraped data:"
    )

    if st.button("Parse Content", key="parse_button"):
        if parse_description.strip() and st.session_state.get("dom_content"):
            st.info("Parsing the content...")
            dom_chunks = split_dom_content(st.session_state.dom_content)
            parsed_result = parse_with_ollama(dom_chunks, parse_description)
            st.text_area("Parsed Results", parsed_result, height=300)
        else:
            st.error("Please provide valid DOM content and a description to parse.")
With this in place, the scraper can now answer your queries based on the scraped data.
Combining web scraping with AI opens up exciting possibilities for data-driven insight. Beyond collecting and saving data, you can now use AI to streamline the process of gaining insight from scraped data. This is useful for marketing and sales teams, data analysts, business owners, and many others.
You can find the complete code for the AI scraper here. Feel free to experiment with it and tailor it to your unique needs. Contributions are welcome too; if you have ideas for improvements, consider creating a pull request!
You can also take this further. Here are some ideas: