
Let's Build a Free Web Scraping Tool That Combines Proxies and AI for Data Analysis

by Victor Yakubu (@aviatorscode2), 15 min read, 2024/12/17

Too Long; Didn't Read

Learn how to combine web scraping, proxies, and AI-powered language models to automate data extraction and gain actionable insights with little effort.

While some websites are easy to scrape with Selenium, Puppeteer, and similar tools, websites that implement advanced security measures such as CAPTCHAs and IP bans can be difficult. To overcome these challenges and scrape 99% of websites for free, you will build a scraper in this article and integrate a proxy tool into your code to help bypass these security measures.


However, collecting data is only one step; what you do with that data is equally important, if not more so. Often, this means painstakingly sifting through large volumes of information by hand. But what if you could automate this process? By using a large language model (LLM), you can not only collect data but also query it to extract meaningful insights, saving time and effort.


In this guide, you'll learn how to combine web scraping with AI to build a powerful tool for collecting and analyzing data at scale, for free. Let's dive in!

Prerequisites

Before you begin, make sure you have the following:

  1. Basic knowledge of Python, as this project involves writing and understanding Python code.
  2. Python (3.7 or later) installed on your system; recent releases of Streamlit and LangChain may require a newer Python version. You can download it from python.org.

Installation and Setup

Follow these steps to set up your environment and prepare to build the AI-powered scraper.

1. Create a Virtual Environment

First, set up a virtual environment to manage your project's dependencies. This gives you an isolated space for all the required packages.


  1. Create a new project directory:

    Open your terminal (or Command Prompt / PowerShell on Windows) and create a new directory for your project:

     mkdir ai-website-scraper
     cd ai-website-scraper


  2. Create the virtual environment:

    Run the following command to create the virtual environment:


  • On Windows:

     python -m venv venv
  • On macOS / Linux:

     python3 -m venv venv


This creates a venv folder that will hold the virtual environment.


2. Activate the Virtual Environment

Activate the virtual environment to start working inside it:


  • On Windows:

     .\venv\Scripts\activate
  • On macOS / Linux:

     source venv/bin/activate


Your shell prompt will change to show (venv), confirming that you are inside the virtual environment.
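
If you would rather confirm the activation from Python than rely on the prompt, a minimal sketch like the one below can help. It assumes a standard venv-style environment; the helper name in_virtualenv is made up for this example.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Inside a virtual environment:", in_virtualenv())
```

Run it with the same python command you plan to use for the project.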

3. Install Required Dependencies

Now, install the libraries your project needs. Create a requirements.txt file in your project directory and add the following:


    streamlit
    selenium
    beautifulsoup4
    langchain
    langchain-ollama
    lxml
    html5lib


These packages cover scraping, data processing, and building the UI:

  • streamlit : Used to create the user interface.

  • selenium : Used to scrape website content.

  • beautifulsoup4 : Used to parse and clean the HTML.

  • langchain and langchain-ollama : Used to integrate with the Ollama LLM and process text.

  • lxml and html5lib : Used for advanced HTML parsing.


Install the dependencies with the following command:

(Make sure you are in the directory that contains requirements.txt before running the command.)


    pip install -r requirements.txt
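
To double-check that the installation succeeded, you could run a small script like the following. It is only a sketch: the helper missing_modules is hypothetical, and the list uses import names, which differ from some of the PyPI names above (beautifulsoup4 is imported as bs4, langchain-ollama as langchain_ollama).

```python
from importlib.util import find_spec

# Import names for the packages listed in requirements.txt.
REQUIRED_MODULES = [
    "streamlit",
    "selenium",
    "bs4",
    "langchain",
    "langchain_ollama",
    "lxml",
    "html5lib",
]


def missing_modules(module_names):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```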


Building the UI with Streamlit

Streamlit makes it easy to create an interactive user interface (UI) for Python applications. In this section, you will build a simple, user-friendly interface where users can enter a URL and view the scraped data.

1. Set Up the Streamlit Script

Create a file named ui.py in your project directory. This script defines the UI for your scraper. Use the following code to structure your application:

    import pathlib

    import streamlit as st

    from main import scrape_website


    # Load a CSS file and inject it into the page
    def load_css(file_path):
        with open(file_path) as f:
            st.html(f"<style>{f.read()}</style>")


    # Load the external CSS from the assets folder, if it exists
    css_path = pathlib.Path("assets/style.css")
    if css_path.exists():
        load_css(css_path)

    st.title("AI Scraper")
    st.markdown(
        "Enter a website URL to scrape, clean the text content, and display the result in smaller chunks."
    )

    # A non-empty label hidden with label_visibility avoids Streamlit's empty-label warning
    url = st.text_input(
        label="Website URL",
        label_visibility="collapsed",
        placeholder="Enter the URL of the website you want to scrape",
    )

    if st.button("Scrape", key="scrape_button"):
        st.write("scraping the website...")
        result = scrape_website(url)
        st.write("Scraping complete.")
        st.write(result)


  • The st.title and st.markdown functions set the app title and give instructions to users.
  • The st.text_input component lets users enter the URL of the website they want to scrape.
  • Clicking the "Scrape" button triggers the scraping logic and shows progress messages with st.write .


You can learn more about Streamlit components in the Streamlit documentation.
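
The script above passes whatever the user types straight to scrape_website. One possible safeguard, shown here as a sketch rather than as part of the original app, is to check the input first. The helper name is_valid_url is hypothetical.

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host name."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


for candidate in ("https://example.com", "example.com", ""):
    print(repr(candidate), is_valid_url(candidate))
```

In ui.py, you could call this check inside the button handler and show st.error when it returns False instead of starting the scrape.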

2. Add Custom Styles

To style your application, create an assets folder in your project directory and add a style.css file to it. Customize the Streamlit interface with CSS:

    .stAppViewContainer {
      background-image: url("https://images.unsplash.com/photo-1732979887702-40baea1c1ff6?q=80&w=2832&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D");
      background-size: cover;
      color: black;
    }

    .stAppHeader {
      background-color: rgba(0, 0, 0, 0);
    }

    .st-ae {
      background-color: rgba(233, 235, 234, 0.895);
    }

    .st-emotion-cache-ysk9xe {
      color: black;
    }

    .st.info,
    .stAlert {
      background-color: black;
    }

    .st-key-scrape_button button {
      display: inline-block;
      padding: 10px 20px;
      font-size: 16px;
      color: #fff;
      background-color: #007bff;
      border: none;
      border-radius: 5px;
      cursor: pointer;
      animation: pulse 2s infinite;
    }

    .st-key-scrape_button button:hover {
      background-color: #0056b3;
      color: #fff;
    }


3. Run the Streamlit App

In your project directory, run the following command:

    streamlit run ui.py


This starts a local server, and you should see a URL in the terminal, usually http://localhost:8501 . Open this URL in your browser to interact with the web app.

[Image: AI Scraper app]

Scraping a Website with Selenium

Next, write the code to extract the HTML content of any website using Selenium. For the code to work, however, you need the Chrome WebDriver.

Install ChromeDriver for Selenium

Selenium requires a WebDriver to interact with web pages. Here's how to set it up:

  1. Download ChromeDriver:
    Visit the ChromeDriver website and download the version that matches your Google Chrome browser.
  2. Add ChromeDriver to your project:
    After downloading ChromeDriver, extract the archive and copy the "chromedriver" executable into your project directory.
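
Before running the scraper, it can help to confirm that the driver is where the script expects it. The sketch below is an illustration under assumptions: the helper find_chromedriver is hypothetical, it looks in the project directory first (including the .exe name used on Windows), and it falls back to searching PATH.

```python
import shutil
from pathlib import Path


def find_chromedriver(project_dir="."):
    """Return a path to chromedriver, or None if it cannot be found."""
    # Look in the project directory first, as the tutorial suggests.
    for name in ("chromedriver", "chromedriver.exe"):
        candidate = Path(project_dir) / name
        if candidate.is_file():
            return str(candidate)
    # Fall back to any chromedriver available on PATH.
    return shutil.which("chromedriver")


driver_path = find_chromedriver()
print("ChromeDriver found at:", driver_path)
```

If this prints None, the webdriver_path used in main.py will not point at a usable driver.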

Once this is done, create a new file named main.py and add the code below:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC


    # Function to scrape HTML from a website
    def scrape_website(website_url):
        # Path to WebDriver
        webdriver_path = "./chromedriver"  # Replace with your WebDriver path
        service = Service(webdriver_path)
        driver = webdriver.Chrome(service=service)

        try:
            # Open the website
            driver.get(website_url)

            # Wait until the <body> element is present (up to 10 seconds)
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.TAG_NAME, "body"))
            )

            # Extract the HTML source
            html_content = driver.page_source
            return html_content
        finally:
            # Ensure the browser is closed after scraping
            driver.quit()


Save the file and run the app; you should see the full HTML of the page you scraped displayed in your Streamlit app.

Using a Proxy Provider to Bypass Websites with CAPTCHAs and IP Bans

While you can now retrieve the HTML of a website, the code above may not work on sites with advanced anti-scraping mechanisms such as CAPTCHA challenges or IP bans. For example, scraping a site like Indeed or Amazon with Selenium may return a CAPTCHA page that blocks access. This happens because the website detects that a bot is trying to access its content. If this behavior persists, the site may eventually ban your IP address, preventing further access.
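
A blocked request often still returns HTML, so it is easy to mistake a challenge page for real content. As a rough illustration, the sketch below flags pages that look like a CAPTCHA challenge. The marker strings and the helper looks_like_captcha are assumptions made for this example, not a reliable detector.

```python
# Phrases that commonly appear on challenge pages (an assumed, incomplete list).
CAPTCHA_MARKERS = ("captcha", "verify you are human", "unusual traffic")


def looks_like_captcha(html: str) -> bool:
    """Heuristically decide whether scraped HTML is a CAPTCHA challenge page."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


print(looks_like_captcha("<h1>Please complete the CAPTCHA to continue</h1>"))
print(looks_like_captcha("<h1>Welcome to our store</h1>"))
```

A check like this could run on the result of scrape_website so the UI can warn the user instead of displaying the challenge page as if it were data.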


[Image: CAPTCHA]


To fix this, integrate Bright Data's Scraping Browser into your script. The Scraping Browser is a robust tool that uses multiple proxy networks, including residential IPs, to get past anti-scraping defenses. It handles page unblocking by managing custom headers, browser fingerprinting, CAPTCHA solving, and more. This helps your scraping stay undetected while you access content without interruption.

Setting Up Bright Data's Scraping Browser for Free

  1. Sign up: go to the Bright Data homepage and click "Start Free Trial". If you already have a Bright Data account, you can simply log in.

  2. After logging in, click "Get Proxy Products".


  3. Click the "Add" button and select "Scraping Browser".


  4. Next, you will be taken to the "Add zone" page, where you choose a name for your new Scraping Browser proxy zone. After that, click "Add".


  5. After this, your proxy zone credentials are created. You will need these details in your script to bypass the anti-scraping mechanisms used by websites.


You can also check Bright Data's developer documentation for more details about the Scraping Browser.


In your main.py file, replace the code with the version below. You will notice that it is cleaner and shorter than the previous code.


    from selenium.webdriver import Remote, ChromeOptions
    from selenium.webdriver.chromium.remote_connection import ChromiumRemoteConnection
    from selenium.webdriver.common.by import By
    from bs4 import BeautifulSoup

    AUTH = '<username>:<password>'
    SBR_WEBDRIVER = f'https://{AUTH}@brd.superproxy.io:9515'


    # Function to scrape HTML from a website
    def scrape_website(website_url):
        print("Connecting to Scraping Browser...")
        sbr_connection = ChromiumRemoteConnection(SBR_WEBDRIVER, "goog", "chrome")

        with Remote(sbr_connection, options=ChromeOptions()) as driver:
            driver.get(website_url)

            # Ask the remote browser to detect and solve a CAPTCHA, if one appears
            print("Waiting captcha to solve...")
            solve_res = driver.execute(
                "executeCdpCommand",
                {
                    "cmd": "Captcha.waitForSolve",
                    "params": {"detectTimeout": 10000},
                },
            )
            print("Captcha solve status:", solve_res["value"]["status"])

            print("Navigated! Scraping page content...")
            html = driver.page_source
            return html


Replace <username> and <password> with your Scraping Browser username and password.
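
Hard-coding credentials in main.py makes them easy to leak. One alternative, sketched below under assumptions, is to read them from environment variables. The variable names SBR_USERNAME and SBR_PASSWORD and both helper functions are invented for this example; only the host and port come from the code above.

```python
import os
from urllib.parse import quote

SBR_HOST = "brd.superproxy.io:9515"  # host and port taken from the code above


def build_sbr_url(username: str, password: str) -> str:
    """Build the WebDriver URL, percent-encoding the credentials."""
    # Encoding keeps characters such as "@" or ":" from breaking the URL.
    return f"https://{quote(username, safe='')}:{quote(password, safe='')}@{SBR_HOST}"


def sbr_url_from_env() -> str:
    """Read the (hypothetical) SBR_USERNAME / SBR_PASSWORD variables."""
    username = os.environ.get("SBR_USERNAME")
    password = os.environ.get("SBR_PASSWORD")
    if not username or not password:
        raise RuntimeError("Set SBR_USERNAME and SBR_PASSWORD before running the scraper")
    return build_sbr_url(username, password)
```

With this in place, SBR_WEBDRIVER could be assigned from sbr_url_from_env() instead of the hard-coded AUTH string.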

Cleaning the DOM Content

After you scrape a website's HTML, it is often full of unnecessary elements such as JavaScript, CSS styles, or unwanted tags that contribute nothing to the core information you are extracting. To make the data more structured and useful for further processing, you need to clean the DOM content by removing irrelevant elements and organizing the text.


This section explains how to clean the HTML content, extract meaningful text, and split it into smaller chunks for downstream processing. Cleaning is essential when preparing data for tasks such as natural language processing or content analysis.

Code for Cleaning the DOM Content

Here is the code to add to main.py to handle cleaning the DOM content:


    from bs4 import BeautifulSoup


    # Extract the body content from the HTML
    def extract_body_content(html_content):
        soup = BeautifulSoup(html_content, "html.parser")
        body_content = soup.body
        if body_content:
            return str(body_content)
        return ""


    # Clean the body content by removing scripts, styles, and other unwanted elements
    def clean_body_content(body_content):
        soup = BeautifulSoup(body_content, "html.parser")

        # Remove <script> and <style> tags
        for script_or_style in soup(["script", "style"]):
            script_or_style.extract()

        # Extract cleaned text with each line separated by a newline
        cleaned_content = soup.get_text(separator="\n")
        cleaned_content = "\n".join(
            line.strip() for line in cleaned_content.splitlines() if line.strip()
        )

        return cleaned_content


    # Split the cleaned content into smaller chunks for processing
    def split_dom_content(dom_content, max_length=5000):
        return [
            dom_content[i : i + max_length]
            for i in range(0, len(dom_content), max_length)
        ]


What the Code Does

  1. Extracting the Body Content:
    • The extract_body_content function uses BeautifulSoup to parse the HTML and extract the contents of the <body> tag.
    • If a <body> tag exists, the function returns it as a string. Otherwise, it returns an empty string.
  2. Cleaning the Content:
    • The clean_body_content function processes the extracted content to remove unnecessary elements:
      • <script> and <style> tags are removed to eliminate JavaScript and CSS.
      • The function extracts the plain text from the cleaned content.
      • It tidies the text by stripping empty lines and extra whitespace.
  3. Splitting the Content:
    • The split_dom_content function takes the cleaned content and splits it into smaller chunks with a default maximum length of 5,000 characters.
    • This is useful for processing large amounts of text in manageable pieces, especially when sending data to models with token or input size limits.
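
Because split_dom_content cuts at fixed character offsets, a chunk can end in the middle of a line or word. If that matters for your prompts, one alternative is to split on line boundaries instead. The function below is a sketch of that idea, not part of the original project; the name split_on_line_boundaries is hypothetical.

```python
def split_on_line_boundaries(text, max_length=5000):
    """Split text into chunks of at most max_length characters, preferring line breaks."""
    chunks, current, current_len = [], [], 0

    for line in text.splitlines(keepends=True):
        # A single line longer than max_length has to be hard-split.
        while len(line) > max_length:
            if current:
                chunks.append("".join(current))
                current, current_len = [], 0
            chunks.append(line[:max_length])
            line = line[max_length:]

        # Start a new chunk when adding this line would exceed the limit.
        if current and current_len + len(line) > max_length:
            chunks.append("".join(current))
            current, current_len = [], 0

        current.append(line)
        current_len += len(line)

    if current:
        chunks.append("".join(current))
    return chunks


sample = "first line\nsecond line\nthird line\n"
for chunk in split_on_line_boundaries(sample, max_length=25):
    print(repr(chunk))
```

Joining the returned chunks reproduces the original text, so no content is lost by switching strategies.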


Save your changes and test your application. You should see the scraped output in the app after scraping a website.

Parsing the DOM Content with Ollama

Once the DOM content is cleaned and prepared, the next step is to parse it to extract specific details using Ollama together with LangChain. Ollama is a CLI tool for downloading and running large language models (LLMs) locally. Before using Ollama, however, complete the following installation steps:


  • If you haven't already, download and install Ollama from the official website. On a Mac, you can install it with Homebrew:

     brew install ollama
  • Next, install any model from the Ollama model list; it includes models such as Phi3, Mistral, and Gemma 2, each with its own system requirements. This code uses phi3, mainly because it is lightweight.

     ollama pull phi3


After installation, you can call the model from your script through LangChain to generate insights from the data you send to it.


Here's how to set up the functionality that parses the DOM content with the phi3 model.

Code Walkthrough for llm.py

The following code implements the logic to parse DOM chunks with Ollama and extract the relevant details:

    from langchain_ollama import OllamaLLM
    from langchain_core.prompts import ChatPromptTemplate

    # Template to instruct Ollama for parsing
    template = (
        "You are tasked with extracting specific information from the following text content: {dom_content}. "
        "Please follow these instructions carefully: \n\n"
        "1. **Extract Information:** Only extract the information that directly matches the provided description: {parse_description}. "
        "2. **No Extra Content:** Do not include any additional text, comments, or explanations in your response. "
        "3. **Empty Response:** If no information matches the description, return an empty string (''). "
        "4. **Direct Data Only:** Your output should contain only the data that is explicitly requested, with no other text."
    )

    # Initialize the Ollama model
    model = OllamaLLM(model="phi3")


    # Function to parse DOM chunks with Ollama
    def parse_with_ollama(dom_chunks, parse_description):
        prompt = ChatPromptTemplate.from_template(template)
        chain = prompt | model

        parsed_results = []
        for i, chunk in enumerate(dom_chunks, start=1):
            if not chunk.strip():  # Skip empty chunks
                print(f"Skipping empty chunk at batch {i}")
                continue

            try:
                print(f"Processing chunk {i}: {chunk[:100]}...")  # Print a preview
                print(f"Parse description: {parse_description}")
                response = chain.invoke(
                    {
                        "dom_content": chunk,
                        "parse_description": parse_description,
                    }
                )
                print(f"Response for batch {i}: {response}")
                parsed_results.append(response)
            except Exception as e:
                print(f"Error parsing chunk {i}: {repr(e)}")
                parsed_results.append(f"Error: {repr(e)}")

        return "\n".join(parsed_results)

What the Code Does

  1. Instruction Template:
    • Gives Ollama clear guidance on what information to extract.
    • Ensures the output is clean, concise, and relevant to the parsing description.
  2. Chunk Processing:
    • The parse_with_ollama function iterates over the DOM chunks, processing each one with the LLM.
    • Skips empty chunks to avoid unnecessary model calls.
  3. Error Handling:
    • Handles errors gracefully, logs them, and continues processing the remaining chunks.
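
To see the chunk loop in isolation, or to try it without a running model, you can inject the model call as a plain function. The sketch below is a simplified variant rather than the article's implementation: parse_chunks and fake_invoke are hypothetical names, failed chunks are logged and skipped instead of being appended as "Error: ..." strings, and empty answers are dropped before joining.

```python
from typing import Callable, Iterable


def parse_chunks(
    chunks: Iterable[str],
    parse_description: str,
    invoke: Callable[[dict], str],
) -> str:
    """Run invoke on each non-empty chunk and join the non-empty answers."""
    results = []
    for i, chunk in enumerate(chunks, start=1):
        if not chunk.strip():
            continue  # nothing to send to the model
        try:
            response = invoke(
                {"dom_content": chunk, "parse_description": parse_description}
            )
        except Exception as exc:
            print(f"Error parsing chunk {i}: {exc!r}")
            continue
        cleaned = response.strip()
        # The prompt asks for '' when nothing matches, so drop those answers too.
        if cleaned and cleaned not in ("''", '""'):
            results.append(cleaned)
    return "\n".join(results)


def fake_invoke(inputs: dict) -> str:
    """Stand-in for a model: return only the lines that contain a dollar sign."""
    lines = inputs["dom_content"].splitlines()
    return "\n".join(line for line in lines if "$" in line)


print(parse_chunks(["Widget $5\nAbout us", "   ", "Contact"], "prices", fake_invoke))
```

With the real chain from llm.py, you could pass chain.invoke as the invoke argument, since it accepts the same dictionary of inputs.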

Updating the ui.py File

Add the following code to ui.py so users can enter parsing instructions for the LLM and view the results:

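    # Replaces the earlier "from main import scrape_website" line at the top of ui.py
    from main import scrape_website, extract_body_content, clean_body_content, split_dom_content
    from llm import parse_with_ollama

    # Note: st.session_state.dom_content must be set before this block does anything.
    # One minimal way (an assumption, not shown in the original snippet) is to store
    # the cleaned text inside the "Scrape" button handler:
    #     st.session_state.dom_content = clean_body_content(extract_body_content(result))

    if "dom_content" in st.session_state:
        parse_description = st.text_area(
            "Enter a description to extract specific insights from your scraped data:"
        )

        if st.button("Parse Content", key="parse_button"):
            if parse_description.strip() and st.session_state.get("dom_content"):
                st.info("Parsing the content...")
                dom_chunks = split_dom_content(st.session_state.dom_content)
                parsed_result = parse_with_ollama(dom_chunks, parse_description)
                st.text_area("Parsed Results", parsed_result, height=300)
            else:
                st.error("Please provide valid DOM content and a description to parse.")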

How It Works in the UI

  1. User Input:
    • The user provides a natural-language description of the data to extract in a text area.
  2. Parsing Trigger:
    • When the Parse Content button is clicked, the cleaned DOM content is split into manageable chunks and passed to parse_with_ollama.
  3. Results Display:
    • The parsed results are displayed in a text area, allowing users to review the extracted information.


With this in place, the scraper can answer your questions based on the scraped data.


What's Next?

Combining web scraping with AI opens up exciting possibilities for data-driven insights. Beyond collecting and storing data, you can now use AI to streamline the process of gaining insight from what you scrape. This is useful for marketing and sales teams, data analysts, business owners, and many others.


You can find the complete code for the AI scraper here. Feel free to experiment with it and adapt it to your own needs. Contributions are also welcome: if you have ideas for improvements, consider opening a pull request!


You can also take this further. Here are some ideas:

  • Experiment with Prompts: Tailor your prompts to extract more specific insights or to address unique project requirements.
  • User Interface
  • Integrate Other LLM Models: Explore other language models such as OpenAI, Gemini, and others to further improve your data analysis.