While some websites can be scraped directly with tools like Selenium, Puppeteer, and the like, others that deploy advanced security measures such as CAPTCHAs and IP bans can be much harder to reach. To overcome these challenges and scrape 99% of websites for free, you will build such a scraper in this article, integrating into the code a scraping browser that handles CAPTCHAs and IP bans along the way.
However, collecting the data is only one step; what you do with that data matters just as much, if not more. Often, extracting insights means sifting through large volumes of information by hand. But what if you could automate that process? By leveraging a large language model (LLM), you can not only collect data but also query it to extract meaningful insights, saving time and effort.
In this guide, you will learn how to combine web scraping with AI to build a powerful tool for collecting and analyzing data at scale for free. Let's dive in!
Before you begin, make sure you are familiar with the following:
To follow along with this tutorial, you will need:
Complete the following steps to set up your environment and prepare to build the AI-powered scraper.
First, set up a virtual environment to manage your project's dependencies. This gives all the required packages an isolated space of their own.
Create a new project directory:
Open your terminal (or Command Prompt/PowerShell on Windows) and create a new project directory:
mkdir ai-website-scraper
cd ai-website-scraper
Create the virtual environment:
Run the following command to create a virtual environment:
On Windows:
python -m venv venv
On macOS/Linux:
python3 -m venv venv
This creates a venv folder that will store the virtual environment.
Activate the virtual environment to start working inside it:
On Windows:
.\venv\Scripts\activate
On macOS/Linux:
source venv/bin/activate
Your terminal prompt will change to show (venv), indicating that you are now inside the virtual environment.
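When you are done working on the project, you can leave the virtual environment at any time with the standard deactivate command:

deactivate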
Now, install the libraries your project needs. Create a requirements.txt file in the project directory and add the following dependencies:
streamlit
selenium
beautifulsoup4
langchain
langchain-ollama
lxml
html5lib
These packages are essential for scraping, processing the data, and building the UI:
streamlit: Used to build the interactive user interface.
selenium: To scrape website content.
beautifulsoup4: To parse and clean the HTML.
langchain and langchain-ollama: To interface with the Ollama LLM and process text.
lxml and html5lib: For advanced HTML parsing.
Install the dependencies with the following command:
(Before running it, make sure you are in the folder where the file is located.)
pip install -r requirements.txt
Create a file named ui.py in your project directory. This script defines the UI for the scraper. Use the code below to structure your application:
import streamlit as st
import pathlib
from main import scrape_website

# function to load css from the assets folder
def load_css(file_path):
    with open(file_path) as f:
        st.html(f"<style>{f.read()}</style>")

# Load the external CSS
css_path = pathlib.Path("assets/style.css")
if css_path.exists():
    load_css(css_path)

st.title("AI Scraper")
st.markdown(
    "Enter a website URL to scrape, clean the text content, and display the result in smaller chunks."
)

url = st.text_input(label="", placeholder="Enter the URL of the website you want to scrape")

if st.button("Scrape", key="scrape_button"):
    st.write("scraping the website...")
    result = scrape_website(url)
    st.write("Scraping complete.")
    st.write(result)
You can learn more about Streamlit components from their documentation.
To style your application, create an assets folder in your project directory and add a style.css file. Customize the Streamlit interface with CSS:
.stAppViewContainer {
    background-image: url("https://images.unsplash.com/photo-1732979887702-40baea1c1ff6?q=80&w=2832&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D");
    background-size: cover;
    color: black;
}

.stAppHeader {
    background-color: rgba(0, 0, 0, 0);
}

.st-ae {
    background-color: rgba(233, 235, 234, 0.895);
}

.st-emotion-cache-ysk9xe {
    color: black;
}

.st.info, .stAlert {
    background-color: black;
}

.st-key-scrape_button button {
    display: inline-block;
    padding: 10px 20px;
    font-size: 16px;
    color: #fff;
    background-color: #007bff;
    border: none;
    border-radius: 5px;
    cursor: pointer;
    animation: pulse 2s infinite;
}

.st-key-scrape_button button:hover {
    background-color: #0056b3;
    color: #fff;
}
In the project directory, run the following command:
streamlit run ui.py
This should start a local server, and you should see a URL in the terminal, usually http://localhost:8501. Open this URL in your browser to interact with the web application.
Next, write the code to extract the HTML content of any webpage using Selenium. However, for the code to work, you need a Chrome WebDriver.
Selenium requires a WebDriver to interact with web pages. Here is how to set it up:
After downloading ChromeDriver, extract the file and copy the application file named "chromedriver" into your project folder.
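As an aside, recent versions of Selenium (4.6 and later) bundle Selenium Manager, which can download a matching driver automatically, so the explicit path may be optional on newer setups; the manual ChromeDriver setup above remains the dependable route. A one-line sketch of the automatic variant:

# Selenium 4.6+ only: let Selenium Manager resolve a matching driver
driver = webdriver.Chrome()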
Once that is done, create a new file called main.py and implement the following code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Function to scrape HTML from a website
def scrape_website(website_url):
    # Path to WebDriver
    webdriver_path = "./chromedriver"  # Replace with your WebDriver path
    service = Service(webdriver_path)
    driver = webdriver.Chrome(service=service)

    try:
        # Open the website
        driver.get(website_url)

        # Wait for the page to fully load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )

        # Extract the HTML source
        html_content = driver.page_source
        return html_content
    finally:
        # Ensure the browser is closed after scraping
        driver.quit()
Save the code and run it; all the HTML of the scraped page should be displayed in your Streamlit application.
While you can now retrieve a website's HTML, the code above may not work for sites with advanced anti-scraping mechanisms such as CAPTCHA challenges or IP bans. For example, scraping a site like Indeed or Amazon with Selenium may land you on a CAPTCHA page that blocks access. This happens because the website detects that a bot is trying to access its content. If this behavior persists, the site may eventually block your IP address, preventing further access.
To fix this, integrate Bright Data's Scraping Browser into your script:
Sign up: go to Bright Data's homepage and create an account.
Once you have logged in, click “Get Proxy Products”.
Click the “Add” button and select “Scraping Browser”.
Next, you will be taken to the “Add zone” page, where you will need to choose a name for your new Scraping Browser proxy zone. After that, click “Add”.
After that, your proxy zone credentials will be created. You will need these details in your script to bypass whatever anti-scraping mechanisms a website uses.
You can also check Bright Data's developer documentation to learn more about the Scraping Browser.
In your main.py file, modify the code as follows. You will notice that this code is cleaner and shorter than the previous version.
from selenium.webdriver import Remote, ChromeOptions
from selenium.webdriver.chromium.remote_connection import ChromiumRemoteConnection
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

AUTH = '<username>:<password>'
SBR_WEBDRIVER = f'https://{AUTH}@brd.superproxy.io:9515'

# Function to scrape HTML from a website
def scrape_website(website_url):
    print("Connecting to Scraping Browser...")
    sbr_connection = ChromiumRemoteConnection(SBR_WEBDRIVER, "goog", "chrome")
    with Remote(sbr_connection, options=ChromeOptions()) as driver:
        driver.get(website_url)
        print("Waiting for captcha to solve...")
        solve_res = driver.execute(
            "executeCdpCommand",
            {
                "cmd": "Captcha.waitForSolve",
                "params": {"detectTimeout": 10000},
            },
        )
        print("Captcha solve status:", solve_res["value"]["status"])
        print("Navigated! Scraping page content...")
        html = driver.page_source
        return html
Replace <username> and <password> with your Scraping Browser username and password.
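Because the Scraping Browser is a remote browser, no local ChromeDriver is needed; the Remote connection replaces the local webdriver.Chrome instance. Also, rather than hardcoding credentials, you might read them from an environment variable. A minimal sketch, assuming a hypothetical BRIGHTDATA_AUTH variable holding username:password:

import os

# Hypothetical: read "username:password" from the BRIGHTDATA_AUTH
# environment variable instead of hardcoding it in the script
AUTH = os.environ.get("BRIGHTDATA_AUTH", "<username>:<password>")
SBR_WEBDRIVER = f"https://{AUTH}@brd.superproxy.io:9515"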
After scraping a website's HTML content, it is often cluttered with unnecessary elements such as JavaScript, CSS styles, or unwanted tags that contribute nothing to the core information you are extracting. To make the data more structured and useful for further processing, you need to clean the DOM content by removing irrelevant elements and organizing the text.
This section explains how to clean the HTML content, extract meaningful text, and split it into smaller chunks for downstream processing. Cleaning is essential for preparing data for tasks like natural language processing or content analysis.
Here is the code to add to main.py to handle cleaning the DOM content:
from bs4 import BeautifulSoup

# Extract the body content from the HTML
def extract_body_content(html_content):
    soup = BeautifulSoup(html_content, "html.parser")
    body_content = soup.body
    if body_content:
        return str(body_content)
    return ""

# Clean the body content by removing scripts, styles, and other unwanted elements
def clean_body_content(body_content):
    soup = BeautifulSoup(body_content, "html.parser")

    # Remove <script> and <style> tags
    for script_or_style in soup(["script", "style"]):
        script_or_style.extract()

    # Extract cleaned text with each line separated by a newline
    cleaned_content = soup.get_text(separator="\n")
    cleaned_content = "\n".join(
        line.strip() for line in cleaned_content.splitlines() if line.strip()
    )
    return cleaned_content

# Split the cleaned content into smaller chunks for processing
def split_dom_content(dom_content, max_length=5000):
    return [
        dom_content[i : i + max_length] for i in range(0, len(dom_content), max_length)
    ]
What the Code Does
extract_body_content: Isolates the page's <body> and returns it as a string (or an empty string if no body exists).
clean_body_content: Strips out <script> and <style> tags, then collapses the remaining text into clean, newline-separated lines.
split_dom_content: Slices the cleaned text into chunks of at most max_length characters (5,000 by default) so each piece fits into an LLM prompt.
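As a quick sanity check, you can chain these helpers together from a Python shell; a minimal sketch (the URL is just an example):

from main import scrape_website, extract_body_content, clean_body_content, split_dom_content

# Scrape a page, strip it down to readable text, and chunk it
html = scrape_website("https://example.com")
body = extract_body_content(html)
text = clean_body_content(body)
chunks = split_dom_content(text, max_length=5000)
print(f"{len(chunks)} chunk(s); first 200 chars:\n{chunks[0][:200]}")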
Save your changes and test the application. After scraping a website, you should now see the cleaned text as output.
Once the DOM content is cleaned and prepared, the next step is to parse the information with an LLM to extract the specific details you care about.
If you have not already, download and install Ollama (via Homebrew, for example):
brew install ollama
Then, pull a model to use; this guide uses phi3:
ollama pull phi3
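To confirm the model downloaded correctly, you can start an interactive session with it (type /bye to exit):

ollama run phi3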
Once the installation is complete, you can call the model from your script using LangChain to get meaningful insights from the data you send to it.
Here is how to set up the functionality to parse the DOM content with the phi3 model. The following code implements the logic to parse the DOM chunks with Ollama and extract the relevant details; add it to a new file named llm.py (the ui.py update below imports parse_with_ollama from this module):
from langchain_ollama import OllamaLLM
from langchain_core.prompts import ChatPromptTemplate

# Template to instruct Ollama for parsing
template = (
    "You are tasked with extracting specific information from the following text content: {dom_content}. "
    "Please follow these instructions carefully: \n\n"
    "1. **Extract Information:** Only extract the information that directly matches the provided description: {parse_description}. "
    "2. **No Extra Content:** Do not include any additional text, comments, or explanations in your response. "
    "3. **Empty Response:** If no information matches the description, return an empty string ('')."
    "4. **Direct Data Only:** Your output should contain only the data that is explicitly requested, with no other text."
)

# Initialize the Ollama model
model = OllamaLLM(model="phi3")

# Function to parse DOM chunks with Ollama
def parse_with_ollama(dom_chunks, parse_description):
    prompt = ChatPromptTemplate.from_template(template)
    chain = prompt | model
    parsed_results = []

    for i, chunk in enumerate(dom_chunks, start=1):
        if not chunk.strip():  # Skip empty chunks
            print(f"Skipping empty chunk at batch {i}")
            continue

        try:
            print(f"Processing chunk {i}: {chunk[:100]}...")  # Print a preview
            print(f"Parse description: {parse_description}")
            response = chain.invoke(
                {
                    "dom_content": chunk,
                    "parse_description": parse_description,
                }
            )
            print(f"Response for batch {i}: {response}")
            parsed_results.append(response)
        except Exception as e:
            print(f"Error parsing chunk {i}: {repr(e)}")
            parsed_results.append(f"Error: {repr(e)}")

    return "\n".join(parsed_results)
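To try the parser on its own (assuming the Ollama server is running and phi3 has been pulled), you can feed it a made-up chunk; the sample text below is purely illustrative:

from llm import parse_with_ollama

# A tiny fabricated chunk standing in for real scraped text
sample_chunks = ["Acme Corp. Contact: jane@example.com. Pricing: $49/month."]
print(parse_with_ollama(sample_chunks, "Extract all email addresses"))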
Add this code to the ui.py file so users can enter parsing instructions for the LLM and view the results:
from main import scrape_website, extract_body_content, clean_body_content, split_dom_content
from llm import parse_with_ollama

if "dom_content" in st.session_state:
    parse_description = st.text_area(
        "Enter a description to extract specific insights from your scraped data:"
    )

    if st.button("Parse Content", key="parse_button"):
        if parse_description.strip() and st.session_state.get("dom_content"):
            st.info("Parsing the content...")
            dom_chunks = split_dom_content(st.session_state.dom_content)
            parsed_result = parse_with_ollama(dom_chunks, parse_description)
            st.text_area("Parsed Results", parsed_result, height=300)
        else:
            st.error("Please provide valid DOM content and a description to parse.")
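Note that this block only renders once dom_content exists in st.session_state, so the scrape handler from earlier must store the cleaned text there. A minimal sketch of an updated handler for ui.py (replacing the original st.button("Scrape", ...) block; the widget labels are just suggestions):

if st.button("Scrape", key="scrape_button"):
    st.write("Scraping the website...")
    html = scrape_website(url)
    body = extract_body_content(html)
    cleaned = clean_body_content(body)
    # Persist the cleaned text so the parsing UI below can find it
    st.session_state.dom_content = cleaned
    st.text_area("Cleaned Content", cleaned, height=300)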
With this in place, the scraper can now answer your prompts based on the scraped data.
Combining web scraping and AI opens up exciting possibilities for data-driven insights. Beyond collecting and saving data, you can now leverage AI to streamline the process of gaining insights from scraped data. This is useful for marketing and sales teams, data analysts, business owners, and many others.
You can find the complete code for the AI scraper here. Feel free to experiment with it and adapt it to your unique needs. Contributions are welcome; if you have ideas for improvements, consider creating a pull request!
You can also take this further. Here are some ideas: