If you work with web data, you know that extracting it is only the first step. Traditionally, web-scraped data is stored in buckets (cloud storage buckets, data lakes, or data warehouses) and then consumed through Business Intelligence (BI) tools, both commercial and open source. For example, a team might scrape product prices or customer reviews from the web, store the raw data as CSV/JSON files, load it into a SQL data warehouse, and then use BI platforms like Tableau or Power BI to build dashboards and reports.

Now, Large Language Models (LLMs) are changing this paradigm. Instead of relying on static dashboards or writing SQL queries, users can interact with an AI assistant that answers questions about the scraped data in natural language. Rather than a human writing a query or interpreting a chart, the assistant can answer questions about the data directly. Imagine a ChatGPT-like interface where you type a few prompts and instantly get insights, skipping dashboard creation entirely. I've seen several attempts at this kind of workflow before LLMs arrived, but none with much success. In the companies I've worked for, I've spent countless hours merging different reports, customer by customer (when not cell by cell in some Excel mess), just to answer questions like "Which supplier raised its prices the most last week?". This approach promises fast, self-service insights for non-technical users, without anyone having to build dashboards or write code. But it raises a new question: what about hallucinations?
If the number isn't in the data, can we trust the answer 100%? In this post (and the next in the Lab series), we'll build an end-to-end pipeline: we'll start by scraping the articles of this newsletter, store them in a database suited for AI consumption, retrieve the relevant content, and finally build a web app that answers questions using a GPT model.

Augmenting LLM Knowledge

Feeding custom data (like your scraped dataset) into an LLM can be done mainly in two ways: by fine-tuning the model or by using Retrieval-Augmented Generation (RAG). Each approach has advantages and drawbacks, so let's see how they differ and which one can better fit our use case.

Fine-Tuning vs. Retrieval-Augmented Generation

With fine-tuning, you take a pre-trained model and continue training it on your domain-specific dataset, adjusting its weights so that it absorbs that knowledge. For example, if you scraped a collection of technical articles, you could fine-tune an LLM on those texts. After fine-tuning, the model intrinsically knows the content and can answer about it from the inside. Fine-tuning is typically done by feeding the model a large set of question-answer pairs or text passages from your data, so that it learns the domain-specific content. Once trained, the augmented knowledge is baked into the model's weights.

Retrieval-Augmented Generation (RAG) takes a different approach: the model itself is not changed, but we give it access to an external knowledge base (usually via vector search). When a query comes in, the system retrieves the relevant documents and feeds them to the model as context, and the LLM then generates its answer grounded in that retrieved context. For the boomers like me, it's like loading the combat training into Neo's brain in The Matrix: new knowledge and skills are injected from outside.
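To make the fine-tuning route more concrete, here is a minimal sketch of how training data for it is usually prepared: most fine-tuning services (OpenAI's included) expect a JSONL file of chat-style examples. Everything in this snippet is a hypothetical placeholder; the `articles` list, the question template, and the helper name are illustrations, not part of the pipeline we build later.

```python
import json

# Hypothetical scraped articles; in practice these would come from your scraper.
articles = [
    {"title": "Bypassing bot protection", "summary": "An overview of common anti-bot systems."},
    {"title": "Scraping with headless browsers", "summary": "When and how to use browser automation."},
]

def build_finetune_examples(articles):
    """Turn articles into chat-format question-answer pairs (one JSONL row each)."""
    examples = []
    for art in articles:
        examples.append({
            "messages": [
                {"role": "system", "content": "You are an assistant expert in web scraping."},
                {"role": "user", "content": f"What is the article '{art['title']}' about?"},
                {"role": "assistant", "content": art["summary"]},
            ]
        })
    return examples

# Write one JSON object per line: the format fine-tuning endpoints typically ingest.
rows = build_finetune_examples(articles)
jsonl = "\n".join(json.dumps(r) for r in rows)
print(jsonl.splitlines()[0])
```

The resulting file would then be uploaded to the fine-tuning service of your choice; note how every fact the model will "learn" has to be spelled out as a training example up front, which is exactly the rigidity discussed below.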
In our example, the knowledge base could be the collection of scraped web articles stored in a specialized database. RAG is like an open-book exam for the LLM: at query time, it retrieves the relevant "page" of data and uses it to compose the answer, instead of relying only on its memory.

As you may have noticed, the key difference is where the knowledge lives. With fine-tuning, the knowledge lives in the model itself (its weights are updated); with RAG, the knowledge stays external. Fine-tuning is like permanently teaching the model new facts, whereas RAG is like equipping the model with a library it can consult on the fly.

The two approaches have quite different pros and cons.

Fine-Tuning

Pros: Once fine-tuned, the model can respond faster and in a more integrated way to questions about the new knowledge. It doesn't need lengthy prompts stuffed with documents each time. A well-fine-tuned model typically outperforms the base model on domain-specific questions because it has a deeper understanding of that niche terminology and content.

Cons: Fine-tuning can be resource-intensive and time-consuming: you need sufficient training data and computing power (or budget, if using a service). It also makes the model static concerning that training snapshot. If your scraped data changes or new information comes in, you'd have to fine-tune again to update the model. There's also a risk of the model catastrophically forgetting or overriding some of its original knowledge if not carefully managed. Importantly, fine-tuning means your data becomes part of the model's parameters, which could be a privacy concern if the model weights are exposed or if you use a third-party service to fine-tune (your data is uploaded for training). Last but not least, once the knowledge is embedded in the model, you cannot cite any article used to produce an answer.
Retrieval-Augmented Generation (RAG)

Pros: No need to modify the LLM itself: you leave the base model as-is and simply provide relevant context at query time. This makes updating or expanding the knowledge base easy: add or remove documents in your external index, and the model will use the latest data. It's very flexible and keeps your proprietary data external (which can be more secure). RAG can reduce hallucinations by grounding answers in real sources; essentially, the model has the "receipts" to back up its answer. It also typically requires less upfront work than full fine-tuning; most of the effort goes into setting up the retrieval system.

Cons: RAG introduces more moving parts: you need a system for embedding and indexing documents and retrieving them at runtime. At query time, you pay a cost in latency and token length for feeding documents into the model's prompt. If the retrieved documents aren't relevant (due to a bad query or a vector mismatch), the answer will suffer. The LLM is also limited by its input size; if the documents plus the question exceed the model's context window, you might have to truncate or select fewer documents. Additionally, the raw text of the documents might influence the model's style, which could lead to less coherent or conversational answers unless you prompt it to refine the wording.

In short, fine-tuning bakes knowledge into the model, while RAG gives the model on-demand access to external knowledge. For our use case of exploiting freshly scraped data, RAG looks like the better approach: you can continuously ingest new web data and have your assistant use it immediately, instead of periodically retraining the whole model. Before moving on, it's worth noting that fine-tuning and RAG are not mutually exclusive; they can also complement each other.
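The RAG loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not our final implementation: the "retriever" here scores documents by simple word overlap instead of real vector similarity, and the final LLM call is left out; we only assemble the grounded prompt that would be sent to the model.

```python
def retrieve(query, documents, top_k=2):
    """Toy retriever: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query, documents):
    """Assemble the augmented prompt: retrieved context first, then the question."""
    context = "\n---\n".join(retrieve(query, documents))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

docs = [
    "Kasada is a bot mitigation vendor used by many websites.",
    "Tableau is a BI tool for building dashboards.",
    "Bypassing Kasada usually requires a real browser fingerprint.",
]
prompt = build_prompt("how to bypass Kasada protection", docs)
print(prompt)
```

In the real pipeline below, the word-overlap scoring is replaced by embeddings and a vector database, but the overall shape of the flow (retrieve, stuff into prompt, generate) stays exactly the same.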
For example, you might fine-tune a model to adjust its tone or its ability to follow your instructions (or to add small, stable pieces of knowledge), and at the same time use RAG to give it access to a large knowledge base that updates frequently. In our case, though, RAG alone provides a simpler and highly effective way to let the AI assistant tap into the scraped knowledge, so that's the road we'll take.

Using a Local Model vs. an External API

Another consideration for your AI assistant is what LLM to use: a local (open-source) model you run yourself, or a hosted model accessed via an API (like OpenAI's GPT-3.5/GPT-4 or others). Both fine-tuning and RAG can be done with either, but there are trade-offs.

Local Open-Source LLMs: models like LLaMA 2, Mistral, or Falcon can run on your own server. The big benefit here is control and privacy. Your scraped data never leaves your environment, which matters if it contains sensitive information. You can fine-tune freely on your data or tweak how the model operates. Cost-wise, running a local model can be cheaper for large volumes of queries (no per-call API fees), but you'll need to invest in hardware or cloud infrastructure to host it. The catch is that most open models cannot match the performance of the latest GPT models. You may need larger or specialized models to get comparable results, which can be harder to manage, and maintaining and updating the model is on you. With a domain-specific dataset and some expertise, though, a local model can be fine-tuned to excel in a particular niche, making it a good "private GPT" solution.

External API LLMs (e.g. OpenAI's): using an API like OpenAI's GPT-4 means you don't have to manage the model at all; you simply send your queries to their service and get the completions back.
This is very convenient and usually gives you access to top-tier model quality without any infrastructure hassle. In our case, we can use RAG by simply including the retrieved documents in the prompt and calling the API. The drawbacks concern customization and privacy. Not every model is available for fine-tuning (OpenAI, for example, allows it on GPT-4o and GPT-4o-mini), and you have privacy considerations, since your data is sent to a third party with every query.

In summary, if your use case involves sensitive data or demands full control, a local LLM is worth the extra effort. If your priority is the best possible language capability and a quick setup, a hosted model like OpenAI's is likely the better choice. In this article, we'll use OpenAI's GPT API for its quality and convenience, but the retrieval pipeline we're building could feed an open-source model like Llama 2 via HuggingFace or LangChain just as well. The retrieval mechanism (vector database + similarity search) stays the same.

Having made these decisions, let's prepare the scraped articles for our AI assistant. We'll follow a RAG approach with an OpenAI model, which pairs nicely with continuously scraped web data and spares us the cost and rigidity of repeated fine-tuning.

Scraping TWSC with Firecrawl

Firecrawl is a web scraping engine exposed as a REST API and SDK. It is designed specifically to turn websites into LLM-ready data (in formats like plain text or markdown), handling all the heavy lifting such as crawling links, rendering JavaScript, and so on. A great advantage of Firecrawl is that, with a single API call, you can scrape an entire site.
This makes it straightforward to fetch all the articles of my newsletter and turn them into LLM-ready data.

For The Web Scraping Club blog, we'll use the sitemap to discover all the article URLs. (The blog is hosted on Substack, which provides an XML sitemap listing every post.) Firecrawl could crawl the site even without a sitemap, but using it as a starting point is more efficient and ensures we don't miss any pages.

First, we set up Firecrawl by installing its Python SDK and authenticating with an API key (assuming you've signed up and obtained one):

```python
from firecrawl import FirecrawlApp
import os

os.environ["FIRECRAWL_API_KEY"] = "YOURKEY"  # or load from .env
app = FirecrawlApp()

# Map the sitemap to collect all article URLs (we can loop this for multiple years if needed)
map_result = app.map_url('https://substack.thewebscraping.club/sitemap.xml', params={
    'includeSubdomains': True
})
print(map_result)

# Scrape each post (Substack post URLs contain '/p/') as Markdown.
# Note: accumulating the responses into scrape_results is an addition here,
# so that the upsert step later can iterate over scrape_results['data'].
scrape_results = {'data': []}
for article in map_result['links']:
    if '/p/' in article:
        print(article)
        response = app.scrape_url(url=article, params={'formats': ['markdown']})
        scrape_results['data'].append(response)
```

After a few minutes, we have all our articles in Markdown format.

Choosing a Vector Database for RAG

A vector database is a key component of RAG implementations. It stores embeddings of your documents (their vector representations) and allows fast retrieval of the documents most relevant to a given query embedding. Many options are available, including open-source libraries and managed cloud services, but for our implementation we'll use Pinecone. Unlike open-source databases that require self-hosting, Pinecone is a fully managed, cloud-native solution, which means we don't have to worry about infrastructure management. Pinecone takes care of embedding indexing, search latency, and scaling for us.
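Before wiring up Pinecone, it helps to see what a vector database actually does under the hood. The sketch below is a toy version of the idea, with hypothetical hand-made 3-number vectors standing in for real model embeddings (which have hundreds or thousands of dimensions): retrieval is just picking the stored vector with the highest cosine similarity to the query vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical embeddings; real ones come from an embedding model.
toy_index = {
    "article-about-kasada": [0.9, 0.1, 0.0],
    "article-about-tableau": [0.0, 0.2, 0.9],
}

query_vector = [0.8, 0.2, 0.1]  # pretend embedding of "how to bypass Kasada"

# Nearest-neighbour search: the core operation a vector database optimizes.
best = max(toy_index, key=lambda doc_id: cosine_similarity(query_vector, toy_index[doc_id]))
print(best)  # → article-about-kasada
```

A managed service like Pinecone performs the same nearest-neighbour search, but over millions of high-dimensional vectors, with indexing structures that keep it fast.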
Setting up Pinecone

The first steps are signing up for Pinecone and getting your API key and environment values from the dashboard. Then we can install the Python SDK as usual:

pip install pinecone

and connect to Pinecone from our script:

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR API KEY")
```

The environment value can be found in the Pinecone web console where you created your API key.

Creating a Pinecone Index

The index is where your data is stored for later retrieval by the LLM. Instead of holding raw text, it holds the text's vector representation (essentially a string of numbers), which lets the engine find, for any question, the entries in the index that best match it. In a Pinecone vector database, alongside each vector we also have metadata: extra information stored with each vector (embedding) to make retrieval more meaningful. While vectors are the numerical embeddings used for similarity search, metadata carries human-readable information attached to them, enabling filtering, categorization, and interpretability. Where the vectors are used to retrieve the documents, the metadata carries information we can pass to the LLM together with them, such as the author of the retrieved post, its title, and its link.

```python
index_name = "article-index"

if not pc.has_index(index_name):
    index_model = pc.create_index_for_model(
        name=index_name,
        cloud="aws",
        region="us-east-1",
        embed={
            "model": "llama-text-embed-v2",
            "field_map": {"text": "chunk_text"}
        }
    )

# pc.describe_index(index_name)  # to get the host
index = pc.Index(host='YOURINDEXENDPOINT')
```
```python
import time

article_data = [
    {
        "id": f"article-{i}",
        "text": page["markdown"],
        "url": page["metadata"]["url"],
        "title": page["metadata"]["title"],
        "author": page["metadata"]["author"][0],
    }
    for i, page in enumerate(scrape_results['data'])
]
# print(article_data)

# Generate embeddings using llama-text-embed-v2
for article in article_data:
    # Extract the article text
    text = article["text"]
    # print(text)

    # Single article insert to avoid failures
    embedding = pc.inference.embed(
        model="llama-text-embed-v2",
        inputs=[text],
        parameters={"input_type": "passage"}
    )

    # Prepare data for Pinecone upsert
    vector_data = {
        "id": article["id"],               # Unique article ID
        "values": embedding[0]["values"],  # Embedding vector
        "metadata": {
            "url": article["url"],         # Store article URL
            "content": text[:300],         # Store first 300 chars as a preview/snippet
            "title": article["title"][:100],
            "author": article["author"][:50],
        }
    }
    # print(vector_data)

    # Upsert the single article into Pinecone
    index.upsert(vectors=[vector_data], namespace="articles")
    print(f"✅ Upserted: {article['id']} ({article['title']})")

    # Optional: add a short delay to prevent API rate limits (adjust as needed)
    time.sleep(1)
```

As you can see from the code, we iterate over all the previously scraped articles and add them to the newly created article-index. If you want to play around with Pinecone, there's complete documentation on its website.

But now that all the articles are loaded into the index, can we retrieve the information we need? I created a basic script called query.py to test the search results against the index.
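The query script isn't shown in full, so here is a minimal sketch of what it could look like, reusing the Pinecone calls already seen above (same embedding model and namespace, the index host placeholder left as-is). The `format_matches` helper is a hypothetical addition just for printing the results.

```python
def query_index(question, top_k=3):
    """Embed the question and fetch the nearest articles from Pinecone.

    Sketch only: assumes the index, namespace, and credentials created earlier.
    """
    from pinecone import Pinecone

    pc = Pinecone(api_key="YOUR API KEY")
    index = pc.Index(host='YOURINDEXENDPOINT')

    # Embed the question with the same model used for the articles,
    # but with input_type "query" instead of "passage".
    embedding = pc.inference.embed(
        model="llama-text-embed-v2",
        inputs=[question],
        parameters={"input_type": "query"},
    )
    return index.query(
        namespace="articles",
        vector=embedding[0]["values"],
        top_k=top_k,
        include_metadata=True,
    )

def format_matches(result):
    """Render each match as 'score  title -> url' for quick inspection."""
    return [
        f"{m['score']:.3f}  {m['metadata']['title']} -> {m['metadata']['url']}"
        for m in result["matches"]
    ]
```

Setting `include_metadata=True` is what makes the title, author, and URL stored at upsert time come back with each match, so the results can be shown (or passed to the LLM) with their sources attached.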
When asking "Can you find me some articles about bypassing Kasada?", the query returns the following documents:

```
{'matches': [{'id': 'article-0',
              'metadata': {'author': 'Pierluigi Vinciguerra',
                           ...,
                           'title': 'THE LAB #76: Bypassing Kasada With Open '
                                    'Source Tools In 2025',
                           'url': 'https://substack.thewebscraping.club/p/bypassing-kasada-2025-open-source'},
              'score': 0.419812053,
              'values': []},
             {'id': 'article-129',
              'metadata': {'author': 'Pierluigi Vinciguerra',
                           ...,
                           'title': 'How to by-pass Kasada bot mitigation?',
                           'url': 'https://substack.thewebscraping.club/p/how-to-by-pass-kasada-bot-mitigation'},
              'score': 0.418432325,
              'values': []},
             {'id': 'article-227',
              'metadata': {'author': 'Pierluigi Vinciguerra',
                           ...,
                           'title': 'Scraping Kasada protected websites',
                           'url': 'https://substack.thewebscraping.club/p/scraping-kasada-protected-websites'},
              'score': 0.378159761,
              'values': []}],
 'namespace': 'articles',
 'usage': {'read_units': 6}}
```

Not bad! All the returned articles are on topic! For today, that's all: in the next episode, we'll see how to connect this DB to GPT-4 and build a simple UI for writing prompts and getting the data we need.

The article is part of "The Lab" series by Pierluigi Vinciguerra. Check out his Substack page for more knowledge on Web Scraping.