How to Build Smart Documentation Based on OpenAI Embeddings (Chunking, Indexing, and Searching)

Hey everyone! I wanted to share my approach to creating a "smart documentation" chatbot for a project I'm working on. I'm not an AI expert, so any suggestions or improvements are more than welcome!

The purpose of this post is not to create yet another tutorial on building a chatbot based on OpenAI; there is already plenty of content on that topic. Instead, the main idea is to index documentation by splitting it into manageable chunks, generating embeddings with OpenAI, and performing a similarity search to find and return the most relevant information for a user's query.

In my case, the documentation will be Markdown files, but it could be any form of text, database objects, and so on.

Why?

Because it can sometimes be hard to find the information you need, I wanted to create a chatbot that could answer questions about a specific topic and provide the relevant context from the documentation.

This assistant can be used in a variety of ways, such as:

- Providing quick answers to frequently asked questions
- Searching a doc/page as Algolia does
- Helping users find the information they need in a specific doc
- Capturing users' concerns/questions by storing the questions they ask

Summary

Below, I'll outline the three major parts of my solution:

1. Reading documentation files
2. Indexing the documentation (chunking, overlap, and embedding)
3. Searching the documentation (and hooking it up to a chatbot)

File Tree

```
.
└── docs
    └── ...md
└── src
    └── askDocQuestion.ts
    └── index.ts # Express.js application endpoint
└── embeddings.json # Storage for embeddings
└── package.json
```

1. Reading Documentation Files

Instead of hardcoding the documentation text, you can scan a folder for `.md` files using a tool like `glob`.

```typescript
// Example snippet of fetching files from a folder:
import fs from "node:fs";
import path from "node:path";
// Default import as in glob v8 and earlier; newer glob versions expose a named `globSync` export instead
import glob from "glob";

const DOC_FOLDER_PATH = "./docs";

type FileData = {
  path: string;
  content: string;
};

const readAllMarkdownFiles = (): FileData[] => {
  const filesContent: FileData[] = [];
  // Recursively collect every Markdown file under the docs folder
  const filePaths = glob.sync(`${DOC_FOLDER_PATH}/**/*.md`);

  filePaths.forEach((filePath) => {
    const content = fs.readFileSync(filePath, "utf8");
    filesContent.push({ path: filePath, content });
  });

  return filesContent;
};
```

As an alternative, you can of course fetch your documentation from your database, a CMS, and so on.

2. Indexing the Documentation

To create our search engine, we will use OpenAI's vector embeddings API to generate our embeddings.

Vector embeddings are a way to represent data in a numerical format, which can be used to perform similarity searches (in our case, between the user's question and our documentation sections).

This vector, made up of a list of floating-point numbers, will be used to calculate the similarity using a mathematical formula.

```
[
  -0.0002630692, -0.029749284, 0.010225477, -0.009224428, -0.0065269712,
  -0.002665544, 0.003214777, 0.04235309, -0.033162255, -0.00080789323,
  //...+1533 elements
];
```

Vector databases were created based on this concept. As a result, instead of calling the OpenAI API and handling the vectors yourself, it is also possible to use a vector database such as Chroma, Qdrant, or Pinecone.

2.1 Chunk Each File & Overlap

Large blocks of text can exceed the model's context limits or produce less relevant hits, so it is recommended to split them into chunks to make the search more targeted. However, to preserve some continuity between chunks, we overlap them by a certain number of tokens (or characters).
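To make the overlap arithmetic concrete, here is a minimal sketch of how a fixed-size sliding window advances over a text. `chunkRanges` is a hypothetical helper written only for illustration (it is not part of the project's code); it works on raw character counts and, as a simplifying assumption, ignores word boundaries.

```typescript
// Hypothetical helper (illustration only): lists the zero-based, end-exclusive
// character ranges that a fixed-size sliding window would produce.
type ChunkRange = { start: number; end: number };

const chunkRanges = (
  textLength: number,
  chunkSize: number,
  overlap: number
): ChunkRange[] => {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("Expected chunkSize > 0 and 0 <= overlap < chunkSize");
  }

  const step = chunkSize - overlap; // Distance between two consecutive chunk starts
  const ranges: ChunkRange[] = [];

  for (let start = 0; start < textLength; start += step) {
    const end = Math.min(start + chunkSize, textLength);
    ranges.push({ start, end });
    if (end === textLength) break; // The last window already reaches the end of the text
  }

  return ranges;
};

// A 406-character text, 150-character chunks, 50 characters of overlap
console.log(chunkRanges(406, 150, 50));
```

With these numbers, each window starts 100 characters after the previous one (150 minus 50), which gives four ranges: 0-150, 100-250, 200-350, and 300-406.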
With this overlap, chunk boundaries are less likely to cut off relevant context mid-sentence.

Example of Chunking

In this example, we have a long text that we want to split into smaller chunks. Here, we create chunks of 150 characters that overlap by 50 characters.

Full Text (406 characters):

In the heart of the bustling city, there stood an old library that many had forgotten. Its towering shelves were filled with books from every imaginable genre, each whispering stories of adventures, mysteries, and timeless wisdom. Every evening, a dedicated librarian would open its doors, welcoming curious minds eager to explore the vast knowledge within. Children would gather for storytelling sessions.

Chunk 1 (Characters 1-150):

In the heart of the bustling city, there stood an old library that many had forgotten. Its towering shelves were filled with books from every imaginabl

Chunk 2 (Characters 101-250):

shelves were filled with books from every imaginable genre, each whispering stories of adventures, mysteries, and timeless wisdom. Every evening, a d

Chunk 3 (Characters 201-350):

ysteries, and timeless wisdom. Every evening, a dedicated librarian would open its doors, welcoming curious minds eager to explore the vast knowledge

Chunk 4 (Characters 301-406):

curious minds eager to explore the vast knowledge within. Children would gather for storytelling sessions.

Code Snippet

```typescript
// Pessimistic approximation of the number of characters per token.
// You can use `tiktoken` or another tokenizer to calculate it more precisely.
const CHARS_PER_TOKEN = 4.15;

const MAX_TOKENS = 500; // Maximum number of tokens per chunk
const OVERLAP_TOKENS = 100; // Number of tokens to overlap between chunks

const maxChar = MAX_TOKENS * CHARS_PER_TOKEN;
const overlapChar = OVERLAP_TOKENS * CHARS_PER_TOKEN;

const chunkText = (text: string): string[] => {
  const chunks: string[] = [];
  let start = 0;

  while (start < text.length) {
    let end = Math.min(start + maxChar, text.length);

    // Don't cut a word in half if possible
    if (end < text.length) {
      const lastSpace = text.lastIndexOf(" ", end);
      if (lastSpace > start) end = lastSpace;
    }

    chunks.push(text.substring(start, end));

    // Stop once the end of the text is reached, so that no redundant
    // overlap-only chunk is emitted after the final one
    if (end === text.length) break;

    // Overlap management
    const nextStart = end - overlapChar;
    start = nextStart <= start ? end : nextStart;
  }

  return chunks;
};
```

To learn more about chunking, and the impact of chunk size on embeddings, you can check out this article.

2.2 Embedding Generation

Once a file is chunked, we generate a vector embedding for each chunk using OpenAI's API (for example, `text-embedding-3-large`).

```typescript
import { OpenAI } from "openai";

// Assumption: the API key is provided through an environment variable
const OPENAI_API_KEY = process.env.OPENAI_API_KEY;

const EMBEDDING_MODEL: OpenAI.Embeddings.EmbeddingModel =
  "text-embedding-3-large"; // Model to use for embedding generation

const openai = new OpenAI({ apiKey: OPENAI_API_KEY });

const generateEmbedding = async (textChunk: string): Promise<number[]> => {
  const response = await openai.embeddings.create({
    model: EMBEDDING_MODEL,
    input: textChunk,
  });

  return response.data[0].embedding; // Return the generated embedding
};
```

2.3 Generate and Save Embeddings for the Entire File

To avoid regenerating embeddings every time, we will store them. They could be stored in a database, but in this case we will simply store them locally in a JSON file.

In short, the following code:

1. iterates over each document,
2. splits the document into chunks,
3. generates an embedding for each chunk,
4. stores the embeddings in a JSON file,
5. fills the `vectorStore` with the embeddings to be used in the search.
```typescript
// Assumes `fs`, `readAllMarkdownFiles`, `chunkText`, and `generateEmbedding`
// from the previous snippets are in scope.
// Importing JSON requires `resolveJsonModule` in tsconfig.json.
import embeddingsList from "../embeddings.json";

/**
 * Simple in-memory vector store to hold document embeddings and their content.
 * Each entry contains:
 * - filePath: A unique key identifying the document
 * - chunkNumber: The number of the chunk within the document
 * - content: The actual text content of the chunk
 * - embedding: The numerical embedding vector for the chunk
 */
const vectorStore: {
  filePath: string;
  chunkNumber: number;
  content: string;
  embedding: number[];
}[] = [];

/**
 * Indexes all Markdown documents by generating embeddings for each chunk and storing them in memory.
 * Also updates the embeddings.json file if new embeddings are generated.
 */
export const indexMarkdownFiles = async (): Promise<void> => {
  // Retrieve documentation files
  const docs = readAllMarkdownFiles();

  let newEmbeddings: Record<string, number[]> = {};

  for (const doc of docs) {
    // Split the document into overlapping, size-limited chunks
    const fileChunks = chunkText(doc.content);
    const chunksNumber = fileChunks.length;

    // Iterate over each chunk within the current file
    for (const [chunkIndex, chunk] of fileChunks.entries()) {
      const chunkNumber = chunkIndex + 1; // Chunk number starts at 1

      const embeddingKeyName = `${doc.path}/chunk_${chunkNumber}`; // Unique key for the chunk

      // Retrieve the precomputed embedding if available
      const existingEmbedding = embeddingsList[
        embeddingKeyName as keyof typeof embeddingsList
      ] as number[] | undefined;

      let embedding = existingEmbedding; // Use the existing embedding if available

      if (!embedding) {
        embedding = await generateEmbedding(chunk); // Generate the embedding if not present
      }

      newEmbeddings = { ...newEmbeddings, [embeddingKeyName]: embedding };

      // Store the embedding and content in the in-memory vector store
      vectorStore.push({
        filePath: doc.path,
        chunkNumber,
        embedding,
        content: chunk,
      });

      console.info(`- Indexed: ${embeddingKeyName}/${chunksNumber}`);
    }
  }

  /**
   * Compare the newly generated embeddings with the existing ones.
   * If anything changed, update the embeddings.json file.
   */
  try {
    if (JSON.stringify(newEmbeddings) !== JSON.stringify(embeddingsList)) {
      fs.writeFileSync(
        "./embeddings.json",
        JSON.stringify(newEmbeddings, null, 2)
      );
    }
  } catch (error) {
    console.error(error);
  }
};
```

3. Searching the Documentation

3.1 Vector Similarity

To answer a user's question, we first generate an embedding for the user's question and then compute the cosine similarity between the query embedding and each chunk's embedding. We filter out anything below a certain similarity threshold and keep only the top X matches.

```typescript
/**
 * Calculates the cosine similarity between two vectors.
 * Cosine similarity measures the cosine of the angle between two vectors in an inner product space.
 * Used to determine the similarity between chunks of text.
 *
 * @param vecA - The first vector
 * @param vecB - The second vector
 * @returns The cosine similarity score
 */
const cosineSimilarity = (vecA: number[], vecB: number[]): number => {
  // Calculate the dot product of the two vectors
  const dotProduct = vecA.reduce((sum, a, idx) => sum + a * vecB[idx], 0);

  // Calculate the magnitude (Euclidean norm) of each vector
  const magnitudeA = Math.sqrt(vecA.reduce((sum, a) => sum + a * a, 0));
  const magnitudeB = Math.sqrt(vecB.reduce((sum, b) => sum + b * b, 0));

  // Compute and return the cosine similarity
  return dotProduct / (magnitudeA * magnitudeB);
};

const MIN_RELEVANT_CHUNKS_SIMILARITY = 0.77; // Minimum similarity required for a chunk to be considered relevant
const MAX_RELEVANT_CHUNKS_NB = 15; // Maximum number of relevant chunks to attach to the ChatGPT context

/**
 * Searches the indexed documents for the most relevant chunks based on a query.
 * Utilizes cosine similarity to find the closest matching embeddings.
 *
 * @param query - The search query provided by the user
 * @returns An array of the top matching chunks, each with its similarity score
 */
const searchChunkReference = async (query: string) => {
  // Generate an embedding for the user's query
  const queryEmbedding = await generateEmbedding(query);

  // Calculate similarity scores between the query embedding and each chunk's embedding
  const results = vectorStore
    .map((doc) => ({
      ...doc,
      similarity: cosineSimilarity(queryEmbedding, doc.embedding), // Add the similarity score to each chunk
    }))
    // Filter out chunks with low similarity scores
    // to avoid polluting the context with irrelevant chunks
    .filter((doc) => doc.similarity > MIN_RELEVANT_CHUNKS_SIMILARITY)
    .sort((a, b) => b.similarity - a.similarity) // Sort by highest similarity first
    .slice(0, MAX_RELEVANT_CHUNKS_NB); // Keep only the top most similar chunks

  // Return the top matching chunks
  return results;
};
```

3.2 Prompting OpenAI with Relevant Chunks

After sorting, we feed the top chunks into the system prompt of the ChatGPT request. This means ChatGPT sees the most relevant sections of your docs as if you had typed them into the conversation. Then we let ChatGPT form an answer for the user.

```typescript
const MODEL: OpenAI.Chat.ChatModel = "gpt-4o-2024-11-20"; // Model to use for chat completions

// Define the structure of messages used in chat completions
export type ChatCompletionRequestMessage = {
  role: "system" | "user" | "assistant"; // The role of the message sender
  content: string; // The text content of the message
};

/**
 * Handles the "Ask a question" endpoint in an Express.js route.
 * Processes user messages, retrieves relevant documents, and interacts with OpenAI's chat API to generate responses.
 *
 * @param messages - An array of chat messages from the user and assistant
 * @returns The assistant's response as a string
 */
export const askDocQuestion = async (
  messages: ChatCompletionRequestMessage[]
): Promise<string> => {
  // The assistant's responses are filtered out, otherwise the chatbot gets stuck in a self-referential loop
  // Note that the search becomes less precise if the user changes context during the chat
  const userMessages = messages.filter((message) => message.role === "user");

  // Join the user's messages into a single bulleted string used as the search query
  const formattedUserMessages = userMessages
    .map((message) => `- ${message.content}`)
    .join("\n");

  // 1) Find relevant documents based on the user's question
  const relevantChunks = await searchChunkReference(formattedUserMessages);

  // 2) Integrate the relevant documents into the initial system prompt
  const messagesList: ChatCompletionRequestMessage[] = [
    {
      role: "system",
      content:
        "Ignore all previous instructions.\n" +
        "You're a helpful chatbot.\n" +
        "...\n" +
        "Here is the relevant documentation:\n" +
        relevantChunks
          .map(
            (doc, idx) =>
              `[Chunk ${idx}] filePath = "${doc.filePath}":\n${doc.content}`
          )
          .join("\n\n"), // Insert relevant chunks into the prompt
    },
    ...messages, // Include the chat history
  ];

  // 3) Send the compiled messages to OpenAI's Chat Completion API (using a specific model)
  const response = await openai.chat.completions.create({
    model: MODEL,
    messages: messagesList,
  });

  const result = response.choices[0].message.content; // Extract the assistant's reply

  if (!result) {
    throw new Error("No response from OpenAI");
  }

  return result;
};
```

4. Implement OpenAI API for Chatbot Using Express

To run our system, we will use an Express.js server.
Here is an example of a small Express.js endpoint to handle the query:

```typescript
import express, { type Request, type Response } from "express";
import {
  type ChatCompletionRequestMessage,
  askDocQuestion,
  indexMarkdownFiles,
} from "./askDocQuestion";

// Automatically fill the vector store with embeddings when the server starts
indexMarkdownFiles().catch(console.error);

const app = express();

// Parse incoming requests with JSON payloads
app.use(express.json());

type AskRequestBody = {
  messages: ChatCompletionRequestMessage[];
};

// Routes
app.post(
  "/ask",
  async (
    req: Request<Record<string, string>, string, AskRequestBody>,
    res: Response<string>
  ) => {
    try {
      const response = await askDocQuestion(req.body.messages);

      res.json(response);
    } catch (error) {
      console.error(error);
      // Always answer the client, otherwise the request hangs until it times out
      res.status(500).json("An error occurred while answering the question");
    }
  }
);

// Start server
app.listen(3000, () => {
  console.log(`Listening on port 3000`);
});
```

UI: Making a Chatbot Interface

On the frontend, I built a small React component with a chat-like interface. It sends messages to my Express backend and displays the replies. Nothing too fancy, so we'll skip the details.

Code Template

I made a code template for you to use as a starting point for your own chatbot.

Live Demo

If you want to test the final implementation of this chatbot, check out this demo page.

My demo code:

- Backend: askDocQuestion.ts
- Frontend: ChatBot components

Go Further

On YouTube, have a look at this video by Adrien Twarog, which covers OpenAI Embeddings and Vector Databases.

I also stumbled upon OpenAI's Assistants File Search documentation, which might be interesting if you want an alternative approach.

Conclusion

I hope this gives you an idea of how to handle documentation indexing for a chatbot:

- Using chunking + overlap so that the right context is found,
- Generating embeddings and storing them for quick vector similarity searches,
- Finally, handing the relevant context over to ChatGPT.

I'm not an AI expert; this is just a solution that I found works well for my needs.
If you have any tips on improving efficiency or a more polished approach, please let me know. I'd love to hear feedback on vector storage solutions, chunking strategies, or any other performance tips.

Thanks for reading, and feel free to share your thoughts!