Unlocking Precision in RAG Applications: Harnessing Knowledge Graphs with Neo4j and LangChain

by Neo4j, 9 min read, 2024/10/21

Too Long; Didn't Read

This blog post shows how to create a knowledge graph using LangChain. The code is available on GitHub. You need to set up a Neo4j instance. For this demonstration, we will use Elizabeth I's Wikipedia page. We can use LangChain loaders to fetch and split the documents from Wikipedia.


Graph retrieval-augmented generation (GraphRAG) is gaining momentum and becoming a powerful addition to traditional vector search retrieval methods. This approach leverages the structured nature of graph databases, which organize data as nodes and relationships, to enhance the depth and contextuality of retrieved information.



Example of a knowledge graph.

Graphs are great at representing and storing heterogeneous and interconnected information in a structured manner, effortlessly capturing complex relationships and attributes across diverse data types. In contrast, vector databases often struggle with such structured information, as their strength lies in handling unstructured data through high-dimensional vectors. In your RAG application, you can combine structured graph data with vector search through unstructured text to achieve the best of both worlds. That is what we will demonstrate in this blog post.

Knowledge Graphs Are Great, But How Do You Create One?

Constructing a knowledge graph is typically the most challenging step. It involves gathering and structuring the data, which requires a deep understanding of both the domain and graph modeling.


To simplify this process, we have been experimenting with LLMs. With their deep understanding of language and context, LLMs can automate significant parts of the knowledge graph creation process. By analyzing text data, these models can identify entities, understand their relationships, and suggest how they might best be represented in a graph structure.


As a result of these experiments, we have added the first version of the graph construction module to LangChain, which we will demonstrate in this blog post.


The code is available on GitHub.

Neo4j Environment Setup

You need to set up a Neo4j instance to follow along with the examples in this blog post. The easiest way is to start a free instance on Neo4j Aura, which offers cloud instances of the Neo4j database. Alternatively, you can set up a local instance of the Neo4j database by downloading the Neo4j Desktop application and creating a local database instance.


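    import os

    from langchain_community.graphs import Neo4jGraph

    # Placeholder credentials; replace with your own values
    os.environ["OPENAI_API_KEY"] = "sk-"
    os.environ["NEO4J_URI"] = "bolt://localhost:7687"
    os.environ["NEO4J_USERNAME"] = "neo4j"
    os.environ["NEO4J_PASSWORD"] = "password"

    # Connection settings are read from the NEO4J_* variables above
    graph = Neo4jGraph()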


In addition, you must provide an OpenAI key, as we will use their models in this blog post.

Data Ingestion

For this demonstration, we will use Elizabeth I's Wikipedia page. We can use LangChain loaders to fetch and split the documents from Wikipedia seamlessly.


    from langchain.text_splitter import TokenTextSplitter
    from langchain_community.document_loaders import WikipediaLoader

    # Read the wikipedia article
    raw_documents = WikipediaLoader(query="Elizabeth I").load()
    # Define chunking strategy
    text_splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=24)
    documents = text_splitter.split_documents(raw_documents[:3])
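To illustrate what chunk_size=512 and chunk_overlap=24 mean, here is a minimal, dependency-free sketch of sliding-window chunking. The function name split_tokens and the fake token list are assumptions made for illustration; a real splitter works on tokenizer output and may differ in detail.

```python
from typing import List


def split_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return chunks


# Fake tokens stand in for real tokenizer output (an assumption).
tokens = [f"t{i}" for i in range(1200)]
chunks = split_tokens(tokens, chunk_size=512, chunk_overlap=24)
print([len(c) for c in chunks])  # [512, 512, 224]
```

Each chunk starts 488 tokens after the previous one, so neighboring chunks share 24 tokens and a sentence cut at a boundary still appears whole in one of them.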


It's time to construct a graph based on the retrieved documents. For this purpose, we have implemented an LLMGraphTransformer module that significantly simplifies constructing and storing a knowledge graph in a graph database.


    from langchain_experimental.graph_transformers import LLMGraphTransformer
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(temperature=0, model_name="gpt-4-0125-preview")
    llm_transformer = LLMGraphTransformer(llm=llm)

    # Extract graph data
    graph_documents = llm_transformer.convert_to_graph_documents(documents)
    # Store to neo4j
    graph.add_graph_documents(
        graph_documents,
        baseEntityLabel=True,
        include_source=True
    )


You can define which LLM you want the knowledge graph generation chain to use. At the moment, we support only function-calling models from OpenAI and Mistral. However, we plan to expand the LLM selection in the future. In this example, we are using the latest GPT-4. Note that the quality of the generated graph depends significantly on the model you are using. In theory, you always want to use the most capable one. The LLM graph transformers return graph documents, which can be imported to Neo4j via the add_graph_documents method. The baseEntityLabel parameter assigns an additional __Entity__ label to each node, enhancing indexing and query performance. The include_source parameter links nodes to their originating documents, facilitating data traceability and context understanding.
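As a rough mental model of what the two import parameters do, the sketch below uses small stand-in dataclasses (Node, Relationship, GraphDocument, and describe_import are hypothetical names, not LangChain's classes) and prints the kind of graph patterns an import would produce. It assumes that source documents are attached through MENTIONS relationships, which matches the Cypher used later in this post.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    id: str
    type: str


@dataclass(frozen=True)
class Relationship:
    source: Node
    target: Node
    type: str


@dataclass
class GraphDocument:
    nodes: List[Node]
    relationships: List[Relationship]
    source_text: str  # the text chunk the graph was extracted from


def describe_import(doc: GraphDocument, base_entity_label: bool = True,
                    include_source: bool = True) -> List[str]:
    """Describe, as plain strings, what importing this document would write."""
    rows = []
    for node in doc.nodes:
        labels = [node.type] + (["__Entity__"] if base_entity_label else [])
        label_str = ":".join(labels)
        rows.append(f"(:{label_str} {{id: '{node.id}'}})")
        if include_source:
            # Assumption: the source chunk becomes a Document node that MENTIONS the entity
            rows.append(f"(:Document)-[:MENTIONS]->('{node.id}')")
    for rel in doc.relationships:
        rows.append(f"('{rel.source.id}')-[:{rel.type}]->('{rel.target.id}')")
    return rows


# Toy data for illustration only
elizabeth = Node("Elizabeth I", "Person")
tudor = Node("House Of Tudor", "Organization")
doc = GraphDocument(
    nodes=[elizabeth, tudor],
    relationships=[Relationship(elizabeth, tudor, "MEMBER_OF")],
    source_text="Elizabeth I was the last monarch of the House of Tudor.",
)
for row in describe_import(doc):
    print(row)
```

The shared __Entity__ label is what lets a single index cover every extracted node, regardless of its specific type.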


You can inspect the generated graph in the Neo4j Browser.


Part of the generated graph.


Note that this image represents only a part of the generated graph.


Hybrid Retrieval for RAG

After the graph generation, we will use a hybrid retrieval approach that combines vector and keyword indexes with graph retrieval for RAG applications.


Combining hybrid (vector + keyword) and graph retrieval methods. Image by author.


The diagram illustrates a retrieval process that begins with a user posing a question, which is then directed to a RAG retriever. This retriever employs keyword and vector searches to search through unstructured text data and combines the results with the information it collects from the knowledge graph. Since Neo4j features both keyword and vector indexes, you can implement all three retrieval options with a single database system. The data collected from these sources is fed into an LLM to generate and deliver the final answer.

Unstructured Data Retriever

You can use the Neo4jVector.from_existing_graph method to add both keyword and vector retrieval to documents. This method configures keyword and vector search indexes for a hybrid search approach, targeting nodes labeled Document. Additionally, it calculates text embedding values if they are missing.


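    from langchain_community.vectorstores import Neo4jVector
    from langchain_openai import OpenAIEmbeddings

    vector_index = Neo4jVector.from_existing_graph(
        OpenAIEmbeddings(),
        search_type="hybrid",
        node_label="Document",
        text_node_properties=["text"],
        embedding_node_property="embedding"
    )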


The vector index can then be queried with the similarity_search method.
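The post does not describe how vector and keyword hits are merged internally, so the following is only one plausible strategy, sketched without any database: normalize each result list by its top score, then keep the best score per document. The names normalize and merge_hybrid, and the scores themselves, are made up for illustration.

```python
from typing import Dict, List, Tuple


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores so the best hit in a result list gets 1.0."""
    if not scores:
        return {}
    top = max(scores.values())
    if top <= 0:
        return {doc: 0.0 for doc in scores}
    return {doc: score / top for doc, score in scores.items()}


def merge_hybrid(vector_hits: Dict[str, float], keyword_hits: Dict[str, float],
                 k: int = 4) -> List[Tuple[str, float]]:
    """Combine two result lists, keeping the best normalized score per document."""
    merged: Dict[str, float] = {}
    for hits in (normalize(vector_hits), normalize(keyword_hits)):
        for doc, score in hits.items():
            merged[doc] = max(score, merged.get(doc, 0.0))
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)[:k]


# Cosine-like vector scores and unbounded keyword scores live on different scales
vector_hits = {"chunk-1": 0.92, "chunk-2": 0.80}
keyword_hits = {"chunk-2": 7.5, "chunk-3": 3.0}
result = merge_hybrid(vector_hits, keyword_hits)
print([doc for doc, _ in result])  # ['chunk-1', 'chunk-2', 'chunk-3']
```

Normalizing first matters because raw keyword scores are not bounded the way similarity scores are, so comparing them directly would let one index dominate.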

Graph Retriever

On the other hand, configuring graph retrieval is more involved but offers more freedom. This example will use a full-text index to identify relevant nodes and return their direct neighborhood.


Graph retriever. Image by author.



The graph retriever starts by identifying relevant entities in the input. For simplicity, we instruct the LLM to identify people, organizations, and locations. To achieve this, we will use LCEL with the newly added with_structured_output method.


    from typing import List

    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.pydantic_v1 import BaseModel, Field

    # Extract entities from text
    class Entities(BaseModel):
        """Identifying information about entities."""

        names: List[str] = Field(
            ...,
            description="All the person, organization, or business entities that "
            "appear in the text",
        )

    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are extracting organization and person entities from the text.",
            ),
            (
                "human",
                "Use the given format to extract information from the following "
                "input: {question}",
            ),
        ]
    )

    entity_chain = prompt | llm.with_structured_output(Entities)


Let's test it out:


    entity_chain.invoke({"question": "Where was Amelia Earhart born?"}).names
    # ['Amelia Earhart']


Great, now that we can detect entities in the question, let's use a full-text index to map them to the knowledge graph. First, we need to define a full-text index and a function that generates full-text queries that tolerate a bit of misspelling, which we won't go into in detail here.


    from langchain_community.vectorstores.neo4j_vector import remove_lucene_chars

    graph.query(
        "CREATE FULLTEXT INDEX entity IF NOT EXISTS FOR (e:__Entity__) ON EACH [e.id]")

    def generate_full_text_query(input: str) -> str:
        """
        Generate a full-text search query for a given input string.

        This function constructs a query string suitable for a full-text search.
        It processes the input string by splitting it into words and appending a
        similarity threshold (~2 changed characters) to each word, then combines
        them using the AND operator. Useful for mapping entities from user questions
        to database values, and allows for some misspellings.
        """
        words = [el for el in remove_lucene_chars(input).split() if el]
        # Joining also covers empty input, which would otherwise fail on words[-1]
        return " AND ".join(f"{word}~2" for word in words)
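To see what kind of query string such a function produces, here is a self-contained sketch that needs no LangChain install. The helper strip_lucene_specials is a simplified, hypothetical stand-in for remove_lucene_chars and may not strip exactly the same characters.

```python
import re


def strip_lucene_specials(text: str) -> str:
    """Hypothetical stand-in for remove_lucene_chars: blank out Lucene operator characters."""
    return re.sub(r'[+\-&|!(){}\[\]^"~*?:\\/]', " ", text)


def fuzzy_query(text: str) -> str:
    """Append ~2 (up to two character edits) to every word and require all words."""
    words = strip_lucene_specials(text).split()
    return " AND ".join(f"{word}~2" for word in words)


print(fuzzy_query("Elizabeth I"))      # Elizabeth~2 AND I~2
print(fuzzy_query("Amelia Earhart?"))  # Amelia~2 AND Earhart~2
```

The ~2 suffix is Lucene's fuzzy operator, so a misspelled name in the question can still match the stored entity id.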



Let's put it all together now.


    # Fulltext index query
    def structured_retriever(question: str) -> str:
        """
        Collects the neighborhood of entities mentioned
        in the question
        """
        lines = []
        entities = entity_chain.invoke({"question": question})
        for entity in entities.names:
            # `WITH node` imports the matched node into each subquery branch
            response = graph.query(
                """CALL db.index.fulltext.queryNodes('entity', $query, {limit:2})
                YIELD node, score
                CALL {
                  WITH node
                  MATCH (node)-[r:!MENTIONS]->(neighbor)
                  RETURN node.id + ' - ' + type(r) + ' -> ' + neighbor.id AS output
                  UNION
                  WITH node
                  MATCH (node)<-[r:!MENTIONS]-(neighbor)
                  RETURN neighbor.id + ' - ' + type(r) + ' -> ' + node.id AS output
                }
                RETURN output LIMIT 50
                """,
                {"query": generate_full_text_query(entity)},
            )
            lines.extend(el["output"] for el in response)
        # One relationship per line, including across different entities
        return "\n".join(lines)



The structured_retriever function starts by detecting entities in the user question. Next, it iterates over the detected entities and uses a Cypher template to retrieve the neighborhood of relevant nodes. Let's test it out!


    print(structured_retriever("Who is Elizabeth I?"))
    # Elizabeth I - BORN_ON -> 7 September 1533
    # Elizabeth I - DIED_ON -> 24 March 1603
    # Elizabeth I - TITLE_HELD_FROM -> Queen Of England And Ireland
    # Elizabeth I - TITLE_HELD_UNTIL -> 17 November 1558
    # Elizabeth I - MEMBER_OF -> House Of Tudor
    # Elizabeth I - CHILD_OF -> Henry Viii
    # and more...
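The neighborhood lookup performed by the Cypher template can be pictured with a small in-memory sketch. The triples below are toy data and neighborhood is a hypothetical helper; the point is only the logic: follow relationships in both directions, skip MENTIONS links to source documents, and cap the number of results.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (source id, relationship type, target id)


def neighborhood(triples: List[Triple], entity: str, limit: int = 50) -> List[str]:
    """Return the direct neighborhood of an entity as readable lines."""
    lines = []
    for source, rel_type, target in triples:
        if rel_type == "MENTIONS":
            continue  # links to source chunks are not part of the entity graph
        if entity in (source, target):
            lines.append(f"{source} - {rel_type} -> {target}")
    return lines[:limit]


# Toy graph for illustration only
triples = [
    ("Elizabeth I", "MEMBER_OF", "House Of Tudor"),
    ("Elizabeth I", "CHILD_OF", "Henry Viii"),
    ("doc-1", "MENTIONS", "Elizabeth I"),
    ("Anne Boleyn", "PARENT_OF", "Elizabeth I"),
    ("Henry Viii", "MEMBER_OF", "House Of Tudor"),
]
for line in neighborhood(triples, "Elizabeth I"):
    print(line)
```

Only the three relationships touching Elizabeth I are printed; the MENTIONS link and the unrelated Henry Viii membership are left out.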


Final Retriever

As mentioned at the start, we'll combine the unstructured and graph retrievers to create the final context passed to an LLM.


    def retriever(question: str):
        print(f"Search query: {question}")
        structured_data = structured_retriever(question)
        unstructured_data = [el.page_content for el in vector_index.similarity_search(question)]
        final_data = f"""Structured data:
    {structured_data}
    Unstructured data:
    {"#Document ". join(unstructured_data)}
        """
        return final_data


As we are dealing with Python, we can simply concatenate the outputs using an f-string.

Defining the RAG Chain

We have successfully implemented the retrieval component of the RAG. Next, we introduce a prompt that leverages the context provided by the integrated hybrid retriever to produce the response, completing the implementation of the RAG chain.


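    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.runnables import RunnableParallel, RunnablePassthrough

    template = """Answer the question based only on the following context:
    {context}

    Question: {question}
    """
    prompt = ChatPromptTemplate.from_template(template)

    # _search_query is the query-rewriting step; its definition is not shown in this post
    chain = (
        RunnableParallel(
            {
                "context": _search_query | retriever,
                "question": RunnablePassthrough(),
            }
        )
        | prompt
        | llm
        | StrOutputParser()
    )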


Finally, we can go ahead and test our hybrid RAG implementation.


    chain.invoke({"question": "Which house did Elizabeth I belong to?"})
    # Search query: Which house did Elizabeth I belong to?
    # 'Elizabeth I belonged to the House of Tudor.'


I've also added a query rewriting feature, which lets the RAG chain adapt to conversational settings that allow follow-up questions. Given that we use vector and keyword search methods, we must rewrite follow-up questions to optimize our search process.


    chain.invoke(
        {
            "question": "When was she born?",
            "chat_history": [("Which house did Elizabeth I belong to?", "House Of Tudor")],
        }
    )
    # Search query: When was Elizabeth I born?
    # 'Elizabeth I was born on 7 September 1533.'


You can observe that "When was she born?" was first rewritten to "When was Elizabeth I born?". The rewritten query was then used to retrieve relevant context and answer the question.
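The post does not show how the rewriting step is implemented, so the sketch below only illustrates the branching idea under stated assumptions: with no chat history the question passes through unchanged, otherwise the history and the follow-up are handed to a rewriter. Here search_query, format_history, and toy_rewrite are hypothetical names, and toy_rewrite is a trivial stand-in for what would be an LLM call in practice.

```python
from typing import Callable, Dict, List, Tuple

ChatHistory = List[Tuple[str, str]]  # (question, answer) pairs


def format_history(history: ChatHistory) -> str:
    """Flatten the chat history into plain text for the rewriter."""
    return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in history)


def search_query(payload: Dict, rewrite: Callable[[str, str], str]) -> str:
    """Return a standalone search query for the current question."""
    history = payload.get("chat_history") or []
    if not history:
        return payload["question"]  # nothing to resolve, use the question as-is
    return rewrite(format_history(history), payload["question"])


def toy_rewrite(history_text: str, question: str) -> str:
    """Stand-in for an LLM call: resolves 'she' to a fixed name, for illustration only."""
    return question.replace("she", "Elizabeth I")


follow_up = {
    "question": "When was she born?",
    "chat_history": [("Which house did Elizabeth I belong to?", "House Of Tudor")],
}
print(search_query(follow_up, toy_rewrite))  # When was Elizabeth I born?
```

Rewriting before retrieval matters because a keyword index cannot match "she" to any entity, whereas the standalone question contains the name to search for.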

Knowledge Graphs Made Easy

With the introduction of the LLMGraphTransformer, the process of generating knowledge graphs should now be smoother and more accessible, making it easier for anyone looking to enhance their RAG applications with the depth and context that knowledge graphs provide. This is just a start, as we have a lot of improvements planned.


If you have insights, suggestions, or questions regarding generating graphs with LLMs, please don't hesitate to reach out.


The code is available on GitHub.