Building Knowledge Graphs for RAG: Exploring GraphRAG with Neo4j and LangChain

by Neo4j | 32m | 2024/10/22

Too Long; Didn't Read

This article explores the implementation of a "From Local to Global" GraphRAG pipeline using Neo4j and LangChain. It covers the process of constructing knowledge graphs from text, summarizing communities of entities using Large Language Models (LLMs), and improving Retrieval-Augmented Generation (RAG) accuracy by combining graph algorithms with LLM-based summarization. The approach condenses information from multiple sources into structured graphs and generates natural language summaries, offering an efficient method for complex information retrieval.


I am always intrigued by new approaches to implementing Retrieval-Augmented Generation (RAG) over graphs, often called GraphRAG. However, it seems that everyone has a different implementation in mind when they hear the term GraphRAG. In this blog post, we will dive deep into the "From Local to Global GraphRAG" article and implementation by Microsoft researchers. We will cover the knowledge graph construction and summarization part and leave the retrievers for the next blog post. The researchers were kind enough to provide the code repository, and they have a project page as well.


The approach taken in the article mentioned above is quite interesting. As far as I understand, it involves using a knowledge graph as a step in the pipeline for condensing and combining information from multiple sources. Extracting entities and relationships from text is nothing new. However, the authors introduce a novel (at least to me) idea of summarizing the condensed graph structure and information back as natural language text. The pipeline begins with input text from documents, which is processed to generate a graph. The graph is then converted back into natural language text, where the generated text contains condensed information about specific entities or graph communities previously spread across multiple documents.


High-level indexing pipeline as implemented in the GraphRAG paper by Microsoft - Image by author


At a very high level, the input to the GraphRAG pipeline is a set of source documents containing various information. The documents are processed using an LLM to extract structured information about the entities appearing in them, along with their relationships. This extracted structured information is then used to construct a knowledge graph.


The advantage of using a knowledge graph data representation is that it can quickly and straightforwardly combine information about particular entities from multiple documents or data sources. As mentioned, the knowledge graph is not the only data representation, though. After the knowledge graph has been constructed, the authors use a combination of graph algorithms and LLM prompting to generate natural language summaries of the communities of entities found in the knowledge graph.


These summaries then contain condensed information spread across multiple data sources and documents for particular entities and communities.


For a more detailed understanding of the pipeline, we can refer to the step-by-step description provided in the original paper.


Steps in the pipeline - Image from the GraphRAG paper, licensed under CC BY 4.0


The following is a high-level summary of the pipeline that we will use to reproduce their approach using Neo4j and LangChain.

Indexing - Graph Generation

  • Source Documents to Text Chunks: Source documents are split into smaller text chunks for processing.
  • Text Chunks to Element Instances: Each text chunk is analyzed to extract entities and relationships, producing a list of tuples representing these elements.
  • Element Instances to Element Summaries: Extracted entities and relationships are summarized by the LLM into descriptive text blocks for each element.
  • Element Summaries to Graph Communities: These entity summaries form a graph, which is then partitioned into communities using algorithms such as Leiden to obtain a hierarchical structure.
  • Graph Communities to Community Summaries: Summaries of each community are generated with the LLM to capture the dataset's global topical structure and semantics.

Retrieval - Answering

  • Community Summaries to Global Answers: Community summaries are used to answer a user query by generating intermediate answers, which are then aggregated into a final global answer.


Note that my implementation was done before their code was available, so there might be slight differences in the underlying approach or the LLM prompts being used. I'll try to explain those differences as we go along.


The code is available on GitHub.

Setting Up the Neo4j Environment

We will use Neo4j as the underlying graph store. The easiest way to get started is to use a free instance of Neo4j Sandbox, which offers cloud instances of the Neo4j database with the Graph Data Science plugin installed. Alternatively, you can set up a local instance of the Neo4j database by downloading the Neo4j Desktop application and creating a local database instance. If you are using a local version, make sure to install both the APOC and GDS plugins. For production setups, you can use the paid, managed AuraDS (Data Science) instance, which provides the GDS plugin.


We start by creating a Neo4jGraph instance, which is the convenience wrapper we added to LangChain:


    import os

    from langchain_community.graphs import Neo4jGraph

    # Connection details for the Neo4j instance (replace with your own)
    os.environ["NEO4J_URI"] = "bolt://44.202.208.177:7687"
    os.environ["NEO4J_USERNAME"] = "neo4j"
    os.environ["NEO4J_PASSWORD"] = "mast-codes-trails"

    graph = Neo4jGraph(refresh_schema=False)

Dataset

We will use a news article dataset I created some time ago using Diffbot's API. I have uploaded it to my GitHub for easier reuse:


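    import pandas as pd

    news = pd.read_csv(
        "https://raw.githubusercontent.com/tomasonjo/blog-datasets/main/news_articles.csv"
    )
    # num_tokens_from_string is a token-counting helper (based on tiktoken,
    # per the text below); its definition is not shown in this post
    news["tokens"] = [
        num_tokens_from_string(f"{row['title']} {row['text']}")
        for i, row in news.iterrows()
    ]
    news.head()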


Let's examine the first couple of rows from the dataset.


Sample rows from the dataset


We have the title and text of the articles available, along with their publishing date and a token count computed with the tiktoken library.

Text Chunking

The text chunking step is crucial and significantly impacts downstream results. The paper's authors found that using smaller text chunks results in extracting more entities overall.


Number of extracted entities given the size of text chunks - Image from the GraphRAG paper, licensed under CC BY 4.0


As you can see, using text chunks of 2,400 tokens results in fewer extracted entities than when they used 600 tokens. Additionally, they identified that LLMs might not extract all entities on the first run. In that case, they introduce a heuristic to perform the extraction multiple times. We will talk about that more in the next section.


However, there are always trade-offs. Using smaller text chunks can result in losing the context and coreferences of specific entities spread across the documents. For example, if a document mentions "John" and "he" in separate sentences, breaking the text into smaller chunks might make it unclear that "he" refers to John. Some of the coreference issues can be solved using an overlapping text chunking strategy, but not all of them.


Let's examine the size of our article texts:


    import matplotlib.pyplot as plt
    import seaborn as sns

    # Histogram of token counts per article
    sns.histplot(news["tokens"], kde=False)
    plt.title('Distribution of chunk sizes')
    plt.xlabel('Token count')
    plt.ylabel('Frequency')
    plt.show()



The distribution of article token counts is approximately normal, with a peak at around 400 tokens. The frequency of chunks gradually increases up to this peak and then decreases symmetrically, indicating that most text chunks are near the 400-token mark.


Due to this distribution, we will not perform any text chunking here, to avoid coreference issues. By default, the GraphRAG project uses chunk sizes of 300 tokens with 100 tokens of overlap.
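
Although we skip chunking here, the 300-token window with a 100-token overlap mentioned above can be pictured with a minimal sketch. It assumes the text has already been turned into a list of tokens; the function name chunk_tokens and the toy token list are illustrative only and are not taken from the GraphRAG code base.

```python
from typing import List


def chunk_tokens(
    tokens: List[str], chunk_size: int = 300, overlap: int = 100
) -> List[List[str]]:
    """Split a token list into windows of chunk_size that share overlap tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Placeholder tokens; a real pipeline would use a tokenizer such as tiktoken.
tokens = [f"t{i}" for i in range(700)]
chunks = chunk_tokens(tokens)
```

Each chunk repeats the last 100 tokens of the previous one, which is what gives a pronoun like "he" a chance to appear in the same window as "John".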

Extracting Nodes and Relationships

The next step is constructing knowledge from the text chunks. For this use case, we use an LLM to extract structured information in the form of nodes and relationships from the text. You can examine the LLM prompt the authors used in the paper. Their prompts allow node labels to be predefined if needed, but by default that is optional. Additionally, the extracted relationships in the original implementation don't really have a type, only a description. I imagine the reason behind this choice is to allow the LLM to extract and retain richer and more nuanced information as relationships. But it is difficult to have a clean knowledge graph with no relationship-type specifications (the descriptions could go into a property).


In our implementation, we will use the LLMGraphTransformer, which is available in the LangChain library. Instead of using pure prompt engineering, as the implementation in the paper does, the LLMGraphTransformer uses the built-in function calling support to extract structured information (structured output LLMs in LangChain). You can inspect the system prompt:


    from typing import List

    from langchain_community.graphs.graph_document import GraphDocument
    from langchain_core.documents import Document
    from langchain_experimental.graph_transformers import LLMGraphTransformer
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(temperature=0, model_name="gpt-4o")

    llm_transformer = LLMGraphTransformer(
        llm=llm,
        node_properties=["description"],
        relationship_properties=["description"]
    )

    def process_text(text: str) -> List[GraphDocument]:
        # Wrap the raw text in a Document and extract nodes and relationships
        doc = Document(page_content=text)
        return llm_transformer.convert_to_graph_documents([doc])


In this example, we use GPT-4o for graph extraction. The authors specifically instruct the LLM to extract entities and relationships and their descriptions. With the LangChain implementation, you can use the node_properties and relationship_properties attributes to specify which node or relationship properties you want the LLM to extract.


The difference with the LLMGraphTransformer implementation is that all node and relationship properties are optional, so not all nodes will have the description property. If we wanted, we could define a custom extraction with a mandatory description property, but we will skip that in this implementation.


We will parallelize the requests to make the graph extraction faster and store the results to Neo4j:


    from concurrent.futures import ThreadPoolExecutor, as_completed

    from tqdm import tqdm

    MAX_WORKERS = 10
    NUM_ARTICLES = 2000
    graph_documents = []

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        # Submitting all tasks and creating a list of future objects
        futures = [
            executor.submit(process_text, f"{row['title']} {row['text']}")
            for i, row in news.head(NUM_ARTICLES).iterrows()
        ]

        for future in tqdm(
            as_completed(futures), total=len(futures), desc="Processing documents"
        ):
            graph_document = future.result()
            graph_documents.extend(graph_document)

    # Store the extracted nodes and relationships in Neo4j
    graph.add_graph_documents(
        graph_documents,
        baseEntityLabel=True,
        include_source=True
    )


In this example, we extract graph information from 2,000 articles and store the results to Neo4j. We extracted around 13,000 entities and 16,000 relationships. Here is an example of an extracted document in the graph.


The document (blue) points to the extracted entities and relationships


It takes about 35 (+/- 5) minutes to complete the extraction and costs about $30 with GPT-4o.


In this step, the authors introduce heuristics to decide whether to extract graph information in more than one pass. For simplicity's sake, we will only do one pass. However, if we wanted to do multiple passes, we could pass the first extraction results as conversational history and simply instruct the LLM that many entities are missing and that it should extract more, as the GraphRAG authors do.
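
The control flow of such a multi-pass extraction can be sketched as follows. This is only an illustration under stated assumptions: the extract callable stands in for an LLM call that receives the text plus the entities already found, fake_extract is a toy replacement so the loop can run without an LLM, and stopping when a pass returns nothing new is my own simplification rather than the heuristic used by the GraphRAG authors.

```python
from typing import Callable, List, Set


def extract_with_gleaning(
    text: str,
    extract: Callable[[str, List[str]], Set[str]],
    max_passes: int = 3,
) -> Set[str]:
    """Call extract repeatedly, feeding back what was already found."""
    found: Set[str] = set()
    for _ in range(max_passes):
        # extract is a hypothetical wrapper around an LLM call; the second
        # argument plays the role of the conversation history.
        new_entities = extract(text, sorted(found)) - found
        if not new_entities:
            break  # nothing new came back, so stop early
        found |= new_entities
    return found


def fake_extract(text: str, already_found: List[str]) -> Set[str]:
    """Toy stand-in for the LLM: returns at most one unseen capitalized word."""
    candidates = [w.strip(".,") for w in text.split() if w[0].isupper()]
    for word in candidates:
        if word not in already_found:
            return {word}
    return set()


entities = extract_with_gleaning(
    "Neo4j and LangChain power GraphRAG.", fake_extract, max_passes=5
)
```

The max_passes cap matters in practice because every additional pass is another paid LLM call per text chunk.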


Previously, I mentioned how vital the text chunk size is and how it affects the number of entities extracted. Since we didn't perform any additional text chunking, we can evaluate the distribution of extracted entities based on text chunk size:


    entity_dist = graph.query(
        """
    MATCH (d:Document)
    RETURN d.text AS text,
           count {(d)-[:MENTIONS]->()} AS entity_count
    """
    )
    entity_dist_df = pd.DataFrame.from_records(entity_dist)
    entity_dist_df["token_count"] = [
        num_tokens_from_string(str(el)) for el in entity_dist_df["text"]
    ]

    # Scatter plot with regression line
    sns.lmplot(
        x="token_count",
        y="entity_count",
        data=entity_dist_df,
        line_kws={"color": "red"}
    )
    plt.title("Entity Count vs Token Count Distribution")
    plt.xlabel("Token Count")
    plt.ylabel("Entity Count")
    plt.show()




The scatter plot shows that while there is a positive trend, indicated by the red line, the relationship is sublinear. Most data points cluster at lower entity counts, even as token counts increase. This indicates that the number of entities extracted does not scale proportionally with the size of the text chunks. Although some outliers exist, the general pattern shows that higher token counts do not consistently lead to higher entity counts. This supports the authors' finding that smaller text chunk sizes extract more information.


I also thought it would be interesting to inspect the node degree distribution of the constructed graph. The following code retrieves and visualizes the node degree distribution:


    import numpy as np

    degree_dist = graph.query(
        """
    MATCH (e:__Entity__)
    RETURN count {(e)-[:!MENTIONS]-()} AS node_degree
    """
    )
    degree_dist_df = pd.DataFrame.from_records(degree_dist)

    # Calculate the mean and percentiles (the 50th percentile is the median)
    mean_degree = np.mean(degree_dist_df['node_degree'])
    percentiles = np.percentile(degree_dist_df['node_degree'], [25, 50, 75, 90])

    # Create a histogram with a logarithmic scale
    plt.figure(figsize=(12, 6))
    sns.histplot(degree_dist_df['node_degree'], bins=50, kde=False, color='blue')

    # Use a logarithmic scale for the y-axis
    plt.yscale('log')

    # Adding labels and title
    plt.xlabel('Node Degree')
    plt.ylabel('Count (log scale)')
    plt.title('Node Degree Distribution')

    # Add mean, median, and percentile lines
    plt.axvline(mean_degree, color='red', linestyle='dashed', linewidth=1, label=f'Mean: {mean_degree:.2f}')
    plt.axvline(percentiles[0], color='purple', linestyle='dashed', linewidth=1, label=f'25th Percentile: {percentiles[0]:.2f}')
    plt.axvline(percentiles[1], color='orange', linestyle='dashed', linewidth=1, label=f'50th Percentile: {percentiles[1]:.2f}')
    plt.axvline(percentiles[2], color='yellow', linestyle='dashed', linewidth=1, label=f'75th Percentile: {percentiles[2]:.2f}')
    plt.axvline(percentiles[3], color='brown', linestyle='dashed', linewidth=1, label=f'90th Percentile: {percentiles[3]:.2f}')

    # Add legend
    plt.legend()

    # Show the plot
    plt.show()



The node degree distribution follows a power-law pattern, indicating that most nodes have very few connections while a few nodes are highly connected. The mean degree is 2.45 and the median is 1.00, indicating that at least half of the nodes have at most one connection. Most nodes (75 percent) have two or fewer connections, and 90 percent have five or fewer. This distribution is typical of many real-world networks, where a small number of hubs have many connections and most nodes have few.
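
To make the notion of node degree concrete, here is a small standard-library sketch that computes the same kind of statistics from an edge list. The edges and entity names are made up for illustration; unlike the Cypher query, this sketch only sees nodes that have at least one non-MENTIONS relationship, so isolated entities with degree zero are not counted.

```python
from collections import Counter
from statistics import mean, median

# Toy edge list of (source, relationship type, target); all names are made up.
edges = [
    ("Doc1", "MENTIONS", "Apple"),
    ("Apple", "COMPETES_WITH", "Google"),
    ("Apple", "HEADQUARTERED_IN", "Cupertino"),
    ("Apple", "LED_BY", "Tim Cook"),
    ("Google", "OWNED_BY", "Alphabet"),
]

degree = Counter()
for source, rel_type, target in edges:
    if rel_type == "MENTIONS":
        continue  # skip document-to-entity links, as the Cypher pattern does
    degree[source] += 1  # direction is ignored: both endpoints gain a connection
    degree[target] += 1

degrees = sorted(degree.values())
mean_degree = mean(degrees)
median_degree = median(degrees)
```

Even in this tiny example a single hub (Apple) pulls the mean above the median, which is the same skew visible in the histogram.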


Since neither node nor relationship descriptions are mandatory properties, we will also examine how many were extracted:


    graph.query("""
    MATCH (n:`__Entity__`)
    RETURN "node" AS type,
           count(*) AS total_count,
           count(n.description) AS non_null_descriptions
    UNION ALL
    MATCH (n)-[r:!MENTIONS]->()
    RETURN "relationship" AS type,
           count(*) AS total_count,
           count(r.description) AS non_null_descriptions
    """)



The results show that 5,926 nodes out of 12,994 (45.6 percent) have the description property. On the other hand, only 5,569 relationships out of 15,921 (35 percent) have such a property.


Note that due to the probabilistic nature of LLMs, the numbers can vary across different runs and with different source data, LLMs, and prompts.

Entity Resolution

Entity resolution (de-duplication) is crucial when constructing knowledge graphs because it ensures that each entity is uniquely and accurately represented, preventing duplicates and merging records that refer to the same real-world entity. This process is essential for maintaining data integrity and consistency within the graph. Without entity resolution, knowledge graphs would suffer from fragmented and inconsistent data, leading to errors and unreliable insights.


Potential entity duplicates


This image demonstrates how a single real-world entity might appear under slightly different names in different documents and, consequently, in our graph.


Moreover, sparse data becomes a significant issue without entity resolution. Incomplete or partial data from various sources can result in scattered and disconnected pieces of information, making it difficult to form a coherent and comprehensive understanding of entities. Accurate entity resolution addresses this by consolidating data, filling in gaps, and creating a unified view of each entity.


Before/after using Senzing entity resolution to connect the International Consortium of Investigative Journalists (ICIJ) offshore leaks data - Image from Paco Nathan


The left part of the visualization presents a sparse and unconnected graph. However, as shown on the right side, such a graph can become well connected with efficient entity resolution.


Overall, entity resolution enhances the efficiency of data retrieval and integration, providing a cohesive view of information across different sources. It ultimately enables more effective question answering based on a reliable and complete knowledge graph.


Unfortunately, the authors of the GraphRAG paper did not include any entity resolution code in their repo, although they mention it in their paper. One reason for leaving this code out could be that it is tough to implement a robust and well-performing entity resolution for any given domain. You can implement custom heuristics for different nodes when dealing with predefined node types (when they aren't predefined, they aren't consistent enough: company, organization, business, and so on). However, if the node labels or types aren't known in advance, as in our case, this becomes an even harder problem. Nonetheless, we will implement a version of entity resolution in our project here, combining text embeddings and graph algorithms with word distance and LLMs.


Entity resolution flow


Our process for entity resolution involves the following steps:


  1. Entities in the graph - Start with all entities within the graph.
  2. K-nearest graph - Construct a k-nearest neighbor graph, connecting similar entities based on text embeddings.
  3. Weakly Connected Components - Identify weakly connected components in the k-nearest graph, grouping entities that are likely to be similar. Add a word distance filtering step after these components have been identified.
  4. LLM evaluation - Use an LLM to evaluate these components and decide whether the entities within each component should be merged, resulting in a final decision on entity resolution (for example, merging 'Silicon Valley Bank' and 'Silicon_Valley_Bank' while rejecting the merge of different dates such as 'September 16, 2023' and 'September 2, 2023').


We begin by calculating text embeddings for the name and description properties of the entities. We can use the from_existing_graph method of the Neo4jVector integration in LangChain to achieve this:


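    from langchain_community.vectorstores import Neo4jVector
    from langchain_openai import OpenAIEmbeddings

    # Embed the id and description of every __Entity__ node and
    # store the vector in the node's embedding property
    vector = Neo4jVector.from_existing_graph(
        OpenAIEmbeddings(),
        node_label='__Entity__',
        text_node_properties=['id', 'description'],
        embedding_node_property='embedding'
    )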


We can use these embeddings to find potential candidates that are similar based on the cosine distance between them. We will use graph algorithms available in the Graph Data Science (GDS) library; therefore, we can use the GDS Python client for ease of use in a Pythonic way:


    from graphdatascience import GraphDataScience

    gds = GraphDataScience(
        os.environ["NEO4J_URI"],
        auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"])
    )


If you are not familiar with the GDS library, note that we first have to project an in-memory graph before we can execute any graph algorithms.


Graph Data Science algorithm execution workflow


First, the graph stored in Neo4j is projected into an in-memory graph for faster processing and analysis. Next, a graph algorithm is executed on the in-memory graph. Optionally, the algorithm's results can be stored back into the Neo4j database. Learn more about it in the documentation.


To create the k-nearest neighbor graph, we will project all entities along with their text embeddings:


    G, result = gds.graph.project(
        "entities",                   # Graph name
        "__Entity__",                 # Node projection
        "*",                          # Relationship projection
        nodeProperties=["embedding"]  # Configuration parameters
    )


Now that the graph is projected under the name entities, we can execute graph algorithms. We will begin by constructing a k-nearest graph. The two most important parameters influencing how sparse or dense the k-nearest graph will be are similarityCutoff and topK. The topK is the number of neighbors to find for each node, with a minimum value of 1. The similarityCutoff filters out relationships with a similarity below this threshold. Here, we will use the default topK of 10 and a relatively high similarity cutoff of 0.95. Using a high similarity cutoff, such as 0.95, ensures that only highly similar pairs are considered matches, minimizing false positives and improving accuracy.


Constructing the k-nearest graph and storing new relationships in the projected graph


Since we want to store the results back to the projected in-memory graph instead of the knowledge graph, we will use the mutate mode of the algorithm:


    similarity_threshold = 0.95

    gds.knn.mutate(
        G,
        nodeProperties=['embedding'],
        mutateRelationshipType='SIMILAR',
        mutateProperty='score',
        similarityCutoff=similarity_threshold
    )
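
To build some intuition for how topK and similarityCutoff shape the resulting graph, here is a brute-force sketch in plain Python. It is not how GDS computes k-nearest neighbors internally; the function names, the exact comparison of every pair, and the two-dimensional toy vectors are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def knn_pairs(
    embeddings: Dict[str, List[float]],
    top_k: int = 10,
    similarity_cutoff: float = 0.95,
) -> List[Tuple[str, str, float]]:
    """Return (node, neighbor, score) for each node's top_k neighbors at or above the cutoff."""
    pairs = []
    for name, vec in embeddings.items():
        scored = [
            (other, cosine(vec, other_vec))
            for other, other_vec in embeddings.items()
            if other != name
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        for other, score in scored[:top_k]:
            if score >= similarity_cutoff:
                pairs.append((name, other, score))
    return pairs


# Made-up 2-dimensional vectors; real embeddings have far more dimensions.
toy = {
    "Unreal Engine": [1.0, 0.0],
    "Unreal_Engine": [0.99, 0.05],
    "Bengaluru": [0.0, 1.0],
}
similar = knn_pairs(toy, top_k=2, similarity_cutoff=0.95)
```

With these toy vectors only the two spellings of Unreal Engine end up linked to each other, while Bengaluru stays isolated because none of its neighbors clears the cutoff.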


The next step is to identify groups of entities that are connected by the newly inferred similarity relationships. Identifying groups of connected nodes is a frequent task in network analysis, often called community detection or clustering, which involves finding subgroups of densely connected nodes. In this example, we will use the Weakly Connected Components algorithm, which helps us find parts of a graph where all nodes are connected, even if we ignore the direction of the connections.


Writing the results of WCC back to the database


We use the algorithm's write mode to store the results back to the database (stored graph):


    gds.wcc.write(
        G,
        writeProperty="wcc",
        relationshipTypes=["SIMILAR"]
    )
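
For readers who have not met weakly connected components before, the idea can be sketched with a small union-find structure. This is an illustrative stand-in rather than the GDS implementation, and the function name and sample entities are chosen just for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def weakly_connected_components(
    nodes: Iterable[str], edges: Iterable[Tuple[str, str]]
) -> List[List[str]]:
    """Group nodes into components, ignoring edge direction (union-find)."""
    parent: Dict[str, str] = {node: node for node in nodes}

    def find(node: str) -> str:
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for source, target in edges:
        parent[find(source)] = find(target)  # merge the two components

    groups = defaultdict(list)
    for node in parent:
        groups[find(node)].append(node)
    return sorted(sorted(group) for group in groups.values())


components = weakly_connected_components(
    ["Jp Morgan", "Jpmorgan", "Brighton", "Brixton", "Google"],
    [("Jp Morgan", "Jpmorgan"), ("Brixton", "Brighton")],
)
```

Every node that shares a component ends up with the same group, which plays the role of the wcc property written to the database above.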


Text embedding comparison helps find potential duplicates, but it is only part of the entity resolution process. For example, Google and Apple are very close in the embedding space (0.96 cosine similarity using the ada-002 embedding model). The same goes for BMW and Mercedes Benz (0.97 cosine similarity). High text embedding similarity is a good start, but we can improve on it. Therefore, we will add an additional filter that allows only pairs of words with a text distance of three or fewer (meaning that at most three characters can differ). Note that the query below compares the distance with a strict less-than against the distance parameter, so with word_edit_distance = 3 it effectively admits distances of up to two, plus any pair where one name contains the other:


    word_edit_distance = 3
    potential_duplicate_candidates = graph.query(
        """MATCH (e:`__Entity__`)
        WHERE size(e.id) > 3 // longer than 3 characters
        WITH e.wcc AS community, collect(e) AS nodes, count(*) AS count
        WHERE count > 1
        UNWIND nodes AS node
        // Add text distance
        WITH distinct
          [n IN nodes WHERE apoc.text.distance(toLower(node.id), toLower(n.id)) < $distance
                      OR node.id CONTAINS n.id | n.id] AS intermediate_results
        WHERE size(intermediate_results) > 1
        WITH collect(intermediate_results) AS results
        // combine groups together if they share elements
        UNWIND range(0, size(results)-1, 1) as index
        WITH results, index, results[index] as result
        WITH apoc.coll.sort(reduce(acc = result, index2 IN range(0, size(results)-1, 1) |
                CASE WHEN index <> index2 AND
                    size(apoc.coll.intersection(acc, results[index2])) > 0
                    THEN apoc.coll.union(acc, results[index2])
                    ELSE acc
                END
        )) as combinedResult
        WITH distinct(combinedResult) as combinedResult
        // extra filtering
        WITH collect(combinedResult) as allCombinedResults
        UNWIND range(0, size(allCombinedResults)-1, 1) as combinedResultIndex
        WITH allCombinedResults[combinedResultIndex] as combinedResult, combinedResultIndex, allCombinedResults
        WHERE NOT any(x IN range(0,size(allCombinedResults)-1,1)
            WHERE x <> combinedResultIndex
            AND apoc.coll.containsAll(allCombinedResults[x], combinedResult)
        )
        RETURN combinedResult
        """, params={'distance': word_edit_distance})


This Cypher statement is slightly more involved, and its interpretation is beyond the scope of this blog post. You can always ask an LLM to interpret it.


Anthropic Claude Sonnet 3.5 - Explaining the duplicate entity determination statement


Additionally, the word distance cutoff could be a function of the length of the word instead of a single number, and the implementation could be more scalable.
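
As a rough illustration of both ideas, the sketch below implements a plain Levenshtein edit distance (the measure I understand apoc.text.distance to compute) together with a length-dependent cutoff. The max_distance_for rule of roughly one edit per five characters and the is_candidate helper are hypothetical choices for this example and are not part of the query above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution
                )
            )
        previous = current
    return previous[-1]


def max_distance_for(name: str, ratio: float = 0.2, floor: int = 1) -> int:
    """Hypothetical rule: allow roughly one edit per five characters."""
    return max(floor, int(len(name) * ratio))


def is_candidate(a: str, b: str) -> bool:
    """Keep a pair only if its distance fits the cutoff of the shorter name."""
    a, b = a.lower(), b.lower()
    limit = max_distance_for(min(a, b, key=len))
    return edit_distance(a, b) <= limit
```

Under this rule, is_candidate("Unreal Engine", "Unreal_Engine") passes while is_candidate("Brighton", "Brixton") does not, because the shorter name only allows a single edit. A pair like "New York Jets" and "New York Mets" still passes, which is why a text-distance filter alone cannot make the final call.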


What is important is that it outputs groups of potential entities we might want to merge. Here is a list of potential nodes to merge:


    {'combinedResult': ['Sinn Fein', 'Sinn Féin']},
    {'combinedResult': ['Government', 'Governments']},
    {'combinedResult': ['Unreal Engine', 'Unreal_Engine']},
    {'combinedResult': ['March 2016', 'March 2020', 'March 2022', 'March_2023']},
    {'combinedResult': ['Humana Inc', 'Humana Inc.']},
    {'combinedResult': ['New York Jets', 'New York Mets']},
    {'combinedResult': ['Asia Pacific', 'Asia-Pacific', 'Asia_Pacific']},
    {'combinedResult': ['Bengaluru', 'Mangaluru']},
    {'combinedResult': ['US Securities And Exchange Commission', 'Us Securities And Exchange Commission']},
    {'combinedResult': ['Jp Morgan', 'Jpmorgan']},
    {'combinedResult': ['Brighton', 'Brixton']},


As you can see, our resolution approach works better for some node types than others. Based on a quick examination, it seems to work better for people and organizations, while it is pretty bad for dates. If we used predefined node types, we could prepare different heuristics for the various node types. In this example, we do not have predefined node labels, so we will turn to an LLM to make the final decision about whether entities should be merged or not.


First, we need to formulate the LLM prompt to effectively guide and inform the final decision regarding the merging of the nodes:


    system_prompt = """You are a data processing assistant. Your task is to identify duplicate entities in a list and decide which of them should be merged.
    The entities might be slightly different in format or content, but essentially refer to the same thing. Use your analytical skills to determine duplicates.

    Here are the rules for identifying duplicates:
    1. Entities with minor typographical differences should be considered duplicates.
    2. Entities with different formats but the same content should be considered duplicates.
    3. Entities that refer to the same real-world object or concept, even if described differently, should be considered duplicates.
    4. If it refers to different numbers, dates, or products, do not merge results
    """
    user_template = """
    Here is the list of entities to process:
    {entities}

    Please identify duplicates, merge them, and provide the merged list.
    """


I always like to use the with_structured_output method in LangChain when expecting structured data output, to avoid having to parse the outputs manually.


Here, we will define the output as a list of lists, where each inner list contains the entities that should be merged. This structure is used to handle scenarios where, for example, the input might be [Sony, Sony Inc, Google, Google Inc]. In such cases, you would want to merge "Sony" and "Sony Inc" separately from "Google" and "Google Inc."


    from typing import List, Optional

    # Older LangChain releases expose these via langchain_core.pydantic_v1
    from pydantic import BaseModel, Field

    class DuplicateEntities(BaseModel):
        entities: List[str] = Field(
            description="Entities that represent the same object or real-world entity and should be merged"
        )

    class Disambiguate(BaseModel):
        merge_entities: Optional[List[DuplicateEntities]] = Field(
            description="Lists of entities that represent the same object or real-world entity and should be merged"
        )

    extraction_llm = ChatOpenAI(model_name="gpt-4o").with_structured_output(
        Disambiguate
    )


Next, we integrate the LLM prompt with the structured output to create a chain using the LangChain Expression Language (LCEL) syntax and wrap it in an entity_resolution function.


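    from langchain_core.prompts import ChatPromptTemplate

    # extraction_prompt is not defined in the snippets above; this assumes it
    # is built from system_prompt and user_template
    extraction_prompt = ChatPromptTemplate.from_messages(
        [("system", system_prompt), ("human", user_template)]
    )

    extraction_chain = extraction_prompt | extraction_llm

    def entity_resolution(entities: List[str]) -> Optional[List[List[str]]]:
        result = extraction_chain.invoke({"entities": entities})
        # merge_entities is Optional, so fall back to an empty list when it is None
        return [el.entities for el in (result.merge_entities or [])]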


We need to run all potential candidate nodes through the entity_resolution function to decide whether they should be merged. To speed up the process, we will again parallelize the LLM calls:


    merged_entities = []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        # Submitting all tasks and creating a list of future objects
        futures = [
            executor.submit(entity_resolution, el['combinedResult'])
            for el in potential_duplicate_candidates
        ]

        for future in tqdm(
            as_completed(futures), total=len(futures), desc="Processing documents"
        ):
            to_merge = future.result()
            if to_merge:
                merged_entities.extend(to_merge)


The final step of entity resolution involves taking the results from the entity_resolution LLM and writing them back to the database by merging the specified nodes:


    graph.query("""
    UNWIND $data AS candidates
    CALL {
      WITH candidates
      MATCH (e:__Entity__) WHERE e.id IN candidates
      RETURN collect(e) AS nodes
    }
    CALL apoc.refactor.mergeNodes(nodes, {properties: {
        description:'combine',
        `.*`: 'discard'
    }})
    YIELD node
    RETURN count(*)
    """, params={"data": merged_entities})


This entity resolution is not perfect, but it gives us a starting point that we can improve upon. Additionally, we could improve the logic for determining which entities should be retained.
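
One possible way to make that choice explicit is to order each merge group so that a preferred name comes first before handing it to the merge step. The sketch below is purely hypothetical: the pick_canonical function, the preference rule (most mentions, then no underscores, then alphabetical order), and the mention counts are all assumptions, and the merge query above would also need to preserve this order for it to have any effect.

```python
from typing import Dict, List


def pick_canonical(group: List[str], mention_counts: Dict[str, int]) -> List[str]:
    """Reorder a merge group so the preferred name comes first.

    Hypothetical preference: most mentions, then names without underscores,
    then alphabetical order as a deterministic tie-breaker.
    """
    return sorted(
        group,
        key=lambda name: (-mention_counts.get(name, 0), "_" in name, name),
    )


# Made-up mention counts for one of the candidate groups listed earlier.
ordered = pick_canonical(
    ["Asia_Pacific", "Asia-Pacific", "Asia Pacific"],
    {"Asia Pacific": 12, "Asia-Pacific": 12, "Asia_Pacific": 3},
)
```

Here the first element of ordered is the name that would survive the merge, and the underscore variant introduced by extraction noise is ranked last.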

Element Summarization

Munhanho inotevera, vanyori vanoita nhanho yekupfupisa element. Chaizvoizvo, node yega yega uye hukama hunopfuudzwa kuburikidza nekupfupisa kwesangano kukurumidza . Vanyori vanocherechedza hutsva uye kufarira kwemaitiro avo:


"Pakazere, mashandisiro atinoita manyoro anotsanangura manyoro ehomogeneous node mune inogona kuita ruzha girafu chimiro chakaenderana nehunyanzvi hweLLMs uye zvinodiwa zvepasirese, muchidimbu-yakatarisana nemubvunzo. Unhu uhwu hunosiyanisawo indekisi yedu yegirafu kubva kune yakajairwa ruzivo magirafu, ayo anovimba neruzivo rwakapfupika uye rwunoenderana rutatu (chidzidzo, chirevo, chinhu) pamabasa ekufunga ari pasi.


Pfungwa yacho inofadza. Tichiri kutora maID ezvidzidzo kana mazita kubva muzvinyorwa, izvo zvinotibvumira kubatanidza hukama kugadzirisa masangano, kunyangwe kana masangano achionekwa pane akawanda mavara chunks. Zvisinei, hukama hahuna kuderedzwa kuva rudzi rumwe chete. Pane kudaro, mhando yehukama ichokwadi chinyorwa chemahara chinotitendera kuchengetedza ruzivo rwakapfuma uye rwakanyanya.


Pamusoro pezvo, ruzivo rwesangano rwunopfupikiswa pachishandiswa LLM, zvichitibvumira kudzvanya uye kuratidza ruzivo urwu uye masangano zvine hungwaru kuti titore chaizvo.


One could argue that this richer and more nuanced information could also be retained by adding additional, possibly arbitrary, node and relationship properties. One issue with arbitrary node and relationship properties is that the information can be hard to extract consistently, because the LLM might use different property names or focus on different details on every execution.


Some of these problems could be solved by using predefined property names with additional type and description information. In that case, you would need a subject-matter expert to help define those properties, which leaves little room for the LLM to extract any vital information outside the predefined descriptions.


It's an exciting approach to representing richer information in a knowledge graph.


One potential issue with the element summarization step is that it does not scale well, since it requires an LLM call for every entity and relationship in the graph. Our graph is relatively small, with 13,000 nodes and 16,000 relationships. Even for such a small graph, we would need 29,000 LLM calls, and each call would use a couple hundred tokens, making it quite expensive and time-consuming. Therefore, we will skip this step here. We can still use the description properties extracted during the initial text processing.
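To make the scaling concern concrete, here is a back-of-the-envelope sketch. The node and relationship counts come from the text above; the tokens-per-call figure is an assumed placeholder for "a couple hundred tokens", not a measured value.

```python
# Rough cost estimate for element summarization.
# Counts are taken from the article; tokens_per_call is an assumption.
node_count = 13_000
relationship_count = 16_000
tokens_per_call = 300  # assumed stand-in for "a couple hundred tokens"

# One LLM call per node and per relationship
llm_calls = node_count + relationship_count
total_tokens = llm_calls * tokens_per_call

print(f"LLM calls: {llm_calls:,}")
print(f"Approximate tokens: {total_tokens:,}")
```

Under these assumptions the step would take 29,000 calls, so the total grows linearly with the size of the graph.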

Constructing and Summarizing Communities

The final step in the graph construction and indexing process involves identifying communities within the graph. In this context, a community is a group of nodes that are more densely connected to each other than to the rest of the graph, indicating a higher level of interaction or similarity. The following visualization shows an example of community detection results.


Countries are colored based on the community they belong to


Once these entity communities are identified with a clustering algorithm, an LLM generates a summary for each community, providing insights into their individual characteristics and relationships.


Again, we use the Graph Data Science library. We start by projecting an in-memory graph. To follow the original article closely, we'll project the graph of entities as an undirected weighted network, where the weight represents the number of connections between two entities:


    G, result = gds.graph.project(
        "communities",  # Graph name
        "__Entity__",  # Node projection
        {
            "_ALL_": {
                "type": "*",
                "orientation": "UNDIRECTED",
                "properties": {"weight": {"property": "*", "aggregation": "COUNT"}},
            }
        },
    )
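As a conceptual illustration of what the count-based weight means, the sketch below collapses parallel relationships between the same pair of entities into a single undirected edge whose weight is the number of connections. It is plain Python over hypothetical sample pairs, not a description of how GDS implements the projection internally; the helper name `count_undirected_weights` is made up for this example.

```python
from collections import Counter

def count_undirected_weights(relationships):
    """Collapse parallel relationships into one undirected edge with a count weight."""
    weights = Counter()
    for source, target in relationships:
        # Sort the endpoints so (a, b) and (b, a) map to the same undirected edge
        edge = tuple(sorted((source, target)))
        weights[edge] += 1
    return dict(weights)

# Hypothetical entity pairs; direction and relationship type are ignored
sample_relationships = [
    ("Ryanair Dac", "Irish Aviation Authority"),
    ("Irish Aviation Authority", "Ryanair Dac"),
    ("Ryanair Dac", "Aidan Murray"),
]
edge_weights = count_undirected_weights(sample_relationships)
print(edge_weights)
```

Pairs that are connected by several relationships end up with a higher weight, which the weighted variant of a community detection algorithm can then take into account.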



The authors employed the Leiden algorithm, a hierarchical clustering method, to identify communities within the graph. One advantage of a hierarchical community detection algorithm is the ability to examine communities at multiple levels of granularity. The authors suggest summarizing all communities at each level, which gives a comprehensive understanding of the graph's structure.


First, we'll use the Weakly Connected Components (WCC) algorithm to assess the connectivity of our graph. This algorithm identifies isolated sections of the graph: subsets of nodes, or components, that are connected to each other but not to the rest of the graph. These components help us understand how fragmented the network is and identify groups of nodes that are independent of the others. WCC is vital for analyzing the overall structure and connectivity of the graph.


    wcc = gds.wcc.stats(G)
    print(f"Component count: {wcc['componentCount']}")
    print(f"Component distribution: {wcc['componentDistribution']}")
    # Component count: 1119
    # Component distribution: {
    #   "min":1,
    #   "p5":1,
    #   "max":9109,
    #   "p999":43,
    #   "p99":19,
    #   "p1":1,
    #   "p10":1,
    #   "p90":7,
    #   "p50":2,
    #   "p25":1,
    #   "p75":4,
    #   "p95":10,
    #   "mean":11.3 }


The WCC algorithm identified 1,119 distinct components. Notably, the largest component comprises 9,109 nodes, which is common in real-world networks, where a single super component coexists with numerous smaller isolated components. The smallest component has one node, and the average component size is about 11.3 nodes.
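For readers who want to see the idea behind weakly connected components outside of GDS, here is a minimal union-find sketch in plain Python. The node and edge lists are hypothetical, and the function name is an illustration rather than any library API; GDS runs its own implementation.

```python
from collections import Counter

def weakly_connected_components(nodes, edges):
    """Map every node to a component representative, ignoring edge direction."""
    parent = {node: node for node in nodes}

    def find(node):
        # Walk up to the root, shortening the path along the way
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for source, target in edges:
        root_source, root_target = find(source), find(target)
        if root_source != root_target:
            # Merge the two components
            parent[root_source] = root_target

    return {node: find(node) for node in nodes}

# Hypothetical toy graph with two separate groups of nodes
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("c", "b"), ("d", "e")]
components = weakly_connected_components(nodes, edges)
component_sizes = sorted(Counter(components.values()).values())
print(f"Component count: {len(component_sizes)}")
print(f"Component sizes: {component_sizes}")
```

The component count and the size distribution printed here are the toy equivalents of the `componentCount` and `componentDistribution` statistics shown above.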


Next, we'll run the Leiden algorithm, which is also available in the GDS library, and enable the includeIntermediateCommunities parameter to return and store communities at all levels. We have also included the relationshipWeightProperty parameter to run the weighted variant of the Leiden algorithm. Using the write mode of the algorithm stores the results as a node property.


    gds.leiden.write(
        G,
        writeProperty="communities",
        includeIntermediateCommunities=True,
        relationshipWeightProperty="weight",
    )


The algorithm identified five levels of communities, with the highest level (the least granular one, where communities are largest) having 1,188 communities (as opposed to 1,119 components). Here is a visualization of the communities on the last level using Gephi.


Community structure visualization in Gephi


Visualizing more than 1,000 communities is hard; even picking a color for each one is practically impossible. However, they make for nice artistic renditions.


Building on this, we will create a distinct node for each community and represent their hierarchical structure as an interconnected graph. Later, we will also store community summaries and other attributes as node properties.


    graph.query("""
    MATCH (e:`__Entity__`)
    UNWIND range(0, size(e.communities) - 1 , 1) AS index
    CALL {
      WITH e, index
      WITH e, index
      WHERE index = 0
      MERGE (c:`__Community__` {id: toString(index) + '-' + toString(e.communities[index])})
      ON CREATE SET c.level = index
      MERGE (e)-[:IN_COMMUNITY]->(c)
      RETURN count(*) AS count_0
    }
    CALL {
      WITH e, index
      WITH e, index
      WHERE index > 0
      MERGE (current:`__Community__` {id: toString(index) + '-' + toString(e.communities[index])})
      ON CREATE SET current.level = index
      MERGE (previous:`__Community__` {id: toString(index - 1) + '-' + toString(e.communities[index - 1])})
      ON CREATE SET previous.level = index - 1
      MERGE (previous)-[:IN_COMMUNITY]->(current)
      RETURN count(*) AS count_1
    }
    RETURN count(*)
    """)
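The query above builds community ids of the form level-communityId and chains them from the most granular level upward. The following plain-Python sketch mirrors that logic for a single entity so the resulting structure is easier to picture. The helper name and the community values are hypothetical (only the "0-6014" id appears in the sample output later in this post).

```python
def community_hierarchy(entity_id, communities):
    """Build community node ids and IN_COMMUNITY edges for one entity."""
    ids = [f"{level}-{community}" for level, community in enumerate(communities)]
    if not ids:
        return [], []
    # The entity itself links to its level-0 community
    edges = [(entity_id, ids[0])]
    # Each community links to its parent community on the next level
    edges += list(zip(ids, ids[1:]))
    return ids, edges

# Hypothetical per-level community assignments for one entity
ids, edges = community_hierarchy("Darrell Hughes", [6014, 312, 312, 40, 40])
print(ids)
for start, end in edges:
    print(f"({start})-[:IN_COMMUNITY]->({end})")
```

Because the level is part of the id, a community that keeps the same Leiden id across two levels still becomes two separate nodes, one per level.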


The authors also introduce a community rank, indicating the number of distinct text chunks in which the entities within the community appear:


    graph.query("""
    MATCH (c:__Community__)<-[:IN_COMMUNITY*]-(:__Entity__)<-[:MENTIONS]-(d:Document)
    WITH c, count(distinct d) AS rank
    SET c.community_rank = rank;
    """)
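The rank computation can be sketched in plain Python as well: for every community, collect the documents (text chunks) that mention any of its entities and count the distinct ones. The dictionaries below are hypothetical sample data, and `community_ranks` is an illustrative helper, not part of the original code.

```python
def community_ranks(community_members, entity_mentions):
    """Count distinct documents mentioning at least one entity of each community."""
    ranks = {}
    for community_id, members in community_members.items():
        documents = set()
        for entity in members:
            # Entities without any mention contribute nothing
            documents |= entity_mentions.get(entity, set())
        ranks[community_id] = len(documents)
    return ranks

# Hypothetical memberships and chunk mentions
community_members = {
    "0-1": ["Ryanair Dac", "Aidan Murray"],
    "0-2": ["Chief Pilot"],
}
entity_mentions = {
    "Ryanair Dac": {"chunk-1", "chunk-2"},
    "Aidan Murray": {"chunk-2", "chunk-3"},
}
ranks = community_ranks(community_members, entity_mentions)
print(ranks)
```

A chunk that mentions several entities of the same community is counted only once, which matches the count(distinct d) in the Cypher query.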


Now let's examine a sample hierarchical structure with many intermediate communities merging at higher levels. The communities are non-overlapping, meaning that each entity belongs to exactly one community at each level.


Hierarchical community structure; communities are orange and entities are purple


The image represents a hierarchical structure produced by the Leiden community detection algorithm. The purple nodes represent individual entities, while the orange nodes represent hierarchical communities.


The hierarchy shows how these entities are organized into various communities, with smaller communities merging into larger ones at higher levels.


Let's now examine how smaller communities merge at higher levels.


Hierarchical community structure


This image illustrates that less connected entities, and consequently smaller communities, experience minimal changes across levels. For example, the community structure here only changes in the first two levels but remains identical for the last three. Consequently, the hierarchical levels often appear redundant for these entities, as the overall organization does not change significantly between tiers.


Let's examine the number of communities and their sizes at the different levels in more detail:


    community_size = graph.query(
        """
    MATCH (c:__Community__)<-[:IN_COMMUNITY*]-(e:__Entity__)
    WITH c, count(distinct e) AS entities
    RETURN split(c.id, '-')[0] AS level, entities
    """
    )
    community_size_df = pd.DataFrame.from_records(community_size)
    percentiles_data = []
    for level in community_size_df["level"].unique():
        subset = community_size_df[community_size_df["level"] == level]["entities"]
        num_communities = len(subset)
        percentiles = np.percentile(subset, [25, 50, 75, 90, 99])
        percentiles_data.append(
            [
                level,
                num_communities,
                percentiles[0],
                percentiles[1],
                percentiles[2],
                percentiles[3],
                percentiles[4],
                max(subset)
            ]
        )

    # Create a DataFrame with the percentiles
    percentiles_df = pd.DataFrame(
        percentiles_data,
        columns=[
            "Level",
            "Number of communities",
            "25th Percentile",
            "50th Percentile",
            "75th Percentile",
            "90th Percentile",
            "99th Percentile",
            "Max"
        ],
    )
    percentiles_df


Community size distribution by level


In the original implementation, communities on every level were summarized. In our case, that would be 8,590 communities and, consequently, 8,590 LLM calls. I would argue that, depending on the hierarchical community structure, not every level needs to be summarized. For example, the difference between the last and the next-to-last level is only four communities (1,192 vs. 1,188). We would therefore be creating a lot of redundant summaries. One solution is an implementation that produces a single summary for communities that stay unchanged across levels; another would be to collapse community hierarchies that don't change.
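One way the first idea could look is sketched below: group communities by their exact member set, so that a community that stays identical across several levels is summarized only once. This is an illustrative assumption about how such a deduplication might work, not something implemented in this post or in the original code; the helper name and the sample data are hypothetical.

```python
from collections import defaultdict

def group_identical_communities(community_members):
    """Group community ids that share exactly the same set of members."""
    groups = defaultdict(list)
    for community_id, members in community_members.items():
        # frozenset makes the member set usable as a dictionary key
        groups[frozenset(members)].append(community_id)
    return [sorted(ids) for ids in groups.values()]

# Hypothetical communities; "1-7", "2-7" and "3-7" never change
community_members = {
    "0-1": {"a", "b"},
    "0-2": {"c"},
    "1-7": {"a", "b", "c"},
    "2-7": {"a", "b", "c"},
    "3-7": {"a", "b", "c"},
}
groups = group_identical_communities(community_members)
summaries_needed = len(groups)
print(groups)
print(f"Summaries needed: {summaries_needed} instead of {len(community_members)}")
```

Each group would need one LLM call, and the resulting summary could then be written to every community node in that group.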


Also, I am unsure whether we want to summarize communities with only one member, as they might not provide much value or information. Here, we will summarize communities on levels 0, 1, and 4. First, we need to retrieve their information from the database:


    community_info = graph.query("""
    MATCH (c:`__Community__`)<-[:IN_COMMUNITY*]-(e:__Entity__)
    WHERE c.level IN [0,1,4]
    WITH c, collect(e ) AS nodes
    WHERE size(nodes) > 1
    CALL apoc.path.subgraphAll(nodes[0], {
        whitelistNodes:nodes
    })
    YIELD relationships
    RETURN c.id AS communityId,
           [n in nodes | {id: n.id, description: n.description, type: [el in labels(n) WHERE el <> '__Entity__'][0]}] AS nodes,
           [r in relationships | {start: startNode(r).id, type: type(r), end: endNode(r).id, description: r.description}] AS rels
    """)


At the moment, the community information has the following structure:


    {'communityId': '0-6014',
     'nodes': [{'id': 'Darrell Hughes', 'description': None, 'type': 'Person'},
               {'id': 'Chief Pilot', 'description': None, 'type': 'Person'},
               ...
              ],
     'rels': [{'start': 'Ryanair Dac',
               'description': 'Informed of the change in chief pilot',
               'type': 'INFORMED',
               'end': 'Irish Aviation Authority'},
              {'start': 'Ryanair Dac',
               'description': 'Dismissed after internal investigation found unacceptable behaviour',
               'type': 'DISMISSED',
               'end': 'Aidan Murray'},
              ...
             ]}


Now, we need to prepare an LLM prompt that generates a natural language summary based on the information provided by the elements of our community. We can take some inspiration from the prompt the researchers used.


The authors not only summarized communities but also generated findings for each of them. A finding can be defined as a concise piece of information about a specific event or fact. One such example:


    "summary": "Abila City Park as the central location",
    "explanation": "Abila City Park is the central entity in this community, serving as the location for the POK rally. This park is the common link between all other entities, suggesting its significance in the community. The park's association with the rally could potentially lead to issues such as public disorder or conflict, depending on the nature of the rally and the reactions it provokes. [records: Entities (5), Relationships (37, 38, 39, 40)]"


My intuition is that extracting findings in a single pass might not be as comprehensive as we need, much like extracting entities and relationships.


Furthermore, I haven't found any references to or examples of their use in the authors' code, in either the local or the global search retrievers. As a result, we'll refrain from extracting findings in this instance. Or, as academics often put it: this exercise is left to the reader. Additionally, we've also skipped the claims, or covariate information extraction, which looks similar to findings at first glance.


The prompt we'll use to produce the community summaries is fairly straightforward:


    community_template = """Based on the provided nodes and relationships that belong to the same graph community,
    generate a natural language summary of the provided information:
    {community_info}

    Summary:"""  # noqa: E501

    community_prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "Given an input triples, generate the information summary. No pre-amble.",
            ),
            ("human", community_template),
        ]
    )

    community_chain = community_prompt | llm | StrOutputParser()


The only thing left is to turn the community representations into strings, which reduces the number of tokens by avoiding JSON token overhead, and to wrap the chain as a function:


    def prepare_string(data):
        nodes_str = "Nodes are:\n"
        for node in data['nodes']:
            node_id = node['id']
            node_type = node['type']
            if 'description' in node and node['description']:
                node_description = f", description: {node['description']}"
            else:
                node_description = ""
            nodes_str += f"id: {node_id}, type: {node_type}{node_description}\n"

        rels_str = "Relationships are:\n"
        for rel in data['rels']:
            start = rel['start']
            end = rel['end']
            rel_type = rel['type']
            if 'description' in rel and rel['description']:
                description = f", description: {rel['description']}"
            else:
                description = ""
            rels_str += f"({start})-[:{rel_type}]->({end}){description}\n"

        return nodes_str + "\n" + rels_str

    def process_community(community):
        stringify_info = prepare_string(community)
        summary = community_chain.invoke({'community_info': stringify_info})
        return {"community": community['communityId'], "summary": summary}


Now we can generate community summaries for the selected levels. Again, we parallelize the calls for faster execution:


    summaries = []
    with ThreadPoolExecutor() as executor:
        futures = {executor.submit(process_community, community): community for community in community_info}

        for future in tqdm(as_completed(futures), total=len(futures), desc="Processing communities"):
            summaries.append(future.result())


One aspect I didn't mention is that the authors also address the potential issue of exceeding the context size when passing in community information. As graphs grow, communities can grow significantly as well. In our case, the largest community had 545 members. Given that GPT-4o has a context size exceeding 100,000 tokens, we decided to skip this step.
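If you do want a safety check before sending a community to the LLM, a rough sketch could look like the one below. It is not the authors' approach: the characters-per-token ratio, the reserved output budget, and both helper names are assumptions made for illustration, and a real tokenizer would give more accurate counts.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; the 4-characters-per-token ratio is an assumption."""
    return len(text) // chars_per_token + 1

def fits_context(text, context_limit=100_000, reserved_for_output=2_000):
    """Check whether the prompt text leaves room for the model's answer."""
    return estimate_tokens(text) + reserved_for_output <= context_limit

# Hypothetical stringified community descriptions of different sizes
small_community = "id: Ryanair Dac, type: Organization\n" * 50
huge_community = "id: Ryanair Dac, type: Organization\n" * 20_000

print(fits_context(small_community))
print(fits_context(huge_community))
```

Communities that fail the check could be truncated or split before summarization; how exactly to do that is left open here.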


As our final step, we will store the community summaries back to the database:


    graph.query("""
    UNWIND $data AS row
    MERGE (c:__Community__ {id:row.community})
    SET c.summary = row.summary
    """, params={"data": summaries})


The final graph structure:


The graph now contains the original documents, the extracted entities and relationships, as well as the hierarchical community structure and summaries.

Summary

The authors of the "From Local to Global" paper have done a great job of demonstrating a new approach to GraphRAG. They show how we can combine and summarize information from various documents into a hierarchical knowledge graph structure.


One thing that isn't explicitly mentioned is that we can also integrate structured data sources into the graph; the input doesn't have to be limited to unstructured text.


What I particularly appreciate about their extraction approach is that they capture descriptions for both nodes and relationships. Descriptions allow the LLM to retain more information than reducing everything to just node IDs and relationship types.


Additionally, they demonstrate that a single extraction pass over the text might not capture all relevant information, and they introduce logic to perform multiple passes if necessary. The authors also present an interesting idea for summarizing graph communities, which allows us to embed and index condensed topical information across multiple data sources.


In the next blog post, we will go over the local and global search retriever implementations and talk about other approaches we could implement based on the given graph structure.


As always, the code is available on GitHub.


This time, I've also uploaded the database dump so you can explore the results and experiment with different retriever options.


You can also import this dump into a forever-free Neo4j AuraDB instance, which we can use for the retrieval explorations since we don't need Graph Data Science algorithms for those, only graph pattern matching, vector, and full-text indexes.


Learn more about Neo4j integrations with all the GenAI frameworks and practical graph algorithms in my book "Graph Algorithms for Data Science."


To learn more about this topic, join us at NODES 2024 on November 7, our free virtual developer conference on intelligent apps, knowledge graphs, and AI. Register now!