Want AI to Actually Understand Your Code? This Tool Says It Can Help
by @badmonster0


by LJ | 8m | 2025/03/21

Too Long; Didn't Read

A step-by-step guide to indexing a codebase for RAG with CocoIndex and Tree-sitter: chunking, embedding, semantic search, and building a vector index for efficient retrieval.

In this blog, we will show you how to index a codebase for RAG with CocoIndex. CocoIndex is a tool to help you index and query your data. It is designed to be used as a framework for building your own data pipeline. CocoIndex provides built-in support for codebase chunking, with native Tree-sitter support.

Tree-sitter

Tree-sitter is a parser generator tool and an incremental parsing library, available in Rust 🦀 - GitHub . CocoIndex has a built-in Rust integration with Tree-sitter to efficiently parse code and extract syntax trees for various programming languages.


Codebase chunking is the process of breaking a codebase down into smaller, semantically meaningful chunks. CocoIndex leverages Tree-sitter's capabilities to chunk code intelligently, based on the actual syntax structure rather than arbitrary line breaks. These semantically coherent chunks are then used to build a more effective index for RAG systems, enabling more precise code retrieval and better context preservation.
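
To make the idea of syntax-aware chunking concrete, here is a minimal sketch that splits Python source at top-level syntax nodes instead of at fixed line counts. It uses Python's standard `ast` module as a stand-in for Tree-sitter, so it only illustrates the principle and is not how CocoIndex works internally; the helper name `chunk_by_top_level_nodes` is hypothetical.

```python
import ast


def chunk_by_top_level_nodes(source: str) -> list[str]:
    """Return one chunk per top-level statement (import, def, class, ...)."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # lineno and end_lineno are 1-based and inclusive.
        start = node.lineno - 1
        # Decorators sit above the def/class line, so start from the first one.
        decorators = getattr(node, "decorator_list", [])
        if decorators:
            start = min(d.lineno for d in decorators) - 1
        chunks.append("\n".join(lines[start:node.end_lineno]))
    return chunks


SAMPLE = '''import os

def extension(name):
    return os.path.splitext(name)[1]

class Greeter:
    def hello(self):
        return "hi"
'''

chunks = chunk_by_top_level_nodes(SAMPLE)
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---")
    print(chunk)
```

Each chunk is a complete import, function, or class, so a retrieved chunk never starts or ends in the middle of a definition.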


Fast pass 🚀 - you can find the full code here . It is only ~50 lines of Python code for the RAG pipeline, check it out 🤗!

Please give CocoIndex a star on GitHub to support us if you like our work. Thank you so much with a warm coconut hug 🥥🤗.

Prerequisites

If you don't have Postgres installed, please refer to the installation guide . CocoIndex uses Postgres to manage the data index. Supporting other databases, including in-process ones, is on our roadmap. If you are interested in other databases, please let us know by creating a GitHub issue or on Discord .

Define the CocoIndex flow

Let's define the CocoIndex flow to read from a codebase and index it for RAG.

CocoIndex flow for code embedding


The flow diagram above illustrates how we'll process our codebase:

  1. Read code files from the local filesystem
  2. Extract file extensions
  3. Split code into semantic chunks using Tree-sitter
  4. Generate embeddings for each chunk
  5. Store in a vector database for retrieval


Let's implement this flow step by step.

1. Add the codebase as a source

    import cocoindex

    @cocoindex.flow_def(name="CodeEmbedding")
    def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
        """
        Define an example flow that embeds files into a vector database.
        """
        data_scope["files"] = flow_builder.add_source(
            cocoindex.sources.LocalFile(path="../..",
                                        included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx"],
                                        excluded_patterns=[".*", "target", "**/node_modules"]))
        code_embeddings = data_scope.add_collector()

In this example, we are going to index the cocoindex codebase from the root directory. You can change the path to the codebase you want to index. We will index all files with the extensions .py , .rs , .toml , .md , .mdx , and skip directories starting with . , target (in the root) and node_modules (under any directory).
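
The filtering rule described above can be approximated in plain Python. This is a rough sketch of the behavior as the paragraph describes it, not CocoIndex's actual pattern matcher, whose glob semantics may differ; the helper `is_indexed` and the sample paths are hypothetical.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

INCLUDED = ["*.py", "*.rs", "*.toml", "*.md", "*.mdx"]


def is_indexed(path: str) -> bool:
    """Approximate the include/exclude rules described in the text."""
    parts = PurePosixPath(path).parts
    dirs, name = parts[:-1], parts[-1]
    # Skip anything under a directory whose name starts with "." (e.g. .git).
    if any(d.startswith(".") for d in dirs):
        return False
    # Skip `target` only when it is a top-level directory.
    if dirs and dirs[0] == "target":
        return False
    # Skip `node_modules` at any depth.
    if "node_modules" in dirs:
        return False
    # Keep the file only if its name matches one of the included patterns.
    return any(fnmatch(name, pattern) for pattern in INCLUDED)


for p in ["python/cocoindex/flow.py", "target/debug/build.rs",
          "docs/node_modules/pkg/README.md", ".github/workflows/ci.toml"]:
    print(p, "->", is_indexed(p))
```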

flow_builder.add_source will create a table with the following sub-fields, see the documentation here.

  • filename (key, type: str ): the filename of the file, e.g. dir1/file1.md
  • content (type: str if binary is False , otherwise bytes ): the content of the file

2. Process each file and collect the information

2.1 Extract the extension of a filename

First, let's define a function to extract the extension of a filename while processing each file. You can find the documentation for custom functions here .

    import os

    @cocoindex.op.function()
    def extract_extension(filename: str) -> str:
        """Extract the extension of a filename."""
        return os.path.splitext(filename)[1]


Then we are going to process each file and collect the information.

    # ...
    with data_scope["files"].row() as file:
        file["extension"] = file["filename"].transform(extract_extension)


Here we extract the extension of the filename and store it in the extension field. For example, if the filename is spec.rs , the extension field will be .rs .
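
It can help to see what `os.path.splitext` returns for a few less obvious filenames. The sketch below repeats the function body without the CocoIndex decorator so that it runs on its own; the sample filenames are illustrative.

```python
import os


def extract_extension(filename: str) -> str:
    """Same body as the flow function above, minus the decorator."""
    return os.path.splitext(filename)[1]


examples = ["spec.rs", "dir1/file1.md", "archive.tar.gz", ".gitignore", "Makefile"]
extensions = {name: extract_extension(name) for name in examples}
for name, ext in extensions.items():
    print(f"{name!r} -> {ext!r}")
```

Only the last suffix is returned, so `archive.tar.gz` yields `.gz`. Dotfiles such as `.gitignore` and names without a dot yield an empty string.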

2.2 Split the file into chunks

Next, we will split the file into chunks. We use the SplitRecursively function to do this. You can find the documentation for the function here .


CocoIndex provides built-in support for Tree-sitter, so you can pass the language to the language parameter. To see all supported language names and extensions, see the documentation here . All the major languages are supported, e.g., Python, Rust, JavaScript, TypeScript, Java, C++, etc. If the language is unspecified or not supported, the file will be treated as plain text.


    with data_scope["files"].row() as file:
        # ...
        file["chunks"] = file["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language=file["extension"], chunk_size=1000, chunk_overlap=300)
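
To build intuition for the chunk_size and chunk_overlap parameters, here is a naive sliding-window splitter. It is only an illustration: SplitRecursively picks split points from the syntax tree, and treating both sizes as character counts is an assumption made for this sketch. The helper `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# 2400 characters of filler text standing in for a source file.
text = "".join(chr(ord("a") + i % 26) for i in range(2400))
windows = split_with_overlap(text, chunk_size=1000, chunk_overlap=300)
print([len(w) for w in windows])
```

With these values each window starts 700 characters after the previous one, so the last 300 characters of a window reappear at the start of the next. The overlap means code near a chunk boundary is still retrieved together with some of its surrounding context.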

2.3 Embed the chunks

We will use the SentenceTransformerEmbed function to embed the chunks. You can find the documentation for the function here . There are 12k models supported by 🤗 Hugging Face . You can just pick your favorite model.

    def code_to_embedding(text: cocoindex.DataSlice) -> cocoindex.DataSlice:
        """
        Embed the text using a SentenceTransformer model.
        """
        return text.transform(
            cocoindex.functions.SentenceTransformerEmbed(
                model="sentence-transformers/all-MiniLM-L6-v2"))


Then, for each chunk, we will embed it using the code_to_embedding function and collect the embeddings in the code_embeddings collector.


We extract this code_to_embedding function instead of calling transform(cocoindex.functions.SentenceTransformerEmbed(...)) directly in place.


This is because we want to share it between the indexing flow definition and the query handler definition. Alternatively, to keep things simpler, it is also fine to skip this extra function and do things directly in place. Copy-pasting a little is not a big deal, and we did exactly that for the quickstart project .


    with data_scope["files"].row() as file:
        # ...
        with file["chunks"].row() as chunk:
            chunk["embedding"] = chunk["text"].call(code_to_embedding)
            code_embeddings.collect(filename=file["filename"], location=chunk["location"],
                                    code=chunk["text"], embedding=chunk["embedding"])

2.4 Collect the embeddings

Finally, let's export the embeddings to a table.

    code_embeddings.export(
        "code_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_index=[("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
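
The primary_key_fields setting names the columns that identify a row. As a rough mental model, and assuming the usual upsert behavior of a primary key, the in-memory sketch below shows how a (filename, location) pair keeps one row per chunk. The `table` dictionary, the `upsert` helper, and the string form of location are all hypothetical; the real storage is a Postgres table managed by CocoIndex.

```python
# In-memory stand-in for the exported table, keyed by (filename, location).
table: dict[tuple[str, str], dict] = {}


def upsert(row: dict) -> None:
    """Insert the row, or replace the existing row with the same key."""
    key = (row["filename"], row["location"])
    table[key] = row


upsert({"filename": "a.py", "location": "0:120",
        "code": "def f(): ...", "embedding": [0.1, 0.2]})
upsert({"filename": "a.py", "location": "120:300",
        "code": "def g(): ...", "embedding": [0.3, 0.4]})
# Writing the first chunk again replaces its row instead of adding a duplicate.
upsert({"filename": "a.py", "location": "0:120",
        "code": "def f(x): ...", "embedding": [0.5, 0.6]})

print(len(table))
print(table[("a.py", "0:120")]["code"])
```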

3. Set up a query handler for your index

We will use SimpleSemanticsQueryHandler to query the index. Note that we need to pass the code_to_embedding function to the query_transform_flow parameter. This is because the query handler must embed the query with the same model that was used in the indexing flow.

    query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
        name="SemanticsSearch",
        flow=code_embedding_flow,
        target_name="code_embeddings",
        query_transform_flow=code_to_embedding,
        default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)
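
The reason for sharing one embedding function can be shown with a toy example: query vectors are only comparable to indexed vectors when both come from the same function. The sketch below replaces the real model with a tiny bag-of-words embedding; `toy_embed`, `search`, and the sample snippets are hypothetical and unrelated to the CocoIndex API.

```python
import math

# Tiny corpus standing in for indexed code chunks.
snippets = {
    "parser.rs": "parse source code into a syntax tree",
    "storage.py": "export rows to a postgres table",
    "embed.py": "embed text chunks with a sentence transformer model",
}

# Fixed vocabulary so that every vector has the same dimensions.
VOCAB = sorted({word for text in snippets.values() for word in text.split()})


def toy_embed(text: str) -> list[float]:
    """Count vocabulary words, then L2-normalize (unknown words are ignored)."""
    words = text.lower().split()
    vec = [float(words.count(word)) for word in VOCAB]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


# Index time: embed every snippet with toy_embed.
index = {name: toy_embed(text) for name, text in snippets.items()}


def search(query: str, top_k: int = 2) -> list[tuple[float, str]]:
    """Query time: embed with the SAME function, then rank by dot product."""
    q = toy_embed(query)
    scored = [(sum(a * b for a, b in zip(q, vec)), name)
              for name, vec in index.items()]
    return sorted(scored, reverse=True)[:top_k]


results = search("syntax tree")
for score, name in results:
    print(f"[{score:.3f}] {name}")
```

If the query were embedded by a different function or model, its vector would live in a different space and the scores would be meaningless.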


Define a main function to run the query handler.

    from dotenv import load_dotenv

    @cocoindex.main_fn()
    def _run():
        # Run queries in a loop to demonstrate the query capabilities.
        while True:
            try:
                query = input("Enter search query (or Enter to quit): ")
                if query == '':
                    break
                results, _ = query_handler.search(query, 10)
                print("\nSearch results:")
                for result in results:
                    print(f"[{result.score:.3f}] {result.data['filename']}")
                    print(f"    {result.data['code']}")
                    print("---")
                print()
            except KeyboardInterrupt:
                break

    if __name__ == "__main__":
        load_dotenv(override=True)
        _run()


The @cocoindex.main_fn() decorator initializes the library with settings loaded from environment variables. See the initialization documentation for more details.

Run the index setup and update

🎉 Now you are all set!

Run the following commands to set up and update the index.

    python main.py cocoindex setup
    python main.py cocoindex update


You'll see the index update status in the terminal

Terminal showing the index update process


Test the query


At this point, you can start the cocoindex server and develop your RAG runtime against the data.


To test your index, there are two options:

Option 1: Run the index server in the terminal

    python main.py


When you see the prompt, you can enter your search query. For example: spec.

    Enter search query (or Enter to quit): spec


You can find the search results in the terminal

Search results in the terminal


Each entry in the returned results contains a score (cosine similarity), the filename, and the matched code snippet. In cocoindex, we use cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY to measure the similarity between the query and the indexed data. You can switch to other metrics and quickly test them out.


To learn more about cosine similarity, see the Wiki .
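
As a quick reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The function below is a plain-Python sketch of that definition for illustration; in the pipeline above the score is computed by the vector index, not by code like this.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return dot(a, b) / (|a| * |b|), or 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 1.0], [-1.0, -1.0])
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and opposite vectors score close to -1. The magnitude of the vectors does not affect the score.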


Option 2: Run CocoInsight to understand your data pipeline and data index


CocoInsight is a tool to help you understand your data pipeline and data index. It connects to your local CocoIndex server with zero data retention.


CocoInsight is in Early Access now (Free) 😊 You found us! A quick 3-minute video tutorial about CocoInsight: Watch on YouTube .

Run the CocoIndex server

    python main.py cocoindex server -c https://cocoindex.io


Once the server is running, open CocoInsight in your browser. You will be able to connect to your local CocoIndex server and explore your data pipeline and index.

CocoInsight UI showing the data exploration view


On the right side, you can see the data flow that we defined.


On the left side, you can see the data index in the data preview.

CocoInsight Data Preview showing the indexed code chunks

You can click on any row to see the details of that data entry, including the full content of the code chunks and their embeddings.

Community

We love to hear from the community! You can find us on GitHub and Discord .


If you like this post and our work, please support CocoIndex on GitHub with a star ⭐. Thank you with a warm coconut hug 🥥🤗.