In this blog, we will show you how to index a codebase for RAG with CocoIndex. CocoIndex is a tool that helps you index and query your data. It is designed to be used as a framework for building your own data pipeline. CocoIndex provides built-in support for codebase chunking, with native Tree-sitter support.
Tree-sitter is a parser generator tool and an incremental parsing library, available in Rust 🦀 - GitHub. CocoIndex has a built-in Rust integration with Tree-sitter to efficiently parse code and extract syntax trees for various programming languages.
Codebase chunking is the process of breaking a codebase down into smaller, semantically meaningful chunks. CocoIndex leverages Tree-sitter to chunk code intelligently, based on the actual syntax structure rather than arbitrary line breaks. These semantically coherent chunks are then used to build a more effective index for RAG systems, enabling more precise code retrieval and better context preservation.
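To make the difference between syntax-based and line-based chunking concrete, here is a minimal sketch. It is not how CocoIndex chunks code: it uses Python's standard ast module as a stand-in for Tree-sitter, and the helper names chunk_by_lines and chunk_by_syntax are hypothetical.

```python
import ast

SOURCE = '''\
import math

def area(r):
    return math.pi * r * r

class Circle:
    def __init__(self, r):
        self.r = r

    def area(self):
        return area(self.r)
'''


def chunk_by_lines(source: str, lines_per_chunk: int) -> list[str]:
    """Naive chunking: cut every N lines, regardless of syntax."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]


def chunk_by_syntax(source: str) -> list[str]:
    """Syntax-aware chunking: one chunk per top-level statement."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node) for node in tree.body]


line_chunks = chunk_by_lines(SOURCE, 4)   # cuts the Circle class in half
syntax_chunks = chunk_by_syntax(SOURCE)   # import, function, whole class
```

The line-based version splits the Circle class across two chunks, while the syntax-based version keeps each definition intact, which is the property that helps retrieval.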
Fast pass 🚀 - you can find the full code here. It takes only ~50 lines of Python code for the RAG pipeline, check it out 🤗!
Please give CocoIndex a star on Github to support us if you like our work. Thank you so much with a warm coconut hug 🥥🤗.
If you don't have Postgres installed, please refer to the installation guide. CocoIndex uses Postgres to manage the data index. Support for other databases, including ones already in progress, is on our roadmap. If you are interested in other databases, please let us know by creating a GitHub issue or on Discord.
Let's define the CocoIndex flow to read from a codebase and index it for RAG.
The flow diagram above illustrates how we'll process our codebase: read the code files from the local filesystem, extract each file's extension, split the code into chunks using Tree-sitter, generate an embedding for each chunk, and export the embeddings to a Postgres table.
Let's implement this flow step by step.
    # Imports used by the snippets in this post.
    import os
    from dotenv import load_dotenv
    import cocoindex

    @cocoindex.flow_def(name="CodeEmbedding")
    def code_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
        """
        Define an example flow that embeds files into a vector database.
        """
        # Read source files from the local filesystem.
        data_scope["files"] = flow_builder.add_source(
            cocoindex.sources.LocalFile(path="../..",
                                        included_patterns=["*.py", "*.rs", "*.toml", "*.md", "*.mdx"],
                                        excluded_patterns=[".*", "target", "**/node_modules"]))
        # Collector that gathers one row per embedded chunk.
        code_embeddings = data_scope.add_collector()
In this example, we index the cocoindex codebase itself, starting from its root directory. You can change the path to point at the codebase you want to index. We index all files with the extensions .py, .rs, .toml, .md, and .mdx, and skip directories starting with ., target (in the root), and node_modules (under any directory).
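As a rough illustration of these filtering rules, the sketch below re-implements them by hand for paths relative to the root. It follows the description above rather than CocoIndex's actual glob matching, which may differ in edge cases; should_index and the sample paths are hypothetical.

```python
from pathlib import PurePosixPath

INCLUDED_EXTENSIONS = {".py", ".rs", ".toml", ".md", ".mdx"}


def should_index(relative_path: str) -> bool:
    """Approximate the include/exclude rules for a path relative to the root."""
    path = PurePosixPath(relative_path)
    directories = path.parts[:-1]
    # Skip anything under a directory whose name starts with "." (e.g. .github).
    if any(d.startswith(".") for d in directories):
        return False
    # Skip the "target" directory, but only at the root.
    if directories and directories[0] == "target":
        return False
    # Skip "node_modules" at any depth.
    if "node_modules" in directories:
        return False
    # Keep only files with one of the included extensions.
    return path.suffix in INCLUDED_EXTENSIONS


kept = should_index("src/lib.rs")
skipped = should_index("target/debug/build.rs")
```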
flow_builder.add_source creates a table with the following sub fields (see the documentation here):
filename (key, type: str): the filename of the file, e.g. dir1/file1.md
content (type: str if binary is False, otherwise bytes): the content of the file
First, let's define a function that extracts the extension of a filename while processing each file. You can find the documentation for custom functions here.
    @cocoindex.op.function()
    def extract_extension(filename: str) -> str:
        """Extract the extension of a filename."""
        return os.path.splitext(filename)[1]
Then we process each file and collect the information.
    # ...
    with data_scope["files"].row() as file:
        file["extension"] = file["filename"].transform(extract_extension)
Here we extract the extension of the filename and store it in the extension field. For example, if the filename is spec.rs, the extension field will be .rs.
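It can help to see what os.path.splitext returns for less obvious filenames. The snippet below runs the same logic as extract_extension without the CocoIndex decorator, on a few sample names chosen for illustration.

```python
import os


def extract_extension(filename: str) -> str:
    """Same logic as the flow function above, without the CocoIndex decorator."""
    return os.path.splitext(filename)[1]


examples = ["spec.rs", "dir1/file1.md", "archive.tar.gz", "Makefile", ".gitignore"]
extensions = {name: extract_extension(name) for name in examples}
# Only the last suffix is returned, and a leading dot is not treated as an extension:
# {'spec.rs': '.rs', 'dir1/file1.md': '.md', 'archive.tar.gz': '.gz',
#  'Makefile': '', '.gitignore': ''}
```

Files such as Makefile yield an empty extension, so no language can be derived from them this way.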
Next, we split each file into chunks using the SplitRecursively function. You can find the documentation for the function here.
CocoIndex provides built-in support for Tree-sitter, so you can pass the language in through the language parameter. To see all supported language names and extensions, see the documentation here. All major languages are supported, e.g. Python, Rust, JavaScript, TypeScript, Java, C++, etc. If the language is unspecified or not supported, the content is treated as plain text.
    with data_scope["files"].row() as file:
        # ...
        file["chunks"] = file["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language=file["extension"], chunk_size=1000, chunk_overlap=300)
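To build intuition for chunk_size and chunk_overlap, here is a deliberately simplified sliding-window sketch. It is not how SplitRecursively works (that function splits along syntax boundaries); it assumes sizes are plain character counts, and sliding_window_chunks is a hypothetical helper.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Return (start_offset, chunk) pairs; neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


# 2400 characters with the same settings as the flow above.
text = "".join(str(i % 10) for i in range(2400))
chunks = sliding_window_chunks(text, chunk_size=1000, chunk_overlap=300)
starts = [start for start, _ in chunks]
```

With these settings each window advances by 700 characters, so the tail of one chunk reappears at the head of the next. That overlap is what keeps context that straddles a boundary retrievable from either side.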
We use the SentenceTransformerEmbed function to embed the chunks. You can find the documentation for the function here. There are 12k models supported by 🤗 Hugging Face, so you can simply pick your favorite.
    def code_to_embedding(text: cocoindex.DataSlice) -> cocoindex.DataSlice:
        """
        Embed the text using a SentenceTransformer model.
        """
        return text.transform(
            cocoindex.functions.SentenceTransformerEmbed(
                model="sentence-transformers/all-MiniLM-L6-v2"))
Then, for each chunk, we embed it using the code_to_embedding function and collect the embeddings into the code_embeddings collector.
We extract this code_to_embedding function instead of calling transform(cocoindex.functions.SentenceTransformerEmbed(...)) directly in place. This is because we want it to be shared between the indexing flow and the query handler definition. Alternatively, to keep things simpler, it is also fine to skip this extra function and do things in place. Copying and pasting a little is not a big deal, and that is what we did for the quickstart project.
    with data_scope["files"].row() as file:
        # ...
        with file["chunks"].row() as chunk:
            chunk["embedding"] = chunk["text"].call(code_to_embedding)
            code_embeddings.collect(filename=file["filename"], location=chunk["location"],
                                    code=chunk["text"], embedding=chunk["embedding"])
Finally, let's export the embeddings to a table.
    code_embeddings.export(
        "code_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_index=[("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
We use SimpleSemanticsQueryHandler to query the index. Note that we need to pass the code_to_embedding function to the query_transform_flow parameter, so that the query handler embeds queries with the same model that was used in the indexing flow.
    query_handler = cocoindex.query.SimpleSemanticsQueryHandler(
        name="SemanticsSearch",
        flow=code_embedding_flow,
        target_name="code_embeddings",
        query_transform_flow=code_to_embedding,
        default_similarity_metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)
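The reason for sharing one embedding function between indexing and querying can be shown with a toy in-memory index. This is only a sketch of the idea, not the query handler's implementation: toy_embed is a hypothetical stand-in for a real model (it counts character trigrams), and the snippets are made up.

```python
import math
from collections import Counter


def toy_embed(text: str) -> dict[str, float]:
    """Hypothetical stand-in for an embedding model: normalized trigram counts."""
    lowered = text.lower()
    counts = Counter(lowered[i:i + 3] for i in range(len(lowered) - 2))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {k: c / norm for k, c in counts.items()} if norm else {}


def search(index, query: str, embed, top_k: int = 10):
    """Embed the query with the SAME function used for the index, then rank."""
    q = embed(query)
    scored = [(sum(w * vec.get(k, 0.0) for k, w in q.items()), name)
              for name, vec in index]
    return sorted(scored, reverse=True)[:top_k]


snippets = {
    "spec.rs": "pub struct FlowSpec { name: String }",
    "main.py": "def main(): print('hello')",
    "README.md": "CocoIndex quickstart guide",
}
# Index time and query time both go through toy_embed.
index = [(name, toy_embed(code)) for name, code in snippets.items()]
results = search(index, "struct FlowSpec", toy_embed, top_k=2)
```

Because both sides live in the same vector space, the scores are comparable. Embedding the query with a different model would produce vectors that cannot be meaningfully compared with the stored ones.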
Define a main function to run the query handler.
    @cocoindex.main_fn()
    def _run():
        # Run queries in a loop to demonstrate the query capabilities.
        while True:
            try:
                query = input("Enter search query (or Enter to quit): ")
                if query == '':
                    break
                results, _ = query_handler.search(query, 10)
                print("\nSearch results:")
                for result in results:
                    print(f"[{result.score:.3f}] {result.data['filename']}")
                    print(f"    {result.data['code']}")
                    print("---")
                print()
            except KeyboardInterrupt:
                break

    if __name__ == "__main__":
        load_dotenv(override=True)
        _run()
The @cocoindex.main_fn() decorator initializes the library with settings loaded from environment variables. See the documentation on initialization for more details.
🎉 Now you are all set!
Run the following commands to set up and update the index.
    python main.py cocoindex setup
    python main.py cocoindex update
You will see the index update status in the terminal.
Test the query
At this point, you can start the cocoindex server and develop your RAG runtime against the data.
To test your index, there are two options:
Option 1: Run the query handler in the terminal
python main.py
When you see the prompt, enter your search query, for example: spec.
Enter search query (or Enter to quit): spec
You can find the search results in the terminal.
Each returned entry contains the score (cosine similarity), the filename, and the matching code snippet. In cocoindex, we use cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY to measure the similarity between the query and the indexed data. You can switch to other metrics and quickly try them out.
To learn more about cosine similarity, see the Wiki.
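For reference, the score shown next to each result follows the standard definition of cosine similarity between a query embedding q and a chunk embedding d, each with n components:

```latex
\[
\operatorname{cos\_sim}(\mathbf{q}, \mathbf{d})
  = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}
  = \frac{\sum_{i=1}^{n} q_i d_i}
         {\sqrt{\sum_{i=1}^{n} q_i^2} \, \sqrt{\sum_{i=1}^{n} d_i^2}}
\]
% Worked example with made-up vectors q = (1, 2, 2) and d = (2, 0, 1):
% q . d = 1*2 + 2*0 + 2*1 = 4,  ||q|| = 3,  ||d|| = sqrt(5)
\[
\operatorname{cos\_sim}(\mathbf{q}, \mathbf{d}) = \frac{4}{3\sqrt{5}} \approx 0.596
\]
```

The value depends only on the angle between the two vectors, not on their lengths. A score of 1 means they point in the same direction, and 0 means they are orthogonal.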
Option 2: Run CocoInsight to understand your data pipeline and data index
CocoInsight is a tool that helps you understand your data pipeline and data index. It connects to your local CocoIndex server with zero data retention.
CocoInsight is in Early Access now (free) 😊 You found us! Here is a quick 3-minute video tutorial about CocoInsight: Watch on YouTube.
python main.py cocoindex server -c https://cocoindex.io
Once the server is running, open CocoInsight in your browser. You will be able to connect to your local CocoIndex server and explore your data pipeline and index.
On the right side, you can see the data flow that we defined.
On the left side, you can see the data index in the data preview.
You can click on any row to see the details of that data entry, including the full content of the code chunks and their embeddings.
We love to hear from the community! You can find us on Github and Discord.
If you like this post and our work, please support CocoIndex on Github with a star ⭐. Thank you with a warm coconut hug 🥥🤗.