This article takes a deep dive into how a multimodal RAG system can be built with Milvus and how it opens up a range of possibilities for AI systems.
Being confined to a single data format is no longer good enough. As businesses depend more heavily on information to make critical decisions, they need the ability to compare data in disparate formats. Fortunately, traditional AI systems constrained to a single data type have given way to multimodal systems that can understand and process complex information.
Multimodal search and multimodal retrieval-augmented generation (RAG) systems have recently shown great progress in this field. These systems process multiple types of data, including text, images, and audio, to provide context-aware responses.
In this blog post, we'll discuss how developers can build their own multimodal RAG system with Milvus. We'll also walk you through building such a system that can handle text and image data, perform similarity searches, and leverage a language model to refine the output. So, let's get started.
A vector database is a special type of database used to store, index, and retrieve vector embeddings, which are mathematical representations of data. They allow you to compare data not just for equivalence but for semantic similarity.
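To make that distinction concrete, here is a toy sketch in pure Python. The two-dimensional "embeddings" are made up for illustration; real embeddings have hundreds or thousands of dimensions produced by a model.

```python
# Toy illustration: equivalence lookup vs. semantic (nearest-embedding) lookup.
# The vectors below are invented for the example, not real model output.
docs = {"cat": [1.0, 0.9], "automobile": [0.1, 0.2]}
query = "car"
query_vec = [0.12, 0.18]  # pretend embedding for "car"

# Equivalence: the literal key "car" is not in the store, so this fails.
exact = docs.get(query)  # None

def dist(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Semantic: the document whose embedding is nearest to the query wins.
best = min(docs, key=lambda k: dist(docs[k], query_vec))
print(best)  # "automobile" — semantically closest to "car"
```

Even though the word "car" never appears in the store, the semantically related entry is still found, which is exactly the behavior a vector database provides at scale.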
Milvus helps developers with a flexible solution for managing and querying large-scale vector data. Its efficiency makes Milvus an ideal choice for developers building applications with deep-learning models, such as retrieval-augmented generation (RAG), multimodal search, recommendation engines, and anomaly detection.
Milvus offers multiple deployment options to match developers' needs.
Before building the system, it is important to understand traditional text-based RAG and its evolution into multimodal RAG.
Retrieval-Augmented Generation (RAG) is a technique for retrieving contextual information from external sources and generating more accurate output from large language models (LLMs). Traditional RAG is a highly effective strategy for improving LLM output, but it remains limited to textual information. In many real-world applications, data extends far beyond text: incorporating images, charts, and other modalities provides critical context.
Multimodal RAG addresses the above limitation by enabling the use of different data types, providing better context to LLMs.
Simply put, in a multimodal RAG system, the retrieval component searches for relevant information across different data modalities, and the generation component generates more accurate results based on the retrieved information.
Vector embeddings and similarity search are two fundamental concepts of multimodal RAG. Let's look at both of them.
As mentioned above, vector embeddings are mathematical/numerical representations of data. Machines use these representations to understand the semantic meaning of different types of data, such as text, images, and audio.
In natural language processing (NLP), document chunks are converted into vectors, and semantically similar words are mapped to nearby points in the vector space. The same goes for images, where embeddings represent their semantic features. This lets us express attributes such as color, texture, and object shapes in a numeric format.
The main goal of using vector embeddings is to preserve the relationships and similarities between different pieces of data.
Similarity search is used to find and locate data in a given dataset. In the context of vector embeddings, similarity search finds the vectors in the dataset that are closest to the query vector.
The following methods are commonly used to measure similarity between vectors: Euclidean distance (L2), inner product (IP), and cosine similarity (COSINE).
The choice of similarity measure usually depends on the application-specific data and on how the developer approaches the problem.
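As an illustration, these measures can be computed directly in a few lines of pure Python (the toy vectors are made up; production systems use optimized libraries):

```python
import math

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the magnitude

# Euclidean (L2) distance: sensitive to magnitude.
l2 = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Inner (dot) product: rewards both alignment and magnitude.
ip = sum(x * y for x, y in zip(a, b))

# Cosine similarity: direction only, invariant to magnitude.
cos = ip / (
    math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
)

print(l2, ip, cos)  # cosine is exactly 1.0 here despite a nonzero L2 distance
```

The example shows why the choice matters: `a` and `b` point in the same direction, so cosine similarity treats them as identical, while L2 distance reports them as far apart.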
Performing similarity search on large-scale datasets demands substantial compute power and resources. This is where approximate nearest neighbor (ANN) algorithms come in: ANN algorithms trade a small amount of accuracy for a significant speedup, which makes them a good choice for large-scale applications.
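For intuition, exact nearest-neighbor search is just a linear scan over every vector, which is precisely the cost that ANN indexes avoid at scale. A toy pure-Python sketch of the exact version (dataset and query vectors are invented for the example):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def exact_top_k(query, dataset, k=2):
    # O(n * d) scan: fine for small sets, far too slow for millions of
    # vectors — the regime where ANN algorithms such as HNSW trade a
    # little accuracy for a large speedup.
    scored = [(cosine(query, vec), idx) for idx, vec in enumerate(dataset)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]

dataset = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(exact_top_k([1.0, 0.05], dataset, k=2))  # [0, 1]
```

An ANN index returns (approximately) the same top-k list while inspecting only a small fraction of the dataset.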
Milvus uses advanced ANN algorithms, including HNSW and DiskANN, to perform efficient similarity searches over large vector embedding datasets, enabling developers to find relevant data points quickly. In addition, Milvus supports other indexing algorithms, such as IVF and CAGRA, making it an even more efficient vector search solution.
Now that we've covered the concepts, it's time to build a multimodal RAG system with Milvus. For this example, we will use Milvus Lite (the lightweight version of Milvus, ideal for experimentation and prototyping) for vector storage and retrieval, BGE for precise image processing and embedding, and GPT-4o for advanced result reranking.
First, you will need a Milvus instance to store your data. You can set up Milvus Lite with pip, run a local instance with Docker, or sign up for a free hosted Milvus account through Zilliz Cloud.
Second, you will need an LLM for your RAG pipeline, so head over to OpenAI and get an API key.
Next, create a new directory and a Python virtual environment.
For this tutorial, you will also need a few dependencies:
pip install -U pymilvus
pip install --upgrade pymilvus openai datasets opencv-python timm einops ftfy peft tqdm
git clone https://github.com/FlagOpen/FlagEmbedding.git
pip install -e FlagEmbedding
The following commands download the example data and extract it into a local folder, "./images_folder":
wget https://github.com/milvus-io/bootcamp/releases/download/data/amazon_reviews_2023_subset.tar.gz
tar -xzf amazon_reviews_2023_subset.tar.gz
We will use the Visualized BGE model "bge-visualized-base-en-v1.5" to generate embeddings for both images and text.
Now, download the model weights from HuggingFace:
wget https://huggingface.co/BAAI/bge-visualized/resolve/main/Visualized_base_en_v1.5.pth
Then, let's build an encoder.
import torch
from visual_bge.modeling import Visualized_BGE


class Encoder:
    def __init__(self, model_name: str, model_path: str):
        self.model = Visualized_BGE(model_name_bge=model_name, model_weight=model_path)
        self.model.eval()

    def encode_query(self, image_path: str, text: str) -> list[float]:
        with torch.no_grad():
            query_emb = self.model.encode(image=image_path, text=text)
        return query_emb.tolist()[0]

    def encode_image(self, image_path: str) -> list[float]:
        with torch.no_grad():
            query_emb = self.model.encode(image=image_path)
        return query_emb.tolist()[0]


model_name = "BAAI/bge-base-en-v1.5"
model_path = "./Visualized_base_en_v1.5.pth"  # Change to your own value if using a different model path
encoder = Encoder(model_name, model_path)
This section shows how to load the example images into our database along with their corresponding embeddings.
Generate embeddings
First, we need to create embeddings for all the images in the dataset.
All images are loaded from the data directory and converted into embeddings.
import os
from tqdm import tqdm
from glob import glob

data_dir = (
    "./images_folder"  # Change to your own value if using a different data directory
)
image_list = glob(
    os.path.join(data_dir, "images", "*.jpg")
)  # We will only use images ending with ".jpg"
image_dict = {}
for image_path in tqdm(image_list, desc="Generating image embeddings: "):
    try:
        image_dict[image_path] = encoder.encode_image(image_path)
    except Exception as e:
        print(f"Failed to generate embedding for {image_path}. Skipped.")
        continue
print("Number of encoded images:", len(image_dict))
In this section, we will first search for relevant images using a multimodal query, then use an LLM service to rerank the retrieved results and find the best one, along with an explanation.
Run a multimodal search
Now we are ready to perform an advanced multimodal search, with the query composed of an image and a text instruction.
query_image = os.path.join(
    data_dir, "leopard.jpg"
)  # Change to your own query image path
query_text = "phone case with this image theme"

query_vec = encoder.encode_query(image_path=query_image, text=query_text)
search_results = milvus_client.search(
    collection_name=collection_name,
    data=[query_vec],
    output_fields=["image_path"],
    limit=9,  # Max number of search results to return
    search_params={"metric_type": "COSINE", "params": {}},  # Search parameters
)[0]
retrieved_images = [hit.get("entity").get("image_path") for hit in search_results]
print(retrieved_images)
The output looks like this:
['./images_folder/images/518Gj1WQ-RL._AC_.jpg', './images_folder/images/41n00AOfWhL._AC_.jpg', ...]
Rerank results with GPT-4o
Now, we will use GPT-4o to rank the retrieved images and find the best-matched result. The LLM will also explain why it ranks the result that way.
1. Create a panoramic view.
import numpy as np
import cv2
from PIL import Image

img_height = 300
img_width = 300
row_count = 3


def create_panoramic_view(query_image_path: str, retrieved_images: list) -> np.ndarray:
    """
    creates a row_count x row_count panoramic view from a list of images

    args:
        images: list of images to be combined

    returns:
        np.ndarray: the panoramic view image
    """
    panoramic_width = img_width * row_count
    panoramic_height = img_height * row_count
    panoramic_image = np.full(
        (panoramic_height, panoramic_width, 3), 255, dtype=np.uint8
    )

    # create and resize the query image with a blue border
    query_image_null = np.full((panoramic_height, img_width, 3), 255, dtype=np.uint8)
    query_image = Image.open(query_image_path).convert("RGB")
    query_array = np.array(query_image)[:, :, ::-1]
    resized_image = cv2.resize(query_array, (img_width, img_height))

    border_size = 10
    blue = (255, 0, 0)  # blue color in BGR
    bordered_query_image = cv2.copyMakeBorder(
        resized_image,
        border_size,
        border_size,
        border_size,
        border_size,
        cv2.BORDER_CONSTANT,
        value=blue,
    )

    query_image_null[img_height * 2 : img_height * 3, 0:img_width] = cv2.resize(
        bordered_query_image, (img_width, img_height)
    )

    # add text "query" below the query image
    text = "query"
    font_scale = 1
    font_thickness = 2
    text_org = (10, img_height * 3 + 30)
    cv2.putText(
        query_image_null,
        text,
        text_org,
        cv2.FONT_HERSHEY_SIMPLEX,
        font_scale,
        blue,
        font_thickness,
        cv2.LINE_AA,
    )

    # combine the rest of the images into the panoramic view
    retrieved_imgs = [
        np.array(Image.open(img).convert("RGB"))[:, :, ::-1] for img in retrieved_images
    ]
    for i, image in enumerate(retrieved_imgs):
        image = cv2.resize(image, (img_width - 4, img_height - 4))
        row = i // row_count
        col = i % row_count
        start_row = row * img_height
        start_col = col * img_width

        border_size = 2
        bordered_image = cv2.copyMakeBorder(
            image,
            border_size,
            border_size,
            border_size,
            border_size,
            cv2.BORDER_CONSTANT,
            value=(0, 0, 0),
        )
        panoramic_image[
            start_row : start_row + img_height, start_col : start_col + img_width
        ] = bordered_image

        # add red index numbers to each image
        text = str(i)
        org = (start_col + 50, start_row + 30)
        (font_width, font_height), baseline = cv2.getTextSize(
            text, cv2.FONT_HERSHEY_SIMPLEX, 1, 2
        )
        top_left = (org[0] - 48, start_row + 2)
        bottom_right = (org[0] - 48 + font_width + 5, org[1] + baseline + 5)

        cv2.rectangle(
            panoramic_image, top_left, bottom_right, (255, 255, 255), cv2.FILLED
        )
        cv2.putText(
            panoramic_image,
            text,
            (start_col + 10, start_row + 30),
            cv2.FONT_HERSHEY_SIMPLEX,
            1,
            (0, 0, 255),
            2,
            cv2.LINE_AA,
        )

    # combine the query image with the panoramic view
    panoramic_image = np.hstack([query_image_null, panoramic_image])
    return panoramic_image
2. Combine the query image and the retrieved images with their indices into a single panoramic view.
from PIL import Image

combined_image_path = os.path.join(data_dir, "combined_image.jpg")
panoramic_image = create_panoramic_view(query_image, retrieved_images)
cv2.imwrite(combined_image_path, panoramic_image)

combined_image = Image.open(combined_image_path)
show_combined_image = combined_image.resize((300, 300))
show_combined_image.show()
3. Rerank the results and generate an explanation.
We will send the combined image to the multimodal LLM service, together with appropriate prompts, so that the retrieved results come back reranked with an explanation. Note: to use GPT-4o as the LLM, you need to prepare your OpenAI API key in advance.
import requests
import base64

openai_api_key = "sk-***"  # Change to your OpenAI API Key


def generate_ranking_explanation(
    combined_image_path: str, caption: str, infos: dict = None
) -> tuple[list[int], str]:
    with open(combined_image_path, "rb") as image_file:
        base64_image = base64.b64encode(image_file.read()).decode("utf-8")

    information = (
        "You are responsible for ranking results for a Composed Image Retrieval. "
        "The user retrieves an image with an 'instruction' indicating their retrieval intent. "
        "For example, if the user queries a red car with the instruction 'change this car to blue,' a similar type of car in blue would be ranked higher in the results. "
        "Now you would receive instruction and query image with blue border. Every item has its red index number in its top left. Do not misunderstand it. "
        f"User instruction: {caption} \n\n"
    )

    # add additional information for each image
    if infos:
        for i, info in enumerate(infos["product"]):
            information += f"{i}. {info}\n"

    information += (
        "Provide a new ranked list of indices from most suitable to least suitable, followed by an explanation for the top 1 most suitable item only. "
        "The format of the response has to be 'Ranked list: []' with the indices in brackets as integers, followed by 'Reasons:' plus the explanation why this most fit user's query intent."
    )

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai_api_key}",
    }
    payload = {
        "model": "gpt-4o",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": information},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                    },
                ],
            }
        ],
        "max_tokens": 300,
    }

    response = requests.post(
        "https://api.openai.com/v1/chat/completions", headers=headers, json=payload
    )
    result = response.json()["choices"][0]["message"]["content"]

    # parse the ranked indices from the response
    start_idx = result.find("[")
    end_idx = result.find("]")
    ranked_indices_str = result[start_idx + 1 : end_idx].split(",")
    ranked_indices = [int(index.strip()) for index in ranked_indices_str]

    # extract explanation
    explanation = result[end_idx + 1 :].strip()

    return ranked_indices, explanation
Get the image indices after reranking, along with the reason why the top result fits best:
ranked_indices, explanation = generate_ranking_explanation(
    combined_image_path, query_text
)
4. Display the best result with its explanation.
print(explanation)

best_index = ranked_indices[0]
best_img = Image.open(retrieved_images[best_index])
best_img = best_img.resize((150, 150))
best_img.show()
Output:
Reasons: The most suitable item for the user's query intent is index 6 because the instruction specifies a phone case with the theme of the image, which is a leopard. The phone case with index 6 has a thematic design resembling the leopard pattern, making it the closest match to the user's request for a phone case with the image theme.
The complete code is available in this notebook. To learn more about starting an online demo based on this tutorial, check out the example application.
In this blog post, we discussed how to build a multimodal RAG system with Milvus.
Multimodal RAG solutions open up a range of possibilities for AI systems that can understand and process multiple data formats with ease. Some common possibilities include better image search engines, more context-driven results, and more.