Want to Search with an Image and a Text Description? Try a Multimodal RAG

by Jiang Chen | 16m | 2024/11/27

Too Long; Didn't Read

An in-depth guide on how to build a multimodal RAG system with Milvus and how doing so opens up a range of possibilities for AI systems.

This article is an in-depth guide to building a multimodal RAG system with Milvus and to the possibilities such a system opens up for AI applications.


Being confined to a single data format is no longer good enough. As businesses rely more heavily on information to make critical decisions, they need to be able to compare data across disparate formats. Fortunately, traditional AI systems constrained to a single data type have given way to multimodal systems that can understand and process complex information.


Multimodal search and multimodal retrieval-augmented generation (RAG) systems have recently shown great advances in this field. These systems process multiple types of data, including text, images, and audio, to provide context-aware responses.


In this blog post, we'll discuss how developers can build their own multimodal RAG system using Milvus. We'll also walk you through building such a system: one that handles text and image data, performs similarity searches, and uses a language model to refine the output. So, let's get started.

What is Milvus?

A vector database is a special type of database used to store, index, and retrieve vector embeddings. Embeddings are mathematical representations of data that let you compare items not only for equivalence but also for semantic similarity. Milvus is an open-source, high-performance vector database built for scale. You can find it on GitHub with an Apache-2.0 license and more than 30K stars.


Milvus gives developers a flexible solution for managing and querying large-scale vector data. Its efficiency makes it a good choice for developers building applications on top of deep learning models, such as retrieval augmented generation (RAG), multimodal search, recommendation engines, and anomaly detection.


Milvus offers several deployment options to match developers' needs. Milvus Lite is a lightweight version that runs inside a Python application and is well suited to prototyping in a local environment. Milvus Standalone and Milvus Distributed are scalable, production-ready options.

Multimodal RAG: Expanding Beyond Text

Before building the system, it helps to understand traditional text-based RAG and how it evolved into multimodal RAG.


Retrieval Augmented Generation (RAG) is a method for retrieving contextual information from external sources and using it to get more accurate output from large language models (LLMs). Traditional RAG is a highly effective strategy for improving LLM output, but it remains limited to textual data. In many real-world applications, data extends beyond text: images, charts, and other modalities often carry critical context.


Multimodal RAG addresses this limitation by allowing different data types to be used, which gives LLMs better context.


Simply put, in a multimodal RAG system, the retrieval component searches for relevant information across different data modalities, and the generation component produces more accurate results based on the retrieved information.
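The two-stage flow described above can be sketched in a few lines of Python. Everything in this snippet is hypothetical and only for illustration: `Document`, `toy_retrieve`, and `toy_generate` are made-up names, and the keyword-overlap scoring stands in for a real vector search while the prompt builder stands in for a real LLM call.

```python
from dataclasses import dataclass


@dataclass
class Document:
    modality: str  # e.g. "text" or "image"
    content: str   # raw text, or a caption standing in for an image


def toy_retrieve(query: str, corpus: list[Document], k: int = 2) -> list[Document]:
    """Retrieval step: rank documents of any modality by naive keyword overlap."""
    query_words = set(query.lower().split())

    def overlap(doc: Document) -> int:
        return len(query_words & set(doc.content.lower().split()))

    return sorted(corpus, key=overlap, reverse=True)[:k]


def toy_generate(query: str, context: list[Document]) -> str:
    """Generation step stand-in: build the grounded prompt an LLM would receive."""
    lines = [f"[{doc.modality}] {doc.content}" for doc in context]
    return f"Question: {query}\nContext:\n" + "\n".join(lines)


corpus = [
    Document("text", "leopard print phone case review"),
    Document("image", "photo of a leopard in the wild"),
    Document("text", "stainless steel kettle manual"),
]
query = "leopard phone case"
retrieved = toy_retrieve(query, corpus)
print(toy_generate(query, retrieved))
```

In the system built later in this article, the retrieval step is a Milvus vector search over image embeddings and the generation step is a GPT-4o call.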

Understanding Vector Embeddings and Similarity Search

Vector embeddings and similarity search are the two fundamental concepts behind multimodal RAG. Let's look at each of them.

Vector Embeddings

As discussed, vector embeddings are mathematical (numerical) representations of data. Machines use this representation to capture the semantic meaning of different data types, such as text, images, and audio.


In natural language processing (NLP), document chunks are transformed into vectors, and semantically similar words are mapped to nearby points in the vector space. The same goes for images, where embeddings represent semantic features. This lets us express properties such as color, texture, and object shape in a numerical format.


The main goal of using vector embeddings is to preserve the relationships and similarities between different pieces of data.

Similarity Search

Similarity search is used to find and locate data in a dataset. In the context of vector embeddings, similarity search finds the vectors in a given dataset that are closest to the query vector.


The following methods are commonly used to measure the similarity between vectors:

  1. Euclidean Distance: Measures the straight-line distance between two points in vector space.
  2. Cosine Similarity: Measures the cosine of the angle between two vectors, focusing on their direction rather than their magnitude.
  3. Dot Product: The sum of the products of corresponding elements.


The choice of similarity measure usually depends on the application-specific data and on how the developer approaches the problem.
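As a minimal sketch, the three measures listed above can be written in plain Python. The function names and the two-dimensional sample vectors are illustrative choices, not part of the original tutorial; in practice a vector database or a numerical library computes these for you.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    """Sum of the products of corresponding elements."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two points; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; ignores their magnitudes."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [3.0, 0.0]  # points the same way as the query, but is longer
orthogonal = [0.0, 1.0]      # points in an unrelated direction

print(euclidean_distance(query, same_direction))  # 2.0
print(cosine_similarity(query, same_direction))   # 1.0
print(dot_product(query, same_direction))         # 3.0
print(cosine_similarity(query, orthogonal))       # 0.0
```

Note how `same_direction` is a perfect match under cosine similarity even though its Euclidean distance from the query is 2.0. This is why the choice of measure depends on whether magnitude carries meaning in your data.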


Running similarity search on large-scale datasets requires a great deal of computing power and resources. This is where approximate nearest neighbor (ANN) algorithms come in. ANN algorithms trade a small amount of accuracy for a significant speedup, which makes them an appropriate choice for large-scale applications.


Milvus uses advanced ANN algorithms, including HNSW and DiskANN, to perform efficient similarity searches on large vector embedding datasets, allowing developers to find relevant data points quickly. Milvus also supports other indexing algorithms, such as IVF and CAGRA, which makes it a more efficient vector search solution.
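To make the accuracy-for-speed trade-off concrete, here is a toy sketch in the spirit of an IVF-style index: vectors are grouped into buckets around centroids, and a query only scans the `nprobe` closest buckets instead of the whole dataset. This is not how Milvus implements its indexes. The fixed centroids, the random 2-D data, and all function names are assumptions made for illustration (a real IVF index learns its centroids, typically with k-means).

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_search(query, vectors, k):
    """Brute force: compare the query against every vector."""
    return sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))[:k]


def build_buckets(vectors, centroids):
    """Assign each vector id to the bucket of its nearest centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for i, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        buckets[nearest].append(i)
    return buckets


def approximate_search(query, vectors, centroids, buckets, k, nprobe):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probed = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [i for c in probed for i in buckets[c]]
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:k]


random.seed(0)
vectors = [[random.random(), random.random()] for _ in range(1000)]
centroids = [[0.25, 0.25], [0.25, 0.75], [0.75, 0.25], [0.75, 0.75]]
buckets = build_buckets(vectors, centroids)

query = [0.3, 0.3]
exact = exact_search(query, vectors, k=5)
approx = approximate_search(query, vectors, centroids, buckets, k=5, nprobe=1)
recall = len(set(exact) & set(approx)) / len(exact)
print(f"scanned {len(buckets[0])} of {len(vectors)} vectors, recall = {recall:.2f}")
```

With `nprobe=1` the search touches roughly a quarter of the data. Raising `nprobe` scans more buckets and moves the result closer to the exact answer, at the cost of more distance computations.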


Building the Multimodal RAG with Milvus

Now that we've covered the concepts, it's time to build a multimodal RAG system using Milvus. For this example, we'll use Milvus Lite (the lightweight version of Milvus, ideal for experimenting and prototyping) for vector storage and retrieval, BGE for image processing and embedding, and GPT-4o for advanced result reranking.

Prerequisites

First, you'll need a Milvus instance to store your data. You can install Milvus Lite with pip, run a local instance with Docker, or sign up for a free hosted Milvus account through Zilliz Cloud.


Second, you need an LLM for your RAG pipeline, so head over to OpenAI and get an API key. The free tier is enough to get this code working.
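The reranking code later in this tutorial assigns the key to a variable named `openai_api_key`. One way to avoid pasting the key into source files is to read it from an environment variable. The helper below is a small sketch; the function name and the `OPENAI_API_KEY` variable name are assumptions, so use whatever name your environment defines.

```python
import os


def load_openai_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, failing early if it is missing."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the pipeline."
        )
    return api_key


# Usage (after exporting the variable in your shell):
# openai_api_key = load_openai_api_key()
```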


Next, create a new directory and a Python virtual environment (or take whatever steps you normally use to manage Python).


For this tutorial, you'll also need to install the pymilvus library, which is the official Python SDK for Milvus, along with a handful of common tools.

Set Up Milvus Lite

 pip install -U pymilvus

Install Dependencies

 pip install --upgrade pymilvus openai datasets opencv-python timm einops ftfy peft tqdm
 git clone https://github.com/FlagOpen/FlagEmbedding.git
 pip install -e FlagEmbedding

Download Data

The following commands download the example data and extract it to a local folder "./images_folder", which includes:


  • Images: A subset of Amazon Reviews 2023 containing approximately 900 images from the categories "Appliance", "Cell_Phones_and_Accessories", and "Electronics".
  • An example query image: leopard.jpg


 wget https://github.com/milvus-io/bootcamp/releases/download/data/amazon_reviews_2023_subset.tar.gz
 tar -xzf amazon_reviews_2023_subset.tar.gz

Load the Embedding Model

We will use the Visualized BGE model "bge-visualized-base-en-v1.5" to generate embeddings for both images and text.


Now download the weights from HuggingFace.


 wget https://huggingface.co/BAAI/bge-visualized/resolve/main/Visualized_base_en_v1.5.pth


Then, let's build an encoder.

 import torch
 from visual_bge.modeling import Visualized_BGE


 class Encoder:
     def __init__(self, model_name: str, model_path: str):
         self.model = Visualized_BGE(model_name_bge=model_name, model_weight=model_path)
         self.model.eval()

     def encode_query(self, image_path: str, text: str) -> list[float]:
         with torch.no_grad():
             query_emb = self.model.encode(image=image_path, text=text)
         return query_emb.tolist()[0]

     def encode_image(self, image_path: str) -> list[float]:
         with torch.no_grad():
             query_emb = self.model.encode(image=image_path)
         return query_emb.tolist()[0]


 model_name = "BAAI/bge-base-en-v1.5"
 model_path = "./Visualized_base_en_v1.5.pth"  # Change to your own value if using a different model path
 encoder = Encoder(model_name, model_path)

Generate Embeddings and Load Data into Milvus

This section shows how to load the example images into our database together with their corresponding embeddings.


Generate embeddings


First, we need to create embeddings for all the images in the dataset.


Load all the images from the data directory and convert them to embeddings.


 import os
 from glob import glob

 from tqdm import tqdm

 data_dir = "./images_folder"  # Change to your own value if using a different data directory
 image_list = glob(os.path.join(data_dir, "images", "*.jpg"))  # We will only use images ending with ".jpg"

 image_dict = {}
 for image_path in tqdm(image_list, desc="Generating image embeddings: "):
     try:
         image_dict[image_path] = encoder.encode_image(image_path)
     except Exception as e:
         print(f"Failed to generate embedding for {image_path}. Skipped.")
         continue

 print("Number of encoded images:", len(image_dict))
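The search step later in this tutorial calls `milvus_client.search` on a collection named by `collection_name`, but the text here does not show how either is created. The following is a minimal sketch of that loading step, assuming the pymilvus `MilvusClient` quick-setup API with Milvus Lite. The helper names, the local file `./milvus_demo.db`, and the collection name `multimodal_rag_demo` are assumptions chosen for illustration, not values from the original text.

```python
def build_rows(image_dict: dict[str, list[float]]) -> list[dict]:
    """Turn {image_path: embedding} into one row per image for insertion."""
    return [{"image_path": path, "vector": vec} for path, vec in image_dict.items()]


def load_into_milvus(
    image_dict: dict[str, list[float]],
    uri: str = "./milvus_demo.db",               # assumed local file for Milvus Lite
    collection_name: str = "multimodal_rag_demo",  # assumed collection name
):
    """Create a collection sized to the embeddings and insert every image row."""
    # Imported lazily so build_rows can be used without pymilvus installed.
    from pymilvus import MilvusClient

    rows = build_rows(image_dict)
    dim = len(rows[0]["vector"])

    client = MilvusClient(uri=uri)
    client.create_collection(
        collection_name=collection_name,
        dimension=dim,               # the default vector field is named "vector"
        metric_type="COSINE",        # matches the metric used at search time
        auto_id=True,                # let Milvus generate primary keys
        enable_dynamic_field=True,   # allows storing "image_path" without a schema
    )
    client.insert(collection_name=collection_name, data=rows)
    return client, collection_name


# Usage with the image_dict built above:
# milvus_client, collection_name = load_into_milvus(image_dict)
```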

Perform Multimodal Search and Rerank the Results

In this section, we will first search for relevant images with a multimodal query, then use an LLM service to rerank the retrieved results and pick the best one with an explanation.


Run multimodal search


Now we're ready to run the advanced multimodal search with a query composed of an image and a text instruction.


 query_image = os.path.join(data_dir, "leopard.jpg")  # Change to your own query image path
 query_text = "phone case with this image theme"
 query_vec = encoder.encode_query(image_path=query_image, text=query_text)

 search_results = milvus_client.search(
     collection_name=collection_name,
     data=[query_vec],
     output_fields=["image_path"],
     limit=9,  # Max number of search results to return
     search_params={"metric_type": "COSINE", "params": {}},  # Search parameters
 )[0]

 retrieved_images = [hit.get("entity").get("image_path") for hit in search_results]
 print(retrieved_images)


The result is shown below:


 ['./images_folder/images/518Gj1WQ-RL._AC_.jpg', './images_folder/images/41n00AOfWhL._AC_.jpg'
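Besides the image path, each hit also carries a score. As a hedged sketch, the helper below keeps only hits above a threshold. It assumes each hit is a dict with `distance` and `entity` keys, which is the shape the `hit.get("entity")` access above relies on, and that with the COSINE metric a larger `distance` value means a closer match. The sample hits, the function name, and the 0.5 threshold are made up for illustration.

```python
def filter_hits(hits: list[dict], min_score: float) -> list[str]:
    """Return image paths of hits whose score is at least min_score, best first."""
    kept = [hit for hit in hits if hit["distance"] >= min_score]
    kept.sort(key=lambda hit: hit["distance"], reverse=True)
    return [hit["entity"]["image_path"] for hit in kept]


# Made-up hits for illustration; real ones come from milvus_client.search(...)[0].
sample_hits = [
    {"id": 1, "distance": 0.71, "entity": {"image_path": "a.jpg"}},
    {"id": 2, "distance": 0.84, "entity": {"image_path": "b.jpg"}},
    {"id": 3, "distance": 0.42, "entity": {"image_path": "c.jpg"}},
]
print(filter_hits(sample_hits, min_score=0.5))  # ['b.jpg', 'a.jpg']
```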


Rerank the results with GPT-4o


Now we'll use GPT-4o to rank the retrieved images and find the best-matching result. The LLM will also explain why it ranks the results that way.


1. Create a panoramic view.


 import cv2
 import numpy as np
 from PIL import Image  # needed by create_panoramic_view to read the images

 img_height = 300
 img_width = 300
 row_count = 3


 def create_panoramic_view(query_image_path: str, retrieved_images: list) -> np.ndarray:
     """
     creates a 3x3 panoramic view image from a list of images,
     with the query image placed in a column on the left

     args:
         query_image_path: path of the query image
         retrieved_images: list of paths of the images to be combined

     returns:
         np.ndarray: the panoramic view image
     """
     panoramic_width = img_width * row_count
     panoramic_height = img_height * row_count
     panoramic_image = np.full(
         (panoramic_height, panoramic_width, 3), 255, dtype=np.uint8
     )

     # create and resize the query image with a blue border
     query_image_null = np.full((panoramic_height, img_width, 3), 255, dtype=np.uint8)
     query_image = Image.open(query_image_path).convert("RGB")
     query_array = np.array(query_image)[:, :, ::-1]
     resized_image = cv2.resize(query_array, (img_width, img_height))

     border_size = 10
     blue = (255, 0, 0)  # blue color in BGR
     bordered_query_image = cv2.copyMakeBorder(
         resized_image,
         border_size,
         border_size,
         border_size,
         border_size,
         cv2.BORDER_CONSTANT,
         value=blue,
     )

     query_image_null[img_height * 2 : img_height * 3, 0:img_width] = cv2.resize(
         bordered_query_image, (img_width, img_height)
     )

     # add text "query" below the query image
     text = "query"
     font_scale = 1
     font_thickness = 2
     text_org = (10, img_height * 3 + 30)
     cv2.putText(
         query_image_null,
         text,
         text_org,
         cv2.FONT_HERSHEY_SIMPLEX,
         font_scale,
         blue,
         font_thickness,
         cv2.LINE_AA,
     )

     # combine the rest of the images into the panoramic view
     retrieved_imgs = [
         np.array(Image.open(img).convert("RGB"))[:, :, ::-1] for img in retrieved_images
     ]
     for i, image in enumerate(retrieved_imgs):
         image = cv2.resize(image, (img_width - 4, img_height - 4))
         row = i // row_count
         col = i % row_count
         start_row = row * img_height
         start_col = col * img_width

         border_size = 2
         bordered_image = cv2.copyMakeBorder(
             image,
             border_size,
             border_size,
             border_size,
             border_size,
             cv2.BORDER_CONSTANT,
             value=(0, 0, 0),
         )
         panoramic_image[
             start_row : start_row + img_height, start_col : start_col + img_width
         ] = bordered_image

         # add red index numbers to each image
         text = str(i)
         org = (start_col + 50, start_row + 30)
         (font_width, font_height), baseline = cv2.getTextSize(
             text, cv2.FONT_HERSHEY_SIMPLEX, 1, 2
         )

         top_left = (org[0] - 48, start_row + 2)
         bottom_right = (org[0] - 48 + font_width + 5, org[1] + baseline + 5)

         cv2.rectangle(
             panoramic_image, top_left, bottom_right, (255, 255, 255), cv2.FILLED
         )
         cv2.putText(
             panoramic_image,
             text,
             (start_col + 10, start_row + 30),
             cv2.FONT_HERSHEY_SIMPLEX,
             1,
             (0, 0, 255),
             2,
             cv2.LINE_AA,
         )

     # combine the query image with the panoramic view
     panoramic_image = np.hstack([query_image_null, panoramic_image])
     return panoramic_image
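The placement logic inside the loop above comes down to mapping a result index to a grid cell in row-major order. The small sketch below isolates that arithmetic; `tile_origin` is a hypothetical helper name, and the 3-per-row, 300-pixel values mirror `row_count`, `img_width`, and `img_height` from the code above.

```python
def tile_origin(index: int, tiles_per_row: int, tile_w: int, tile_h: int) -> tuple[int, int]:
    """Return the (top, left) pixel offset of the tile at a row-major index."""
    row, col = divmod(index, tiles_per_row)
    return row * tile_h, col * tile_w


# Nine results (limit=9 in the search) fill a 3x3 grid of 300x300 tiles.
for i in range(9):
    print(i, tile_origin(i, tiles_per_row=3, tile_w=300, tile_h=300))
```

Because the grid has exactly nine cells, the search limit and `row_count` need to stay in sync: a larger limit would index past the 900-pixel canvas.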


2. Combine the query image and the retrieved images, with their indices, into a panoramic view.


 from PIL import Image

 combined_image_path = os.path.join(data_dir, "combined_image.jpg")
 panoramic_image = create_panoramic_view(query_image, retrieved_images)
 cv2.imwrite(combined_image_path, panoramic_image)

 combined_image = Image.open(combined_image_path)
 show_combined_image = combined_image.resize((300, 300))
 show_combined_image.show()


Multimodal search results

3. Rerank the results and give an explanation


We will send the combined image to the multimodal LLM service together with a suitable prompt so that it ranks the retrieved results and explains its choice. Note: to use GPT-4o as the LLM, you need to prepare your OpenAI API key in advance.


 import base64

 import requests

 openai_api_key = "sk-***"  # Change to your OpenAI API Key


 def generate_ranking_explanation(
     combined_image_path: str, caption: str, infos: dict = None
 ) -> tuple[list[int], str]:
     with open(combined_image_path, "rb") as image_file:
         base64_image = base64.b64encode(image_file.read()).decode("utf-8")

     information = (
         "You are responsible for ranking results for a Composed Image Retrieval. "
         "The user retrieves an image with an 'instruction' indicating their retrieval intent. "
         "For example, if the user queries a red car with the instruction 'change this car to blue,' a similar type of car in blue would be ranked higher in the results. "
         "Now you would receive instruction and query image with blue border. Every item has its red index number in its top left. Do not misunderstand it. "
         f"User instruction: {caption} \n\n"
     )

     # add additional information for each image
     if infos:
         for i, info in enumerate(infos["product"]):
             information += f"{i}. {info}\n"

     information += (
         "Provide a new ranked list of indices from most suitable to least suitable, followed by an explanation for the top 1 most suitable item only. "
         "The format of the response has to be 'Ranked list: []' with the indices in brackets as integers, followed by 'Reasons:' plus the explanation why this most fit user's query intent."
     )

     headers = {
         "Content-Type": "application/json",
         "Authorization": f"Bearer {openai_api_key}",
     }

     payload = {
         "model": "gpt-4o",
         "messages": [
             {
                 "role": "user",
                 "content": [
                     {"type": "text", "text": information},
                     {
                         "type": "image_url",
                         "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                     },
                 ],
             }
         ],
         "max_tokens": 300,
     }

     response = requests.post(
         "https://api.openai.com/v1/chat/completions", headers=headers, json=payload
     )
     result = response.json()["choices"][0]["message"]["content"]

     # parse the ranked indices from the response
     start_idx = result.find("[")
     end_idx = result.find("]")
     ranked_indices_str = result[start_idx + 1 : end_idx].split(",")
     ranked_indices = [int(index.strip()) for index in ranked_indices_str]

     # extract explanation
     explanation = result[end_idx + 1 :].strip()

     return ranked_indices, explanation
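The parsing at the end of the function above assumes the model follows the requested 'Ranked list: []' format exactly. LLM replies can drift from a requested format, so a more defensive parser may be worth considering. The sketch below is one possible approach, not part of the original tutorial: `parse_ranking` is a hypothetical helper that anchors on the 'Ranked list:' label, skips tokens that are not plain integers, and drops duplicate or out-of-range indices.

```python
import re


def parse_ranking(response_text: str, num_items: int) -> tuple[list[int], str]:
    """Extract the 'Ranked list: [...]' indices and the text that follows them."""
    match = re.search(r"Ranked list:\s*\[([^\]]*)\]", response_text)
    if match is None:
        raise ValueError("No 'Ranked list: [...]' found in the response")

    ranked: list[int] = []
    for token in match.group(1).split(","):
        token = token.strip()
        if not token.isdecimal():
            continue  # skip empty or non-numeric entries
        index = int(token)
        # drop indices that do not refer to a retrieved image, and repeats
        if index < num_items and index not in ranked:
            ranked.append(index)

    explanation = response_text[match.end():].strip()
    return ranked, explanation


reply = "Ranked list: [6, 2, 6, 42, x]\nReasons: Item 6 matches the leopard theme."
print(parse_ranking(reply, num_items=9))
```

A caller would still want to check that the returned list is non-empty before indexing into the retrieved images.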


Get the image indices after ranking and the reason for the best result:


 ranked_indices, explanation = generate_ranking_explanation(
     combined_image_path, query_text
 )


4. Display the best result with its explanation


 print(explanation)

 best_index = ranked_indices[0]
 best_img = Image.open(retrieved_images[best_index])
 best_img = best_img.resize((150, 150))
 best_img.show()


Results:


 Reasons: The most suitable item for the user's query intent is index 6 because the instruction specifies a phone case with the theme of the image, which is a leopard. The phone case with index 6 has a thematic design resembling the leopard pattern, making it the closest match to the user's request for a phone case with the image theme. 



Leopard print phone case - Best Result


Check out the full code in this notebook. To learn more about how to start an online demo with this tutorial, please refer to the example application.

Conclusion

In this blog post, we discussed building a multimodal RAG system using Milvus, an open-source vector database. We covered how developers can set up Milvus, load image data, perform similarity searches, and use an LLM to rerank the retrieved results for more accurate responses.


Multimodal RAG solutions open up a range of possibilities for AI systems that can easily understand and process multiple forms of data. Common examples include improved image search engines, better context-driven results, and more.


About Author

Jiang Chen (@codingjaguar)
Jiang Chen is the Head of AI Platform and Ecosystem at Zilliz.
