The integration of LLMs with voice capabilities has created new opportunities for personalized customer interactions. This guide walks you through setting up a local LLM server that supports two-way voice interaction using Python, Transformers, Qwen2-Audio-7B-Instruct, and Bark.

## Prerequisites

Before we begin, you will need the following installed:

- Python: version 3.9 or higher.
- PyTorch: for running the models.
- Transformers: provides access to the Qwen model.
- Accelerate: required in some environments.
- FFmpeg & pydub: for audio processing.
- FastAPI: for building the web server.
- Uvicorn: ASGI server to run FastAPI.
- Bark: for text-to-speech synthesis.
- Multipart & Scipy: for handling audio.

FFmpeg can be installed via `apt install ffmpeg` on Linux or `brew install ffmpeg` on macOS.

You can install the Python dependencies with pip:

```bash
pip install torch transformers accelerate pydub fastapi uvicorn bark python-multipart scipy
```

## Step 1: Setting Up the Environment

First, let's initialize our Python environment and choose our PyTorch device:

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
```

This code checks whether a CUDA-compatible (Nvidia) GPU is available and configures the device accordingly. If no such GPU is found, PyTorch falls back to the CPU, which is considerably slower. On newer Apple Silicon machines, the device can also be set to `mps` to run PyTorch on Metal, although PyTorch's Metal implementation is not exhaustive.

## Step 2: Loading the Model

Most open-source LLMs only accept text input and produce text output. Since we want to build a speech-in, speech-out system, a text-only LLM would require two additional models: (1) one to convert speech to text before feeding it to our LLM, and (2) one to convert the LLM's output back into speech. By using a multimodal LLM such as Qwen Audio, we can instead process the spoken input into a text response with a single model, and then only need a second model to turn the LLM's output back into speech. This multimodal approach is not only more efficient in terms of processing time and (V)RAM usage, it also typically produces better results, since the input audio is passed directly to the LLM without any friction.

If you are using a cloud GPU host such as Runpod or Vast, you will want to point the HuggingFace home and Bark cache directories to your volume storage by running `export HF_HOME=/workspace/hf` and `export XDG_CACHE_HOME=/workspace/bark` before downloading the models.

```python
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto").to(device)
```

We chose the smaller 7B variant of the Qwen Audio model series here to keep our computational requirements down. However, Qwen may well have released larger and more capable audio models by the time you read this article. You can browse all Qwen models on HuggingFace to double-check that you are using their latest model.

For production environments, you may want to use a fast inference engine such as vLLM for higher throughput.
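Before wiring Qwen into the server, it can be worth sanity-checking that the model and processor loaded correctly. The following is a minimal sketch (the conversation text is just an illustrative placeholder) that renders a text-only conversation through the model's chat template:

```python
# Quick smoke test: build a text-only conversation and confirm the
# processor can render it into the ChatML-style prompt Qwen2-Audio expects.
conversation = [
    {"role": "user", "content": [{"type": "text", "text": "Say hello in one sentence."}]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
print(prompt)
```

If this prints a formatted prompt without errors, the processor is ready for the audio-aware pipeline we build below.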
## Step 3: Loading the Bark Model

Bark is a state-of-the-art open-source text-to-speech AI model that supports multiple languages as well as sound effects.

```python
from bark import SAMPLE_RATE, generate_audio, preload_models

preload_models()
```

Besides Bark, you can also use other open-source or proprietary text-to-speech models. Note that while the proprietary ones may be more performant, they come at a significantly higher cost. The TTS Arena maintains an up-to-date comparison of the state of the art.

With both Qwen Audio 7B and Bark loaded into memory, combined (V)RAM usage is roughly 24GB, so make sure your hardware supports this. Otherwise, you can use a quantized version of the Qwen model to save memory.

## Step 4: Setting Up the FastAPI Server

We will create a FastAPI server with two routes to handle incoming audio or text input and return audio responses.

```python
from fastapi import FastAPI, UploadFile, Form
from fastapi.responses import StreamingResponse
import uvicorn

app = FastAPI()

@app.post("/voice")
async def voice_interaction(file: UploadFile):
    # TODO
    return

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    # TODO
    return

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

The server accepts POST requests on the /voice and /text endpoints.

## Step 5: Processing the Audio Input

We will use FFmpeg (via pydub) to process the incoming audio and prepare it for the Qwen model.

```python
from pydub import AudioSegment
from io import BytesIO
import numpy as np

def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
    # Resample to 16 kHz mono and scale int16 samples to [-1, 1] floats
    audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1)
    samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768.0
    return samples

def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
    audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
    float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
    return float_array
```

## Step 6: Generating Text Responses with Qwen

With the audio processed, we can generate a text response using the Qwen model, handling both text and audio input. The processor converts our input into the model's chat template (ChatML in Qwen's case).

```python
def generate_response(conversation):
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audio_array = load_audio_as_array(ele["audio_url"])
                    audios.append(audio_array)

    if audios:
        inputs = processor(
            text=text,
            audios=audios,
            return_tensors="pt",
            padding=True
        ).to(device)
    else:
        inputs = processor(
            text=text,
            return_tensors="pt",
            padding=True
        ).to(device)

    generate_ids = model.generate(**inputs, max_length=256)
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]

    response = processor.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]
    return response
```

Feel free to play with generation parameters such as the temperature in the model.generate call.
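For example, here is a minimal sketch of what that could look like; the specific values are illustrative rather than tuned:

```python
# Hypothetical example: enable sampling and adjust creativity when generating.
generate_ids = model.generate(
    **inputs,
    max_length=256,
    do_sample=True,    # sample from the distribution instead of greedy decoding
    temperature=0.7,   # lower values make answers more deterministic
    top_p=0.9,         # nucleus sampling cutoff
)
```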
## Step 7: Converting Text to Speech with Bark

Finally, we convert the generated text response into speech.

```python
from scipy.io.wavfile import write as write_wav

def text_to_speech(text):
    # Synthesize speech with Bark and return it as an in-memory WAV file
    audio_array = generate_audio(text)
    output_buffer = BytesIO()
    write_wav(output_buffer, SAMPLE_RATE, audio_array)
    output_buffer.seek(0)
    return output_buffer
```

## Step 8: Tying Everything Together in the API

Update the endpoints to process the audio or text input, generate a response, and return the synthesized speech as a WAV file.

```python
@app.post("/voice")
async def voice_interaction(file: UploadFile):
    audio_bytes = await file.read()
    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "audio_url": audio_bytes
                }
            ]
        }
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    conversation = [
        {"role": "user", "content": [{"type": "text", "text": text}]}
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")
```

You can optionally add a system message to the conversation for more control over the assistant's responses.

## Step 9: Testing Things Out

We can use curl to test our server as follows:

```bash
# Audio input
curl -X POST http://localhost:8000/voice --output output.wav -F "file=@input.wav"

# Text input
curl -X POST http://localhost:8000/text --output output.wav -H "Content-Type: application/x-www-form-urlencoded" -d "text=Hey"
```

## Conclusion

By following these steps, you have set up a simple local server capable of two-way voice interaction using state-of-the-art models. This setup can serve as a foundation for building more complex voice-enabled applications.

## Applications

If you are looking for ways to monetize AI-powered language models, consider these potential applications:

- Chatbots (e.g., Character AI, NSFW AI Chat)
- Phone Agents (e.g., Synthflow, Bland)
- Automated Customer Support (e.g., Zendesk, Forethought)
- Legal Assistants (e.g., Harvey AI, Leya AI)

## Full Code

```python
import torch
from fastapi import FastAPI, UploadFile, Form
from fastapi.responses import StreamingResponse
import uvicorn
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from pydub import AudioSegment
from io import BytesIO
import numpy as np

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model_name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto").to(device)

preload_models()

app = FastAPI()

def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
    audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1)
    samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768.0
    return samples

def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
    audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
    float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
    return float_array

def generate_response(conversation):
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audio_array = load_audio_as_array(ele["audio_url"])
                    audios.append(audio_array)

    if audios:
        inputs = processor(
            text=text,
            audios=audios,
            return_tensors="pt",
            padding=True
        ).to(device)
    else:
        inputs = processor(
            text=text,
            return_tensors="pt",
            padding=True
        ).to(device)

    generate_ids = model.generate(**inputs, max_length=256)
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]

    response = processor.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]
    return response

def text_to_speech(text):
    audio_array = generate_audio(text)
    output_buffer = BytesIO()
    write_wav(output_buffer, SAMPLE_RATE, audio_array)
    output_buffer.seek(0)
    return output_buffer

@app.post("/voice")
async def voice_interaction(file: UploadFile):
    audio_bytes = await file.read()
    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "audio_url": audio_bytes
                }
            ]
        }
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    conversation = [
        {"role": "user", "content": [{"type": "text", "text": text}]}
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
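If you would rather exercise the server from Python instead of curl, a minimal client sketch might look like the following. It assumes the `requests` package is installed, the server is running on localhost:8000, and an `input.wav` recording exists in the working directory:

```python
import requests

# Text input -> synthesized speech saved to output_text.wav
resp = requests.post("http://localhost:8000/text", data={"text": "Hey"})
with open("output_text.wav", "wb") as f:
    f.write(resp.content)

# Audio input -> synthesized speech saved to output_voice.wav
with open("input.wav", "rb") as audio_file:
    resp = requests.post("http://localhost:8000/voice", files={"file": audio_file})
with open("output_voice.wav", "wb") as f:
    f.write(resp.content)
```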