Combining LLMs with voice capabilities has created new opportunities for personalized customer interactions. This guide walks you through setting up a local LLM server that supports two-way voice interaction using Python, Transformers, Qwen2-Audio-7B-Instruct, and Bark.

Prerequisites

Before we begin, install the following:

- Python: version 3.9 or higher.
- PyTorch: for running the models.
- Transformers: provides access to the Qwen model.
- Accelerate: required in some environments.
- FFmpeg & pydub: for processing audio.
- FastAPI: for creating the web server.
- Uvicorn: the ASGI server used to run FastAPI.
- Bark: for text-to-speech synthesis.
- python-multipart & scipy: for handling uploads and encoding audio.

FFmpeg can be installed with `apt install ffmpeg` on Linux or `brew install ffmpeg` on macOS.

You can install the Python dependencies using pip:

```bash
pip install torch transformers accelerate pydub fastapi uvicorn bark python-multipart scipy
```

Step 1: Setting Up the Environment

First, let's set up our Python environment and pick our PyTorch device:

```python
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
```

This code checks whether a CUDA-compatible (Nvidia) GPU is available and sets the device accordingly. If no such GPU is found, PyTorch falls back to the CPU, which is much slower. On newer Apple Silicon machines, the device can also be set to `mps` to run PyTorch on Metal, but PyTorch's Metal implementation is still incomplete.

Step 2: Loading the Model

Most open-source LLMs only accept text input and produce text output. Since we want a speech-to-speech system, that would normally force us to use two additional models: one to (1) transcribe speech to text before passing it to our LLM, and one to (2) convert the LLM's output back into speech. By using a multimodal LLM like Qwen Audio, we can instead rely on a single model to process speech input into a text response, and then only need a second model to turn the LLM's output back into speech. This multimodal approach is not only more efficient in terms of processing time and (V)RAM usage, but typically also produces better results, since the input audio is fed directly to the LLM without an intermediate transcription step.

If you're running on a cloud GPU host like Runpod or Vast, you'll want to point the HuggingFace home & Bark cache at your volume storage using `export HF_HOME=/workspace/hf` & `export XDG_CACHE_HOME=/workspace/bark` before downloading the models.

```python
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
# device_map="auto" already places the model on the available device(s),
# so no additional .to(device) call is needed.
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto")
```

We've picked the small 7B variant of the Qwen Audio model series here to keep our compute requirements down. However, Qwen may have released larger, more capable audio models by the time you read this article. You can browse all Qwen models on HuggingFace to double-check that you're using their latest one. For a production deployment, you may also want to use a fast inference engine like vLLM to get much higher throughput.
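If your hardware can't hold the full-precision weights (as noted in Step 3 below, the combined setup uses roughly 24GB of (V)RAM), one option is to load a quantized version of the model. Here is a minimal sketch using Transformers' bitsandbytes integration; the 4-bit settings are illustrative and assume the optional `bitsandbytes` package is installed:

```python
import torch
from transformers import BitsAndBytesConfig, Qwen2AudioForConditionalGeneration

# Illustrative 4-bit quantization config (requires bitsandbytes).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",  # quantized models must not be moved with .to()
)
```

Expect some quality loss relative to the full-precision weights; 4-bit loading trades accuracy for memory.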
Step 3: Loading the Bark Model

Bark is a state-of-the-art open-source text-to-speech AI model that supports multiple languages and voices.

```python
from bark import SAMPLE_RATE, generate_audio, preload_models

preload_models()
```

Besides Bark, you can also use other open-source or proprietary text-to-speech models. Keep in mind that while the proprietary ones may perform better, they come at a significantly higher cost. The TTS Arena maintains an up-to-date comparison.

With both Qwen Audio 7B and Bark loaded in memory, estimated (V)RAM usage is around 24GB, so make sure your hardware supports this. If it doesn't, you can use a quantized version of the Qwen model (such as the sketch in Step 2) to save memory.

Step 4: Setting Up the FastAPI Server

We'll create a FastAPI server with two routes that handle incoming audio or text input and return audio responses.

```python
from fastapi import FastAPI, UploadFile, Form
from fastapi.responses import StreamingResponse
import uvicorn

app = FastAPI()

@app.post("/voice")
async def voice_interaction(file: UploadFile):
    # TODO
    return

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    # TODO
    return

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

The server accepts POST requests at the `/voice` & `/text` endpoints.

Step 5: Processing Audio Input

We'll use ffmpeg (via pydub) to process the incoming audio and prepare it for the Qwen model.

```python
from pydub import AudioSegment
from io import BytesIO
import numpy as np

def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
    # Resample to 16 kHz mono, the input format Qwen2-Audio expects.
    audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1)
    samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
    # Convert 16-bit PCM to float32 in [-1.0, 1.0].
    samples = samples.astype(np.float32) / 32768.0
    return samples

def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
    # pydub delegates decoding to ffmpeg, so any common format works.
    audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
    float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
    return float_array
```
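Before wiring these helpers into the server, it can be worth sanity-checking them in isolation. A quick sketch, assuming some local recording named `sample.wav` (any format ffmpeg understands will do):

```python
# Load a local file through the same path the server will use.
with open("sample.wav", "rb") as f:
    arr = load_audio_as_array(f.read())

print(arr.dtype, arr.shape)                  # expect: float32, (n_samples,)
print(f"duration: {len(arr) / 16000:.2f}s")  # samples divided by the 16 kHz target rate
```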
Step 6: Generating a Text Response with Qwen

With the audio processed, we can generate a text response using the Qwen model. This needs to handle both text and audio inputs. The processor converts our input into the model's conversation template (ChatML in Qwen's case).

```python
def generate_response(conversation):
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    # Collect the raw audio arrays referenced in the conversation.
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audio_array = load_audio_as_array(ele["audio_url"])
                    audios.append(audio_array)

    if audios:
        inputs = processor(
            text=text,
            audios=audios,
            return_tensors="pt",
            padding=True
        ).to(device)
    else:
        inputs = processor(
            text=text,
            return_tensors="pt",
            padding=True
        ).to(device)

    generate_ids = model.generate(**inputs, max_length=256)
    # Strip the prompt tokens so only the newly generated reply is decoded.
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]

    response = processor.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]
    return response
```

Feel free to play with generation parameters such as the temperature in the `model.generate` call.

Step 7: Converting Text to Speech with Bark

Finally, we'll convert the generated text response back into speech.

```python
from scipy.io.wavfile import write as write_wav

def text_to_speech(text):
    audio_array = generate_audio(text)
    # Write the waveform into an in-memory WAV buffer.
    output_buffer = BytesIO()
    write_wav(output_buffer, SAMPLE_RATE, audio_array)
    output_buffer.seek(0)
    return output_buffer
```

If you want a consistent voice across responses, `generate_audio` also accepts a `history_prompt` speaker preset (e.g. "v2/en_speaker_6").

Step 8: Tying It All Together in the API

Update the endpoints to process audio or text input, generate a response, and return the synthesized speech as a WAV file.

```python
@app.post("/voice")
async def voice_interaction(file: UploadFile):
    audio_bytes = await file.read()
    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "audio_url": audio_bytes
                }
            ]
        }
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    conversation = [
        {"role": "user", "content": [{"type": "text", "text": text}]}
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")
```

You can also add a system message to the conversations for more control over the assistant's responses.
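As a sketch of what that could look like in the `/text` endpoint (the system prompt wording is purely illustrative):

```python
conversation = [
    # A system turn steers the tone and behavior of every reply.
    {"role": "system", "content": "You are a concise, friendly voice assistant."},
    {"role": "user", "content": [{"type": "text", "text": "What can you help me with?"}]},
]
response_text = generate_response(conversation)
```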
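Similarly, if you want to experiment with the generation parameters mentioned in Step 6, `model.generate` accepts the standard Transformers sampling options; the values below are illustrative rather than tuned:

```python
generate_ids = model.generate(
    **inputs,
    max_new_tokens=256,  # cap on newly generated tokens (max_length also counts the prompt)
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.7,     # higher values give more varied replies
    top_p=0.9,           # nucleus sampling cutoff
)
```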
Step 9: Testing Things Out

We can use curl to query our server as follows:

```bash
# Audio input
curl -X POST http://localhost:8000/voice --output output.wav -F "file=@input.wav"

# Text input
curl -X POST http://localhost:8000/text --output output.wav -H "Content-Type: application/x-www-form-urlencoded" -d "text=Hey"
```

Conclusion

By following these steps, you've set up a simple local server capable of two-way voice interaction using state-of-the-art models. This setup can serve as a foundation for building more powerful voice-driven applications.

Applications

If you're exploring ways to monetize AI-powered language models, consider these potential applications:

- Chatbots (e.g., Character AI, NSFW AI Chat)
- Phone agents (e.g., Synthflow, Bland)
- Customer support automation (e.g., Zendesk, Forethought)
- Legal assistants (e.g., Harvey AI, Leya AI)

Full Code

```python
import torch
from fastapi import FastAPI, UploadFile, Form
from fastapi.responses import StreamingResponse
import uvicorn
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from pydub import AudioSegment
from io import BytesIO
import numpy as np

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model_name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
# device_map="auto" already places the model, so no .to(device) is needed here.
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto")

preload_models()

app = FastAPI()

def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
    audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1)
    samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768.0
    return samples

def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
    audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
    float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
    return float_array

def generate_response(conversation):
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audio_array = load_audio_as_array(ele["audio_url"])
                    audios.append(audio_array)

    if audios:
        inputs = processor(
            text=text,
            audios=audios,
            return_tensors="pt",
            padding=True
        ).to(device)
    else:
        inputs = processor(
            text=text,
            return_tensors="pt",
            padding=True
        ).to(device)

    generate_ids = model.generate(**inputs, max_length=256)
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]

    response = processor.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]
    return response

def text_to_speech(text):
    audio_array = generate_audio(text)
    output_buffer = BytesIO()
    write_wav(output_buffer, SAMPLE_RATE, audio_array)
    output_buffer.seek(0)
    return output_buffer

@app.post("/voice")
async def voice_interaction(file: UploadFile):
    audio_bytes = await file.read()
    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "audio_url": audio_bytes
                }
            ]
        }
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    conversation = [
        {"role": "user", "content": [{"type": "text", "text": text}]}
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
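As an alternative to the curl commands in Step 9, the same endpoints can be exercised from Python. A small client sketch using the `requests` library (the file names are placeholders):

```python
import requests

# Voice endpoint: multipart upload; the form field must be named "file".
with open("input.wav", "rb") as f:
    resp = requests.post("http://localhost:8000/voice", files={"file": f})
with open("reply.wav", "wb") as out:
    out.write(resp.content)

# Text endpoint: form-encoded "text" field.
resp = requests.post("http://localhost:8000/text", data={"text": "Hey"})
with open("reply_text.wav", "wb") as out:
    out.write(resp.content)
```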