Integrating LLMs with voice capabilities has created new opportunities in personalized customer interactions.
This guide walks you through setting up a local LLM server that supports two-way voice interaction using Python, Transformers, Qwen2-Audio-7B-Instruct, and Bark.
Before we begin, you will need the following installed:
FFmpeg can be installed via apt install ffmpeg on Linux or brew install ffmpeg on macOS.
You can install the Python dependencies using pip:

pip install torch transformers accelerate pydub fastapi uvicorn bark python-multipart scipy
First, let's set up our Python environment and select our PyTorch device:
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
This code checks whether a CUDA-compatible (Nvidia) GPU is available and sets the device accordingly. If no such GPU is available, PyTorch falls back to the CPU, which is considerably slower. On newer Apple Silicon machines, the device can also be set to mps to run PyTorch on Metal, although PyTorch's Metal backend does not yet cover every operation.
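As a minimal sketch (assuming you want to opt into MPS and accept that some operations may be unsupported there), the device selection could be extended like this:

import torch

# Prefer CUDA, then Apple's Metal (MPS) backend, and fall back to CPU otherwise.
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'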
Most open-source LLMs only work with text input and text output. However, since we want to build a speech-to-speech system, this would require two additional models: (1) one to convert speech to text before it is fed into our LLM, and (2) one to convert the LLM's output back into speech.
By using a multimodal LLM such as Qwen Audio, we can skip one of those models: it processes the speech input directly into a text response, and we then only need a second model to convert the LLM's output into speech.
This multimodal approach is not only more efficient in terms of processing time and (V)RAM usage, it also typically produces better results, since the input audio is passed straight to the LLM without an intermediate transcription step.
If you are using a cloud GPU host such as Runpod or Vast, you will want to point the HuggingFace home and Bark cache directories at your volume storage by running

export HF_HOME=/workspace/hf
export XDG_CACHE_HOME=/workspace/bark

before downloading the models.
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto").to(device)
We chose the smaller 7B variant of the Qwen Audio model series here to keep our compute requirements down. However, Qwen may well have released larger and more capable audio models by the time you read this article. You can browse all of Qwen's models on HuggingFace to double-check that you are using their latest one.
For production environments, you may want to use a fast inference engine such as vLLM for higher throughput.
Bark is a state-of-the-art open-source text-to-speech AI model that supports multiple languages as well as sound effects.
from bark import SAMPLE_RATE, generate_audio, preload_models

preload_models()
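Bark also ships optional speaker presets; as a quick sketch (the preset name below is just one of the bundled English voices, not something the rest of this guide depends on):

# Generate a short test clip with one of Bark's built-in speaker presets.
sample_audio = generate_audio(
    "Hello, this is a quick test of the voice server.",
    history_prompt="v2/en_speaker_6",  # optional voice preset
)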
Besides Bark, you can also use other open-source or proprietary text-to-speech models. Keep in mind that while the proprietary ones may perform better, they come at a significantly higher cost. The TTS Arena maintains an up-to-date comparison.
With both Qwen Audio 7B and Bark loaded into memory, (V)RAM usage is roughly 24GB, so make sure your hardware can accommodate this. Alternatively, you can use a quantized version of the Qwen model to save memory.
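As a rough sketch, a quantized load via bitsandbytes might look like the following (this assumes the bitsandbytes package is installed and a CUDA GPU is available; actual memory savings depend on your setup):

from transformers import BitsAndBytesConfig, AutoProcessor, Qwen2AudioForConditionalGeneration

quant_config = BitsAndBytesConfig(load_in_4bit=True)  # or load_in_8bit=True
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
    device_map="auto",
    quantization_config=quant_config,  # weights are quantized at load time
)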
We will build a FastAPI server with two routes that handle incoming audio or text input and return audio responses.
from fastapi import FastAPI, UploadFile, Form
from fastapi.responses import StreamingResponse
import uvicorn

app = FastAPI()

@app.post("/voice")
async def voice_interaction(file: UploadFile):
    # TODO
    return

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    # TODO
    return

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
This server accepts audio files via POST requests to the /voice endpoint and text via POST requests to the /text endpoint.
We will use ffmpeg (through pydub) to process the incoming audio and prepare it for the Qwen model.
from pydub import AudioSegment
from io import BytesIO
import numpy as np

def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
    # Resample to the target rate, downmix to mono, and scale int16 samples to [-1, 1] float32.
    audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1)
    samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768.0
    return samples

def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
    audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
    float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
    return float_array
With the audio processed, we can generate a text response using the Qwen model. This needs to handle both text and audio inputs.
The processor converts our input into the model's chat template (ChatML in Qwen's case).
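For intuition, here is a minimal sketch of what the processor does with a (hypothetical) text-only message; printing the result shows the ChatML-formatted prompt:

example_conversation = [
    {"role": "user", "content": [{"type": "text", "text": "Hello!"}]}
]
prompt = processor.apply_chat_template(
    example_conversation, add_generation_prompt=True, tokenize=False
)
print(prompt)  # ChatML string delimited by <|im_start|>/<|im_end|> markers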
def generate_response(conversation):
    # Render the conversation into the model's chat template (ChatML).
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

    # Collect any audio attachments from the conversation.
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audio_array = load_audio_as_array(ele["audio_url"])
                    audios.append(audio_array)

    if audios:
        inputs = processor(
            text=text,
            audios=audios,
            return_tensors="pt",
            padding=True
        ).to(device)
    else:
        inputs = processor(
            text=text,
            return_tensors="pt",
            padding=True
        ).to(device)

    generate_ids = model.generate(**inputs, max_length=256)
    # Strip the prompt tokens so only the newly generated tokens are decoded.
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]
    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    return response
Feel free to experiment with generation parameters such as temperature in the model.generate call.
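For example, sampling could be enabled with something like this (the specific values are arbitrary starting points, not recommendations from the model authors):

generate_ids = model.generate(
    **inputs,
    max_length=256,
    do_sample=True,    # sample instead of greedy decoding
    temperature=0.7,   # higher values give more varied responses
    top_p=0.9,         # nucleus sampling cutoff
)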
Finally, we convert the generated text response into speech.
from scipy.io.wavfile import write as write_wav

def text_to_speech(text):
    audio_array = generate_audio(text)
    output_buffer = BytesIO()
    write_wav(output_buffer, SAMPLE_RATE, audio_array)
    output_buffer.seek(0)
    return output_buffer
Now update the endpoints to process the audio or text input, generate a response, and return the synthesized speech as a WAV file.
@app.post("/voice") async def voice_interaction(file: UploadFile): audio_bytes = await file.read() conversation = [ { "role": "user", "content": [ { "type": "audio", "audio_url": audio_bytes } ] } ] response_text = generate_response(conversation) audio_output = text_to_speech(response_text) return StreamingResponse(audio_output, media_type="audio/wav") @app.post("/text") async def text_interaction(text: str = Form(...)): conversation = [ {"role": "user", "content": [{"type": "text", "text": text}]} ] response_text = generate_response(conversation) audio_output = text_to_speech(response_text) return StreamingResponse(audio_output, media_type="audio/wav")
You can optionally add a system message to the conversation to gain more control over the assistant's responses.
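For instance, a conversation with a system message could look like this (the system prompt text here is purely illustrative):

conversation = [
    {"role": "system", "content": "You are a concise and friendly voice assistant."},
    {
        "role": "user",
        "content": [{"type": "audio", "audio_url": audio_bytes}]
    }
]
response_text = generate_response(conversation)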
We can query our server using curl as follows:
# Audio input
curl -X POST http://localhost:8000/voice --output output.wav -F "file=@input.wav"

# Text input
curl -X POST http://localhost:8000/text --output output.wav -H "Content-Type: application/x-www-form-urlencoded" -d "text=Hey"
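If you prefer Python over curl, a minimal client sketch with the requests library (assuming a local input.wav and the requests package installed) could look like this:

import requests

# Voice input: upload a WAV file and save the synthesized reply.
with open("input.wav", "rb") as f:
    resp = requests.post("http://localhost:8000/voice", files={"file": f})
with open("reply.wav", "wb") as out:
    out.write(resp.content)

# Text input: send a form field and save the synthesized reply.
resp = requests.post("http://localhost:8000/text", data={"text": "Hey"})
with open("reply_text.wav", "wb") as out:
    out.write(resp.content)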
By following these steps, you have set up a simple local server capable of two-way voice interaction using state-of-the-art models. This setup can serve as a foundation for building more complex voice-enabled applications.
If you are looking for ways to monetize AI-powered language models, consider these potential applications:
For reference, here is the complete server code:

import torch
from fastapi import FastAPI, UploadFile, Form
from fastapi.responses import StreamingResponse
import uvicorn
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from pydub import AudioSegment
from io import BytesIO
import numpy as np

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model_name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto").to(device)

preload_models()

app = FastAPI()

def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
    audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1)
    samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768.0
    return samples

def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
    audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
    float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
    return float_array

def generate_response(conversation):
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audio_array = load_audio_as_array(ele["audio_url"])
                    audios.append(audio_array)
    if audios:
        inputs = processor(
            text=text,
            audios=audios,
            return_tensors="pt",
            padding=True
        ).to(device)
    else:
        inputs = processor(
            text=text,
            return_tensors="pt",
            padding=True
        ).to(device)
    generate_ids = model.generate(**inputs, max_length=256)
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]
    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    return response

def text_to_speech(text):
    audio_array = generate_audio(text)
    output_buffer = BytesIO()
    write_wav(output_buffer, SAMPLE_RATE, audio_array)
    output_buffer.seek(0)
    return output_buffer

@app.post("/voice")
async def voice_interaction(file: UploadFile):
    audio_bytes = await file.read()
    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "audio_url": audio_bytes
                }
            ]
        }
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    conversation = [
        {"role": "user", "content": [{"type": "text", "text": text}]}
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)