The integration of LLMs with voice capabilities has created new opportunities for personalized customer interactions.
This guide walks you through setting up a local LLM server that supports two-way voice interactions using Python, Transformers, Qwen2-Audio-7B-Instruct, and Bark.
Before getting started, make sure you have the following installed:
FFmpeg, which can be installed with apt install ffmpeg on Linux or brew install ffmpeg on macOS.
You can install the Python dependencies with pip: pip install torch transformers accelerate pydub fastapi uvicorn bark python-multipart scipy
First, we set up our Python environment and pick our PyTorch device:
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
This code checks whether a CUDA-compatible (Nvidia) GPU is available and sets the device accordingly. If no such GPU is found, PyTorch falls back to the much slower CPU. On newer Apple Silicon machines, the device can also be set to mps to run PyTorch on Metal, although PyTorch's Metal implementation is not comprehensive.
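If you want to experiment with Metal, a minimal sketch of device selection that also tries mps (not part of the setup above) could look like this:

import torch

# Prefer CUDA, then Apple's Metal backend (MPS), then fall back to CPU
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'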
Most open-source LLMs only support text input and text output. However, since we want to build a voice-in, voice-out system, we would need two additional models: (1) to convert speech to text before passing it to the LLM, and (2) to convert the LLM output back into speech.
By using a multimodal LLM such as Qwen Audio, we can drop one of those models: a single model processes the speech input into a text response, and we only need a second model to turn the LLM output into speech.
This multimodal approach is not only more efficient in terms of processing time and (V)RAM consumption, it also typically produces better results, since the input audio is fed to the LLM directly without an intermediate transcription step.
If you are working on a cloud GPU host such as Runpod or Vast, you will want to point the HuggingFace home and Bark cache directories to volume storage by running export HF_HOME=/workspace/hf and export XDG_CACHE_HOME=/workspace/bark before downloading the models.
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto").to(device)
To keep our compute requirements down, we use the small 7B variant of the Qwen Audio model series. By the time you read this article, however, Qwen may have released stronger and larger audio models; you can browse all Qwen models on HuggingFace to double-check that you are using their latest one.
For a production environment, you may want to use a fast inference engine such as vLLM to achieve much higher throughput.
Bark is a state-of-the-art open-source text-to-speech AI model that supports multiple languages as well as sound effects.
from bark import SAMPLE_RATE, generate_audio, preload_models

preload_models()
Besides Bark, you can also use other open-source or proprietary text-to-speech models. Keep in mind that while the proprietary ones may be more performant, they come at a significantly higher cost. The TTS Arena maintains an up-to-date comparison.
With Qwen Audio 7B and Bark loaded into memory, approximate (V)RAM usage is 24 GB, so make sure your hardware supports this. Otherwise, you can use a quantized version of the Qwen model to save memory.
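As a rough sketch of the quantized route (assuming you also pip install bitsandbytes and are running on a CUDA GPU), loading the model in 4-bit could look like this:

from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2AudioForConditionalGeneration

# Sketch: load Qwen2-Audio with 4-bit weights to cut VRAM usage (requires bitsandbytes and a CUDA GPU)
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-Audio-7B-Instruct")
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
    device_map="auto",
    quantization_config=bnb_config,
)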
We create a FastAPI server with two routes that handle incoming audio or text input and return audio responses.
from fastapi import FastAPI, UploadFile, Form
from fastapi.responses import StreamingResponse
import uvicorn

app = FastAPI()

@app.post("/voice")
async def voice_interaction(file: UploadFile):
    # TODO
    return

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    # TODO
    return

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
This server accepts POST requests at the /voice and /text endpoints: audio files at /voice and form-encoded text at /text.
We use ffmpeg (through pydub) to process the incoming audio and prepare it for the Qwen model.
from pydub import AudioSegment
from io import BytesIO
import numpy as np

def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
    # Resample to the target rate, downmix to mono, and scale int16 samples to [-1.0, 1.0]
    audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1)
    samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768.0
    return samples

def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
    # Decode arbitrary audio bytes (any format ffmpeg understands) into a 16 kHz float32 array
    audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
    float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
    return float_array
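As a quick sanity check, you can decode a local recording and inspect the resulting array; the file name below is just an example:

# Hypothetical file name; any format ffmpeg can decode works
with open("sample.wav", "rb") as f:
    samples = load_audio_as_array(f.read())
print(samples.dtype, samples.shape, samples.min(), samples.max())  # float32, mono, values roughly within [-1.0, 1.0]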
With the audio processed, we can generate a text response using the Qwen model. This needs to handle both text and audio inputs. The preprocessor converts our input into the model's chat template (ChatML in Qwen's case).
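For illustration, here is a text-only conversation in the expected format and, roughly, the ChatML-style prompt the processor renders it into (treat the markers as indicative, since the exact string comes from the model's template and may include a default system turn):

# Illustrative only: a text-only conversation and roughly what the rendered prompt looks like
conversation = [
    {"role": "user", "content": [{"type": "text", "text": "What can you help me with?"}]}
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
# prompt is a ChatML-style string along the lines of:
# <|im_start|>user
# What can you help me with?<|im_end|>
# <|im_start|>assistant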
def generate_response(conversation):
    # Render the conversation with the model's chat template (ChatML for Qwen)
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

    # Collect any audio inputs referenced in the conversation
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audio_array = load_audio_as_array(ele["audio_url"])
                    audios.append(audio_array)

    if audios:
        inputs = processor(
            text=text,
            audios=audios,
            return_tensors="pt",
            padding=True
        ).to(device)
    else:
        inputs = processor(
            text=text,
            return_tensors="pt",
            padding=True
        ).to(device)

    generate_ids = model.generate(**inputs, max_length=256)
    # Drop the prompt tokens so only the newly generated response is decoded
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]

    response = processor.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]
    return response
Feel free to play with generation parameters such as temperature in the model.generate call.
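For example, a sampling-based call with illustrative (not tuned) values could look like this:

# Illustrative sampling settings; tweak to taste
generate_ids = model.generate(
    **inputs,
    max_length=256,
    do_sample=True,   # sample instead of greedy decoding
    temperature=0.7,  # lower = more focused, higher = more varied
    top_p=0.9,        # nucleus sampling cutoff
)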
Finally, we convert the generated text response into speech.
from scipy.io.wavfile import write as write_wav

def text_to_speech(text):
    # Synthesize speech with Bark and return it as an in-memory WAV file
    audio_array = generate_audio(text)
    output_buffer = BytesIO()
    write_wav(output_buffer, SAMPLE_RATE, audio_array)
    output_buffer.seek(0)
    return output_buffer
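Bark also ships with speaker presets if you want a consistent voice; a small sketch using one of the bundled presets:

# Sketch: use a bundled Bark speaker preset for a consistent voice
audio_array = generate_audio(text, history_prompt="v2/en_speaker_6")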
Update the endpoints to process voice or text input, generate a response, and return the synthesized speech as a WAV file.
@app.post("/voice")
async def voice_interaction(file: UploadFile):
    audio_bytes = await file.read()
    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "audio_url": audio_bytes
                }
            ]
        }
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    conversation = [
        {"role": "user", "content": [{"type": "text", "text": text}]}
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")
You can also choose to add a system message to the conversations to gain more control over the assistant's responses.
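A minimal sketch, assuming the chat template accepts a plain-string system message; the wording is just an example:

# Sketch: prepend a system message before the user turn
conversation = [
    {"role": "system", "content": "You are a concise, friendly voice assistant."},
    {"role": "user", "content": [{"type": "text", "text": text}]},
]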
To ping our server, we can use curl as follows:
# Audio input
curl -X POST http://localhost:8000/voice --output output.wav -F "file=@input.wav"

# Text input
curl -X POST http://localhost:8000/text --output output.wav -H "Content-Type: application/x-www-form-urlencoded" -d "text=Hey"
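If you prefer Python over curl, a minimal client sketch using the requests library (file names are illustrative):

import requests

# Voice input: upload a recording and save the synthesized reply
with open("input.wav", "rb") as f:
    r = requests.post("http://localhost:8000/voice", files={"file": f})
with open("output.wav", "wb") as out:
    out.write(r.content)

# Text input: send a form field and save the synthesized reply
r = requests.post("http://localhost:8000/text", data={"text": "Hey"})
with open("output_text.wav", "wb") as out:
    out.write(r.content)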
By following these steps, you have set up a simple local server capable of two-way voice interactions using state-of-the-art models. This setup can serve as a foundation for building more sophisticated voice applications.
If you are exploring ways to monetize AI-powered language models, this kind of setup lends itself to a range of potential applications.
# Complete server script, combining the snippets above
import torch
from fastapi import FastAPI, UploadFile, Form
from fastapi.responses import StreamingResponse
import uvicorn
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from pydub import AudioSegment
from io import BytesIO
import numpy as np

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model_name = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto").to(device)

preload_models()

app = FastAPI()

def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
    audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1)
    samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
    samples = samples.astype(np.float32) / 32768.0
    return samples

def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
    audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
    float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
    return float_array

def generate_response(conversation):
    text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios = []
    for message in conversation:
        if isinstance(message["content"], list):
            for ele in message["content"]:
                if ele["type"] == "audio":
                    audio_array = load_audio_as_array(ele["audio_url"])
                    audios.append(audio_array)
    if audios:
        inputs = processor(
            text=text,
            audios=audios,
            return_tensors="pt",
            padding=True
        ).to(device)
    else:
        inputs = processor(
            text=text,
            return_tensors="pt",
            padding=True
        ).to(device)
    generate_ids = model.generate(**inputs, max_length=256)
    generate_ids = generate_ids[:, inputs.input_ids.size(1):]
    response = processor.batch_decode(
        generate_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]
    return response

def text_to_speech(text):
    audio_array = generate_audio(text)
    output_buffer = BytesIO()
    write_wav(output_buffer, SAMPLE_RATE, audio_array)
    output_buffer.seek(0)
    return output_buffer

@app.post("/voice")
async def voice_interaction(file: UploadFile):
    audio_bytes = await file.read()
    conversation = [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "audio_url": audio_bytes
                }
            ]
        }
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

@app.post("/text")
async def text_interaction(text: str = Form(...)):
    conversation = [
        {"role": "user", "content": [{"type": "text", "text": text}]}
    ]
    response_text = generate_response(conversation)
    audio_output = text_to_speech(response_text)
    return StreamingResponse(audio_output, media_type="audio/wav")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)