
Deploying Your Own AI with Two-Way Voice Chat Is Easier Than You Think!

by HeraHaven AI · 10m read · 2025/01/08


Integrating LLMs with voice capabilities has opened up new opportunities for personalized customer interactions.

This guide walks you through setting up a local LLM server that supports two-way voice interactions using Python, Transformers, Qwen2-Audio-7B-Instruct, and Bark.

Prerequisites

Before getting started, make sure you have the following installed:

  • Python : version 3.9 or higher.
  • PyTorch : for running the models.
  • Transformers : provides access to the Qwen model.
  • Accelerate : required in some environments (for example, when loading with device_map="auto").
  • FFmpeg & pydub : for audio processing.
  • FastAPI : for creating the web server.
  • Uvicorn : ASGI server for running FastAPI.
  • Bark : for text-to-speech synthesis.
  • Multipart & SciPy : for parsing uploaded form data and writing WAV audio.

FFmpeg can be installed with apt install ffmpeg on Linux or brew install ffmpeg on macOS.


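You can install the Python dependencies with pip. Note that the package named bark on PyPI is an unrelated project, so Bark is installed from Suno's GitHub repository instead:

    pip install torch transformers accelerate pydub fastapi uvicorn python-multipart scipy
    pip install git+https://github.com/suno-ai/bark.git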
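
Before moving on, it can help to confirm that everything is actually importable. The sketch below is not part of the original walkthrough: the helper name missing_dependencies is hypothetical, and the listed module names are assumed to be the import names of the packages above.

```python
import importlib.util
import shutil

# Import names assumed to correspond to the packages installed above.
REQUIRED_MODULES = ["torch", "transformers", "accelerate", "pydub",
                    "fastapi", "uvicorn", "bark", "scipy"]

def missing_dependencies(modules=REQUIRED_MODULES):
    """Return modules that cannot be found, plus 'ffmpeg' if it is not on PATH."""
    missing = [name for name in modules if importlib.util.find_spec(name) is None]
    if shutil.which("ffmpeg") is None:
        missing.append("ffmpeg")
    return missing

if __name__ == "__main__":
    problems = missing_dependencies()
    print("All set" if not problems else f"Missing: {', '.join(problems)}")
```

An empty result only shows that the modules can be located, not that the installed versions are compatible with each other.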

Step 1: Setting Up the Environment

First, let's set up our Python environment and choose our PyTorch device:


    import torch

    # Use the GPU when a CUDA-capable device is present, otherwise fall back to the CPU
    device = 'cuda' if torch.cuda.is_available() else 'cpu'


This code checks whether a CUDA-compatible (Nvidia) GPU is available and sets the device accordingly.

If no such GPU is available, PyTorch runs on the CPU instead, which is much slower.

On newer Apple Silicon devices, the device can also be set to mps to run PyTorch on Metal, but PyTorch's Metal implementation is not comprehensive.
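
The three-way choice described above can be sketched as a small helper. This is an illustration, not the article's code: pick_device is a hypothetical name, and the flags are passed in as plain booleans so the logic can be read (and run) without PyTorch installed.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's Metal backend (mps), then the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# With PyTorch installed, the flags would come from:
#   import torch
#   device = pick_device(torch.cuda.is_available(), torch.backends.mps.is_available())
print(pick_device(False, True))  # mps
```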

Step 2: Loading the Model

Most open-source LLMs only support text input and text output. Since we want to build a voice-in, voice-out system, a text-only LLM would require two additional models: (1) one to convert speech to text before it is passed to the LLM, and (2) one to convert the LLM's output back into speech.

By using a multimodal LLM like Qwen Audio, we need only one model to turn speech input into a text response, plus a second model to convert the LLM's output back into speech.

This multimodal approach is not only more efficient in terms of processing time and (V)RAM consumption, but usually also yields better results, since the input audio is sent straight to the LLM without an intermediate transcription step.

If you're running on a cloud GPU host like Runpod or Vast, you'll want to point the HuggingFace home and Bark cache directories at your volume storage by running export HF_HOME=/workspace/hf and export XDG_CACHE_HOME=/workspace/bark before downloading the models.
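
The same cache locations can also be set from inside Python. A minimal sketch, assuming the paths from the export commands above; configure_cache_dirs is a hypothetical helper, and it is assumed that both libraries read these variables when they are first imported, so the call has to come before those imports.

```python
import os

# Paths taken from the export commands above; adjust them to your volume mount.
CACHE_DIRS = {
    "HF_HOME": "/workspace/hf",
    "XDG_CACHE_HOME": "/workspace/bark",
}

def configure_cache_dirs(dirs=CACHE_DIRS, env=os.environ):
    """Set each cache variable unless the shell has already exported it."""
    for name, path in dirs.items():
        env.setdefault(name, path)
    return {name: env[name] for name in dirs}

# Run this before importing transformers or bark.
print(configure_cache_dirs())
```

Using setdefault means values already exported in the shell take precedence over the defaults in the script.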


    from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

    model_name = "Qwen/Qwen2-Audio-7B-Instruct"
    processor = AutoProcessor.from_pretrained(model_name)
    # device_map="auto" already places the weights on the available device(s),
    # so the model is not moved again with .to(device)
    model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto")


We chose the small 7B variant of the Qwen Audio model series here to reduce our compute requirements. However, by the time you read this article, Qwen may have released stronger and larger audio models. You can browse all Qwen models on HuggingFace to double-check that you are using their latest model.

For a production environment, you may want to use a fast inference engine like vLLM for much higher throughput.

Step 3: Loading the Bark Model

Bark is a state-of-the-art open-source text-to-speech AI model that supports multiple languages as well as sound effects.


    from bark import SAMPLE_RATE, generate_audio, preload_models

    # Download (on first run) and load the Bark models into memory
    preload_models()


Besides Bark, you can also use other open-source or proprietary text-to-speech models. Keep in mind that while proprietary models may perform better, they come at a much higher cost. The TTS Arena keeps an up-to-date comparison.

With both Qwen Audio 7B and Bark loaded into memory, approximate (V)RAM usage is 24 GB, so make sure your hardware supports this. Otherwise, you can use a quantized version of the Qwen model to save memory.
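
To see why quantization helps, a back-of-the-envelope estimate of weight memory is the parameter count times the bytes per parameter. The sketch below assumes a nominal 7 billion parameters (taking "7B" at face value; the audio encoder, activations, and Bark's own weights come on top), so the figures are rough and are not measurements of this setup.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NOMINAL_PARAMS = 7e9  # assumption: "7B" taken at face value

for label, bits in [("float32", 32), ("float16/bfloat16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>17}: {weight_memory_gb(NOMINAL_PARAMS, bits):.1f} GB")
```

Halving the bits per weight halves the weight memory, which is where the savings from a quantized checkpoint come from.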

Step 4: Setting Up the FastAPI Server

We'll create a FastAPI server with two routes to handle incoming audio or text inputs and return audio responses.


    from fastapi import FastAPI, UploadFile, Form
    from fastapi.responses import StreamingResponse
    import uvicorn

    app = FastAPI()

    @app.post("/voice")
    async def voice_interaction(file: UploadFile):
        # TODO
        return

    @app.post("/text")
    async def text_interaction(text: str = Form(...)):
        # TODO
        return

    if __name__ == "__main__":
        uvicorn.run(app, host="0.0.0.0", port=8000)


This server accepts POST requests on two endpoints: /voice takes an uploaded audio file, and /text takes a text form field.

Step 5: Processing Audio Input

We'll use pydub, which relies on ffmpeg, to decode the incoming audio and prepare it for the Qwen model.


    from pydub import AudioSegment
    from io import BytesIO
    import numpy as np

    def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
        # Resample to the target rate, downmix to mono, and force 16-bit samples
        # so the int16 conversion below is valid for any input bit depth
        audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1).set_sample_width(2)
        samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
        # Scale 16-bit integers to floats in [-1.0, 1.0)
        samples = samples.astype(np.float32) / 32768.0
        return samples

    def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
        # pydub hands decoding to ffmpeg, so the container format is detected automatically
        audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
        float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
        return float_array
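
The scaling step can be checked without pydub or NumPy. The sketch below is a standard-library illustration only (pcm16_to_floats is a hypothetical function and not part of the server): it writes a tiny 16-bit mono WAV into memory, reads it back, and applies the same division by 32768.

```python
import io
import struct
import wave

def pcm16_to_floats(raw: bytes) -> list:
    """Interpret little-endian 16-bit PCM bytes and scale them to [-1.0, 1.0)."""
    count = len(raw) // 2
    samples = struct.unpack(f"<{count}h", raw[: count * 2])
    return [s / 32768.0 for s in samples]

# Build a tiny 16 kHz mono WAV in memory with three known samples.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 2 bytes = 16-bit samples
    wav.setframerate(16000)
    wav.writeframes(struct.pack("<3h", -32768, 0, 16384))

buffer.seek(0)
with wave.open(buffer, "rb") as wav:
    floats = pcm16_to_floats(wav.readframes(wav.getnframes()))

print(floats)  # [-1.0, 0.0, 0.5]
```

Dividing by 32768 rather than 32767 maps the most negative sample to exactly -1.0, while the most positive sample lands just below 1.0.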

Step 6: Generating a Text Response with Qwen

With the processed audio, we can generate a text response using the Qwen model. This step needs to handle both text and audio inputs.

The processor converts our input into the model's chat template (ChatML in Qwen's case).


    def generate_response(conversation):
        # Render the conversation into the model's chat template
        text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

        # Collect the decoded waveform for every audio element in the conversation
        audios = []
        for message in conversation:
            if isinstance(message["content"], list):
                for ele in message["content"]:
                    if ele["type"] == "audio":
                        audio_array = load_audio_as_array(ele["audio_url"])
                        audios.append(audio_array)

        if audios:
            inputs = processor(
                text=text,
                audios=audios,
                return_tensors="pt",
                padding=True
            ).to(device)
        else:
            inputs = processor(
                text=text,
                return_tensors="pt",
                padding=True
            ).to(device)

        # max_new_tokens caps only the reply; max_length would also count the prompt tokens
        generate_ids = model.generate(**inputs, max_new_tokens=256)
        # Drop the prompt tokens so only the newly generated reply is decoded
        generate_ids = generate_ids[:, inputs.input_ids.size(1):]

        response = processor.batch_decode(
            generate_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )[0]

        return response


Feel free to play around with the generation parameters, such as temperature, in the model.generate function.
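
To build some intuition for what temperature does, here is a small standalone illustration of temperature-scaled softmax. It is a sketch of the general idea rather than the library's internals, and the logits are made-up scores for three candidate tokens.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

In Transformers, sampling is typically switched on by passing arguments such as do_sample=True and temperature=0.7 to model.generate; the value 0.7 is only an example.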

Step 7: Converting Text to Speech with Bark

Finally, we convert the generated text response into speech.


    from scipy.io.wavfile import write as write_wav

    def text_to_speech(text):
        # Synthesize the waveform with Bark
        audio_array = generate_audio(text)
        # Write the WAV into an in-memory buffer instead of a file on disk
        output_buffer = BytesIO()
        write_wav(output_buffer, SAMPLE_RATE, audio_array)
        output_buffer.seek(0)
        return output_buffer
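
Bark is geared toward short utterances (its documentation mentions roughly 13 seconds of audio per call), so longer replies may need to be synthesized piece by piece. A minimal sketch, assuming a naive regex-based sentence splitter; split_into_chunks and the 200-character default are hypothetical choices, not Bark settings.

```python
import re

def split_into_chunks(text: str, max_chars: int = 200) -> list:
    """Group whole sentences into chunks of at most max_chars characters where possible."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)      # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

print(split_into_chunks("Hello there. How are you? I am fine!", max_chars=20))
# ['Hello there.', 'How are you?', 'I am fine!']
```

Each chunk could then be passed to generate_audio and the resulting arrays joined with np.concatenate before the WAV is written. A single sentence longer than max_chars is kept whole in this sketch.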

Step 8: Putting It All Together in the API

Update the endpoints to process the voice or text input, generate a response, and return the synthesized speech as a WAV file.


    @app.post("/voice")
    async def voice_interaction(file: UploadFile):
        audio_bytes = await file.read()
        conversation = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "audio",
                        "audio_url": audio_bytes
                    }
                ]
            }
        ]
        response_text = generate_response(conversation)
        audio_output = text_to_speech(response_text)
        return StreamingResponse(audio_output, media_type="audio/wav")

    @app.post("/text")
    async def text_interaction(text: str = Form(...)):
        conversation = [
            {"role": "user", "content": [{"type": "text", "text": text}]}
        ]
        response_text = generate_response(conversation)
        audio_output = text_to_speech(response_text)
        return StreamingResponse(audio_output, media_type="audio/wav")

You can also choose to add a system message to the conversations to gain more control over the assistant's responses.
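
One way a system message could be prepended is sketched below. The helper build_conversation and the example prompt are hypothetical; the message layout follows the chat-style dictionaries used in the endpoints, with the system content assumed to be a plain string.

```python
def build_conversation(user_text: str, system_prompt: str = "") -> list:
    """Return a chat-style message list, optionally starting with a system message."""
    conversation = []
    if system_prompt:
        conversation.append({"role": "system", "content": system_prompt})
    conversation.append(
        {"role": "user", "content": [{"type": "text", "text": user_text}]}
    )
    return conversation

messages = build_conversation("Hey", system_prompt="You are a concise voice assistant.")
print(messages[0]["role"], messages[1]["role"])  # system user
```

Because spoken replies take time to synthesize, a system prompt that asks for short answers may also keep latency down.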

Step 9: Testing Things Out

We can use curl to ping our server as follows:


    # Audio input (input.wav is a placeholder; substitute the path to your own recording)
    curl -X POST http://localhost:8000/voice --output output.wav -F "file=@input.wav"

    # Text input
    curl -X POST http://localhost:8000/text --output output.wav -H "Content-Type: application/x-www-form-urlencoded" -d "text=Hey"
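
The text request can also be made from Python using only the standard library. This is a sketch under the assumption that the server is running locally on port 8000 as configured above; build_text_request and save_spoken_reply are hypothetical helper names.

```python
import urllib.parse
import urllib.request

def build_text_request(text: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Prepare a form-encoded POST to the /text endpoint."""
    body = urllib.parse.urlencode({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/text",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

def save_spoken_reply(text: str, out_path: str = "output.wav") -> None:
    """Send the text to the running server and write the WAV reply to disk."""
    with urllib.request.urlopen(build_text_request(text)) as response:
        with open(out_path, "wb") as f:
            f.write(response.read())

# save_spoken_reply("Hey")  # uncomment once the server is running
```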

Conclusion

By following these steps, you've set up a simple local server capable of two-way voice interactions using state-of-the-art models. This setup can serve as a foundation for building more complex voice-enabled applications.

Applications

If you're exploring ways to monetize AI-powered language models, consider the following potential applications:

Full Code

    import torch
    from fastapi import FastAPI, UploadFile, Form
    from fastapi.responses import StreamingResponse
    import uvicorn
    from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav
    from pydub import AudioSegment
    from io import BytesIO
    import numpy as np

    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    model_name = "Qwen/Qwen2-Audio-7B-Instruct"
    processor = AutoProcessor.from_pretrained(model_name)
    # device_map="auto" already places the weights, so no extra .to(device) is needed
    model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto")

    preload_models()

    app = FastAPI()

    def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
        # Resample, downmix to mono, and force 16-bit samples before converting
        audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1).set_sample_width(2)
        samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
        samples = samples.astype(np.float32) / 32768.0
        return samples

    def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
        audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
        float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
        return float_array

    def generate_response(conversation):
        text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
        audios = []
        for message in conversation:
            if isinstance(message["content"], list):
                for ele in message["content"]:
                    if ele["type"] == "audio":
                        audio_array = load_audio_as_array(ele["audio_url"])
                        audios.append(audio_array)

        if audios:
            inputs = processor(
                text=text,
                audios=audios,
                return_tensors="pt",
                padding=True
            ).to(device)
        else:
            inputs = processor(
                text=text,
                return_tensors="pt",
                padding=True
            ).to(device)

        # max_new_tokens caps only the reply; max_length would also count the prompt tokens
        generate_ids = model.generate(**inputs, max_new_tokens=256)
        generate_ids = generate_ids[:, inputs.input_ids.size(1):]

        response = processor.batch_decode(
            generate_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )[0]

        return response

    def text_to_speech(text):
        audio_array = generate_audio(text)
        output_buffer = BytesIO()
        write_wav(output_buffer, SAMPLE_RATE, audio_array)
        output_buffer.seek(0)
        return output_buffer

    @app.post("/voice")
    async def voice_interaction(file: UploadFile):
        audio_bytes = await file.read()
        conversation = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "audio",
                        "audio_url": audio_bytes
                    }
                ]
            }
        ]
        response_text = generate_response(conversation)
        audio_output = text_to_speech(response_text)
        return StreamingResponse(audio_output, media_type="audio/wav")

    @app.post("/text")
    async def text_interaction(text: str = Form(...)):
        conversation = [
            {"role": "user", "content": [{"type": "text", "text": text}]}
        ]
        response_text = generate_response(conversation)
        audio_output = text_to_speech(response_text)
        return StreamingResponse(audio_output, media_type="audio/wav")

    if __name__ == "__main__":
        uvicorn.run(app, host="0.0.0.0", port=8000)