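After my recent post on how to build your own RAG and run it locally, today we go a step further: we not only use the conversational abilities of a large language model but also add listening and speaking capabilities. The idea is simple: we will create a voice assistant, reminiscent of Jarvis or Friday from the iconic Iron Man movies, that runs offline on your computer.

Since this is an introductory tutorial, I will implement it in Python and keep it simple enough for beginners. At the end, I will offer some guidance on how to extend the application.

Tech Stack

First, set up a virtual Python environment. You have several options, including pyenv, virtualenv, poetry, and other tools that serve a similar purpose. I will use Poetry in this tutorial out of personal preference. Here are the key libraries you need to install:

- `rich`: for visually appealing console output.
- `openai-whisper`: a robust tool for speech-to-text conversion.
- `suno-bark`: a cutting-edge text-to-speech synthesis library that produces high-quality audio output.
- `langchain`: a straightforward library for interfacing with large language models (LLMs).
- `sounddevice`, `pyaudio`, and `speechrecognition`: essential for audio recording and playback.

For a detailed list of dependencies, refer to the project repository linked at the end of this post.

The most critical component here is the large language model (LLM) backend, for which we will use Ollama. Ollama is a popular tool for running and serving LLMs offline. If you are not familiar with Ollama, I recommend my previous article on offline RAG: "Build Your Own RAG and Run It Locally: Langchain + Ollama + Streamlit". Basically, you just need to download the Ollama application, pull your preferred model, and run it.

Architecture

If everything is set up, let's move on to the next step. Below is the overall architecture of our application, which consists of 3 main components:

- Speech recognition: using OpenAI's Whisper, we convert spoken language into text. Whisper's training on diverse datasets makes it proficient across many languages and dialects.
- Conversational chain: for the conversational capabilities, we use the Langchain interface to the Llama-2 model, which is served by Ollama. This setup provides a smooth and engaging conversational flow.
- Speech synthesizer: text-to-speech conversion is handled by Bark, a state-of-the-art model from Suno AI known for its lifelike speech generation.

The workflow is simple: record speech, transcribe it to text, generate a response with the LLM, and then speak the response with Bark.

Figure: Sequence diagram for the voice assistant with Whisper, Ollama, and Bark.

Implementation

The implementation begins with a `TextToSpeechService` based on Bark, which combines a method for synthesizing speech from text with a method for handling longer text inputs, as follows:

```python
import nltk
import torch
import warnings
import numpy as np
from transformers import AutoProcessor, BarkModel

warnings.filterwarnings(
    "ignore",
    message="torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.",
)


class TextToSpeechService:
    def __init__(self, device: str = "cuda" if torch.cuda.is_available() else "cpu"):
        """
        Initializes the TextToSpeechService class.

        Args:
            device (str, optional): The device to be used for the model, either "cuda" if a GPU is available or "cpu".
            Defaults to "cuda" if available, otherwise "cpu".
        """
        self.device = device
        self.processor = AutoProcessor.from_pretrained("suno/bark-small")
        self.model = BarkModel.from_pretrained("suno/bark-small")
        self.model.to(self.device)

    def synthesize(self, text: str, voice_preset: str = "v2/en_speaker_1"):
        """
        Synthesizes audio from the given text using the specified voice preset.

        Args:
            text (str): The input text to be synthesized.
            voice_preset (str, optional): The voice preset to be used for the synthesis. Defaults to "v2/en_speaker_1".

        Returns:
            tuple: A tuple containing the sample rate and the generated audio array.
        """
        inputs = self.processor(text, voice_preset=voice_preset, return_tensors="pt")
        # Move every input tensor to the same device as the model.
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        with torch.no_grad():
            audio_array = self.model.generate(**inputs, pad_token_id=10000)

        audio_array = audio_array.cpu().numpy().squeeze()
        sample_rate = self.model.generation_config.sample_rate
        return sample_rate, audio_array

    def long_form_synthesize(self, text: str, voice_preset: str = "v2/en_speaker_1"):
        """
        Synthesizes audio from the given long-form text using the specified voice preset.

        Args:
            text (str): The input text to be synthesized.
            voice_preset (str, optional): The voice preset to be used for the synthesis. Defaults to "v2/en_speaker_1".

        Returns:
            tuple: A tuple containing the sample rate and the generated audio array.
        """
        pieces = []
        sentences = nltk.sent_tokenize(text)
        # 0.25 seconds of silence, inserted after each sentence.
        silence = np.zeros(int(0.25 * self.model.generation_config.sample_rate))

        for sent in sentences:
            sample_rate, audio_array = self.synthesize(sent, voice_preset)
            pieces += [audio_array, silence.copy()]

        return self.model.generation_config.sample_rate, np.concatenate(pieces)
```

- Initialization (`__init__`): the class takes an optional `device` parameter, which specifies the device to use for the model (`cuda` if a GPU is available, otherwise `cpu`). It loads the Bark model and the corresponding processor from the `suno/bark-small` pre-trained model. You can also use the large version by passing `suno/bark` to the model loader.
- Synthesis (`synthesize`): this method takes a `text` input and a `voice_preset` parameter, which specifies the voice used for synthesis. You can browse other `voice_preset` values in Bark's speaker library. The method uses the `processor` to prepare the input text and voice preset, then generates the audio array with `model.generate()`. The generated audio is converted to a NumPy array and returned together with the sample rate.
- Long-form synthesis (`long_form_synthesize`): this method synthesizes longer text inputs. It first splits the input text into sentences with `nltk.sent_tokenize` (which requires NLTK's Punkt tokenizer data to be installed). For each sentence, it calls `synthesize` to generate an audio array. It then concatenates the generated arrays, adding a short silence (0.25 seconds) after each sentence.

Now that the `TextToSpeechService` is in place, we need to prepare the Ollama server that serves the large language model (LLM). To do this, follow these steps:

- Pull the latest Llama-2 model: run `ollama pull llama2` to download the latest Llama-2 model from the Ollama repository.
- Start the Ollama server: if the server is not already running, start it with `ollama serve`.

Once these steps are complete, your application can use the Ollama server and the Llama-2 model to generate responses to user input.

Next, we move on to the main application logic. First, we need to initialize the following components:

- Rich console: we use the Rich library to give the user a better interactive console in the terminal.
- Whisper speech-to-text: we initialize a Whisper speech recognition model, a state-of-the-art open-source speech recognition system developed by OpenAI. We use the base English model (`base.en`) to transcribe user input.
- Bark text-to-speech: we initialize an instance of the Bark text-to-speech synthesizer implemented above.
- Conversation chain: we use the built-in `ConversationChain` from the Langchain library, which provides a template for managing the conversational flow. We configure it to use the Llama-2 language model with the Ollama backend.

```python
import time
import threading
import numpy as np
import whisper
import sounddevice as sd
from queue import Queue
from rich.console import Console
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain
from langchain.prompts import PromptTemplate
from langchain_community.llms import Ollama
from tts import TextToSpeechService  # the class defined above, saved as tts.py

console = Console()
stt = whisper.load_model("base.en")
tts = TextToSpeechService()

template = """
You are a helpful and friendly AI assistant. You are polite, respectful, and aim to provide concise responses of less than 20 words.

The conversation transcript is as follows:
{history}

And here is the user's follow-up: {input}

Your response:
"""
PROMPT = PromptTemplate(input_variables=["history", "input"], template=template)
chain = ConversationChain(
    prompt=PROMPT,
    verbose=False,
    memory=ConversationBufferMemory(ai_prefix="Assistant:"),
    llm=Ollama(),  # talks to the local Ollama server
)
```

Now, let's define the necessary functions:

- `record_audio`: this function runs in a separate thread and captures audio data from the user's microphone using `sounddevice.RawInputStream`. The callback function is called whenever new audio data is available and puts the data into `data_queue` for further processing.
- `transcribe`: this function uses the Whisper instance to transcribe the audio data from `data_queue` into text.
- `get_llm_response`: this function feeds the current conversation context to the Llama-2 language model (through the Langchain `ConversationChain`) and retrieves the generated text response.
- `play_audio`: this function takes the audio waveform generated by the Bark text-to-speech engine and plays it back to the user with a sound playback library (here, `sounddevice`).

```python
def record_audio(stop_event, data_queue):
    """
    Captures audio data from the user's microphone and adds it to a queue for further processing.

    Args:
        stop_event (threading.Event): An event that, when set, signals the function to stop recording.
        data_queue (queue.Queue): A queue to which the recorded audio data will be added.

    Returns:
        None
    """

    # The third argument is named time_info so it does not shadow the time module.
    def callback(indata, frames, time_info, status):
        if status:
            console.print(status)
        data_queue.put(bytes(indata))

    with sd.RawInputStream(
        samplerate=16000, dtype="int16", channels=1, callback=callback
    ):
        while not stop_event.is_set():
            time.sleep(0.1)


def transcribe(audio_np: np.ndarray) -> str:
    """
    Transcribes the given audio data using the Whisper speech recognition model.

    Args:
        audio_np (numpy.ndarray): The audio data to be transcribed.

    Returns:
        str: The transcribed text.
    """
    result = stt.transcribe(audio_np, fp16=False)  # Set fp16=True if using a GPU
    text = result["text"].strip()
    return text


def get_llm_response(text: str) -> str:
    """
    Generates a response to the given text using the Llama-2 language model.

    Args:
        text (str): The input text to be processed.

    Returns:
        str: The generated response.
    """
    response = chain.predict(input=text)
    # Strip the speaker prefix if the model echoes it back.
    if response.startswith("Assistant:"):
        response = response[len("Assistant:") :].strip()
    return response


def play_audio(sample_rate, audio_array):
    """
    Plays the given audio data using the sounddevice library.

    Args:
        sample_rate (int): The sample rate of the audio data.
        audio_array (numpy.ndarray): The audio data to be played.

    Returns:
        None
    """
    sd.play(audio_array, sample_rate)
    sd.wait()
```

Then, we define the main application loop. It guides the user through the conversational interaction as follows:

1. The user is prompted to press Enter to start recording their input.
2. Once the user presses Enter, the `record_audio` function is called in a separate thread to capture the user's audio input.
3. When the user presses Enter again to stop recording, the audio data is transcribed with the `transcribe` function.
4. The transcribed text is then passed to the `get_llm_response` function, which generates a response using the Llama-2 language model.
5. The generated response is printed to the console and played back to the user with the `play_audio` function.

```python
if __name__ == "__main__":
    console.print("[cyan]Assistant started! Press Ctrl+C to exit.")

    try:
        while True:
            console.input(
                "Press Enter to start recording, then press Enter again to stop."
            )

            data_queue = Queue()  # type: ignore[var-annotated]
            stop_event = threading.Event()
            recording_thread = threading.Thread(
                target=record_audio,
                args=(stop_event, data_queue),
            )
            recording_thread.start()

            input()
            stop_event.set()
            recording_thread.join()

            # Join the raw int16 chunks and scale them to float32 in [-1.0, 1.0).
            audio_data = b"".join(list(data_queue.queue))
            audio_np = (
                np.frombuffer(audio_data, dtype=np.int16).astype(np.float32) / 32768.0
            )

            if audio_np.size > 0:
                with console.status("Transcribing...", spinner="earth"):
                    text = transcribe(audio_np)
                console.print(f"[yellow]You: {text}")

                with console.status("Generating response...", spinner="earth"):
                    response = get_llm_response(text)
                    sample_rate, audio_array = tts.long_form_synthesize(response)

                console.print(f"[cyan]Assistant: {response}")
                play_audio(sample_rate, audio_array)
            else:
                console.print(
                    "[red]No audio recorded. Please ensure your microphone is working."
                )

    except KeyboardInterrupt:
        console.print("\n[red]Exiting...")

    console.print("[blue]Session ended.")
```

Result

Once everything is in place, we can run the application as shown in the video above. Because the Bark model is large, even in its smaller version, the application runs quite slowly on my MacBook, so I sped up the video slightly. It will likely run faster on a machine with CUDA support. Here are the main features of our application:

- Voice-based interaction: users can start and stop recording their voice input, and the assistant responds by playing back the generated audio.
- Conversational context: the assistant keeps the context of the conversation, which enables more coherent and relevant responses. Using the Llama-2 language model, the assistant gives concise, focused answers.

For those who want to take this application toward production readiness, the following enhancements are recommended:

- Performance optimization: use optimized versions of the models, such as whisper.cpp, llama.cpp, and bark.cpp, to improve performance, especially on lower-end computers.
- Customizable bot prompts: implement a system that lets users customize the bot's persona and prompt, so that different types of assistants can be created (for example personal, professional, or domain-specific).
- Graphical user interface (GUI): develop a user-friendly GUI to improve the overall user experience and make the application more accessible and visually appealing.
- Multimodal capabilities: extend the application to support multimodal interaction, such as generating and displaying images, charts, or other visual content in addition to voice responses.

Finally, we have completed our simple voice assistant application. The full code is available at https://github.com/vndee/local-talking-llm. This combination of speech recognition, language modeling, and text-to-speech shows how we can build something that sounds difficult but actually runs on your own computer. Enjoy coding, and don't forget to subscribe to my blog so you don't miss the latest AI and programming articles.

Also published here.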
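As one possible starting point for the customizable bot prompts enhancement listed above, the sketch below builds the prompt text from a persona description and a word limit instead of hard-coding it. The function name `build_template` and its parameters are hypothetical and are not part of the original project; the only fixed requirement is that the result keeps the `{history}` and `{input}` placeholders that the `PromptTemplate` in this tutorial expects.

```python
def build_template(persona: str, max_words: int = 20) -> str:
    """Return a prompt template string for a given assistant persona.

    Hypothetical helper: the name and parameters are illustrative only.
    The {history} and {input} placeholders are left intact so the result
    can be passed to PromptTemplate(input_variables=["history", "input"], ...).
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    # Doubled braces survive str.format() as literal single braces.
    return (
        "\n{persona} You aim to provide concise responses of less than "
        "{max_words} words.\n\n"
        "The conversation transcript is as follows:\n"
        "{{history}}\n\n"
        "And here is the user's follow-up: {{input}}\n\n"
        "Your response:\n"
    ).format(persona=persona.strip(), max_words=max_words)


if __name__ == "__main__":
    template = build_template("You are a patient cooking coach.", max_words=30)
    # The placeholders are still present and can be filled in later.
    print(template.format(history="(empty)", input="How long do I boil an egg?"))
```

The returned string can be used as the `template` argument in place of the hard-coded one, so switching personas only requires changing the arguments. Note that a persona containing literal curly braces would clash with the placeholder syntax and would need escaping first.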


