Mastering Local Llama Inference with Llama.cpp: 5 Essential Steps for a High-Performance Setup

Rifx.Online
Large Language Models, Programming, Best Practices
26 Feb, 2025

Llama.cpp

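Llama.cpp is a powerful and efficient inference framework for running LLaMA models locally on your machine. Unlike Ollama, LM Studio, and similar LLM serving solutions, Llama.cpp is designed for high-performance, low-resource inference while staying flexible across hardware architectures.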

Features

High-performance inference
Low resource consumption
Flexibility across hardware architectures

Installation

To build Llama.cpp from source, follow these steps:

Clone the repository:

git clone https://github.com/ggml-org/llama.cpp.git

Change into the directory:
```
cd llama.cpp
```
Build the project with CMake:
```
cmake -B build
cmake --build build --config Release
```

Usage

To use Llama.cpp, run the llama-cli executable with the path to a GGUF model:

llama-cli -m your_model_path

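Sample code

The following sketch loads a model through the C API declared in llama.h. The function names follow recent llama.cpp releases and may differ in older versions. Running inference additionally requires creating a context, tokenizing the prompt, and decoding, as shown in examples/simple in the repository.

#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init();
    // Load the GGUF model with default parameters.
    llama_model * model = llama_model_load_from_file("your_model_path", llama_model_default_params());
    if (model == nullptr) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }
    // Context creation, tokenization, and decoding go here (see examples/simple).
    llama_model_free(model);
    llama_backend_free();
    return 0;
}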

Conclusion

Llama.cpp offers an efficient way to run LLaMA models with minimal resource usage, which makes it a good choice for local inference tasks.

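Introduction

Have you ever wanted to run large language models (LLMs) on your own machine without relying on cloud services? Llama.cpp makes this possible. This lightweight yet powerful framework enables high-performance local inference for LLaMA models and gives you full control over execution, performance, and optimization.

In this guide, we walk through installing Llama.cpp, setting up a model, running inference, and interacting with it through Python and an HTTP API. Whether you are an AI researcher, a developer, or a hobbyist, this tutorial will help you get started with local LLMs.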

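Why Llama.cpp?

Before installing, let's compare Llama.cpp with other solutions:

Llama.cpp vs. Ollama: Ollama offers built-in model management and a user-friendly experience, whereas Llama.cpp gives you full control over model execution and hardware acceleration.
Llama.cpp vs. LM Studio: LM Studio has a graphical user interface, whereas Llama.cpp is built for the command line and scripted automation, which makes it a good fit for advanced users.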

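Key advantages of Llama.cpp:

Optimized for CPU inference, with support for GPU acceleration.
Runs on Windows, Linux, and macOS.
Allows fine-grained control over execution, including a server mode and Python integration.

Now let's set up Llama.cpp on your system.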

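Installation guide

For detailed build instructions, see the official guide: Llama.cpp build instructions. In the following sections, I explain the different prebuilt binaries you can download from the llama.cpp GitHub repository and how to install them on your machine.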

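Windows setup

Choosing the right binary

If you download a prebuilt binary from the Llama.cpp releases page [link], choose one that matches your CPU and GPU capabilities:

AVX (llama-bin-win-avx-x64.zip): for older CPUs that support AVX.
AVX2 (llama-bin-win-avx2-x64.zip): for Intel Haswell (2013) and later.
AVX-512 (llama-bin-win-avx512-x64.zip): for Intel Skylake-X and newer.
CUDA (llama-bin-win-cuda-cu11.7-x64.zip): if you are using an NVIDIA GPU.

If in doubt, start with AVX2, since most modern CPUs support it. For GPU use, make sure the CUDA version installed on your system matches the one the binary was built for.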
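The selection rule above can be sketched as a small lookup. This is an illustrative helper, not part of llama.cpp: the function name pick_windows_build and the feature-flag strings are assumptions, and the archive names are simply the ones listed above (real release assets carry a build number and may be named differently).

```python
def pick_windows_build(cpu_features, has_nvidia_gpu=False):
    """Return the archive name from the list above that best fits the machine.

    cpu_features is a set of lowercase flag names such as {"avx", "avx2"}
    (hypothetical input; how the flags are detected is left to the caller).
    """
    if has_nvidia_gpu:
        # The CUDA build must match the CUDA version installed on the system.
        return "llama-bin-win-cuda-cu11.7-x64.zip"
    # Prefer the widest instruction set the CPU supports.
    for flag, archive in [
        ("avx512", "llama-bin-win-avx512-x64.zip"),
        ("avx2", "llama-bin-win-avx2-x64.zip"),
        ("avx", "llama-bin-win-avx-x64.zip"),
    ]:
        if flag in cpu_features:
            return archive
    raise ValueError("no matching prebuilt archive for this CPU")


print(pick_windows_build({"avx", "avx2"}))
```

Checking the flags from widest to narrowest means a CPU that supports several instruction sets gets the fastest compatible build.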

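llama.cpp release artifacts.

For this tutorial, CUDA 12.4 is installed on my PC, so I downloaded llama-b4676-bin-win-cuda-cu12.4-x64.zip and cudart-llama-bin-win-cu12.4-x64.zip, extracted them, placed the binaries in a single directory, and added that directory to my PATH environment variable.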

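Linux & macOS setup

For Linux and macOS, download the appropriate binary:

Linux: llama-bin-ubuntu-x64.zip
macOS (Intel): llama-bin-macos-x64.zip
macOS (Apple Silicon M1/M2): llama-bin-macos-arm64.zip

After downloading, extract the archive and add its directory to your system PATH so the commands can be run from anywhere.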

On Linux and macOS you can also install llama.cpp with a package manager such as Homebrew:

brew install llama.cpp

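Once you have downloaded the right archive, extracted it, and added the extracted directory to your system's PATH so the executables can be run from any location, we are ready to explore what llama.cpp can do.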
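Before moving on, it can help to confirm that the executables really are reachable from PATH. The sketch below uses only the Python standard library; the helper name find_llama_tools is made up for illustration, and the two executable names are the ones used later in this tutorial.

```python
import shutil


def find_llama_tools(names=("llama-cli", "llama-server")):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for name, path in find_llama_tools().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If an entry is reported as not found, open a new terminal (so the updated PATH is picked up) and check the directory you added.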

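Understanding the GGUF, GGML, Hugging Face, and LoRA formats

What is GGUF?

GGUF (commonly expanded as GPT-Generated Unified Format) is a binary file format designed for running large language models efficiently with Llama.cpp and other frameworks. It stores the model weights and metadata in a single standardized file, which improves compatibility and allows efficient inference across different hardware architectures.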
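As a rough illustration of what "standardized" means here, the sketch below reads the fixed-size header at the start of a GGUF file. It assumes the layout used by GGUF version 2 and later as I understand the specification (a 4-byte magic "GGUF", a little-endian uint32 version, then two uint64 counts); the function name read_gguf_header is hypothetical.

```python
import struct


def read_gguf_header(path):
    """Read the fixed-size GGUF header: magic, version, tensor count, metadata count."""
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata key/value count.
    version, n_tensors, n_kv = struct.unpack("<IQQ", raw[4:24])
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}
```

Calling read_gguf_header on a downloaded model file is a quick sanity check that the download is a GGUF file and not, for example, an HTML error page saved under a .gguf name.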

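What is GGML?

GGML is the tensor library underlying Llama.cpp (the name combines the initials of its author, Georgi Gerganov, with ML), and it also gave its name to an earlier model file format for LLM inference. That format supported quantized models, which reduces memory usage. However, it has largely been replaced by GGUF, which is more extensible and carries richer metadata.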

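Converting GGML to GGUF

If you have a GGML model and need to use it with Llama.cpp, you can convert it to GGUF with the conversion script.

Example command:

python convert_llama_ggml_to_gguf.py --input model.ggml --output model.gguf

The convert_llama_ggml_to_gguf.py script is located in the root directory of the llama.cpp GitHub repository.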

Hugging Face Format

Hugging Face models are typically stored as PyTorch (.bin) or .safetensors weight files. These models can be converted to GGUF with a conversion script such as convert_hf_to_gguf.py.

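LoRA format

LoRA (Low-Rank Adaptation) is a fine-tuning technique for adapting large language models to specific tasks efficiently. A LoRA adapter stores only the low-rank weight differences produced by fine-tuning instead of a modified copy of the whole model. To use LoRA with Llama.cpp, you can either convert the adapter to GGUF with convert_lora_to_gguf.py and load it alongside the base model (for example with the --lora option), or merge the LoRA weights into the base model first and convert the merged model with convert_hf_to_gguf.py.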
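To make "stores only the weight differences" concrete, here is a toy, pure-Python sketch of merging a LoRA adapter into a base weight matrix using the standard LoRA update W' = W + (alpha / r) * B A. The matrices, the helper names matmul and merge_lora, and the tiny sizes are all made up for illustration; real adapters operate on large tensors and are merged by dedicated tooling.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the merged weight matrix."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy 2x2 base weight with a rank-1 adapter: B is 2x1, A is 1x2.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]
lora_a = [[0.5, 0.0]]
print(merge_lora(w, lora_b, lora_a, alpha=2, rank=1))
```

The adapter here holds 4 numbers (B and A) instead of a second full copy of W. At realistic sizes, where the rank is far smaller than the matrix dimensions, that gap is what makes adapters so much smaller than the model itself.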

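Downloading GGUF model files from Hugging Face

You can download GGUF model files from Hugging Face and use them with Llama.cpp. Follow these steps:

Visit the Hugging Face model page: go to Hugging Face and search for LLaMA or any other GGUF-compatible model. In this tutorial, we use a Mistral GGUF file (mistral-7b-instruct-v0.2.Q2_K.gguf in the examples below).
Download the model: navigate to the model's repository and download the GGUF version of the model. If no GGUF version is available, you may need to convert the model manually as described earlier.
Move the file: place the downloaded or converted GGUF model in your models/ directory.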
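The download and move steps can also be scripted. The sketch below builds the direct-download URL for a file in a Hugging Face model repository and saves it under models/. It assumes the file lives on the repository's main branch, and the helper names gguf_download_url and download_gguf are hypothetical; the repository and file names are taken from the examples later in this tutorial.

```python
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlretrieve


def gguf_download_url(repo_id, filename, revision="main"):
    """Build the direct-download URL for one file in a Hugging Face model repo."""
    return f"https://huggingface.co/{repo_id}/resolve/{quote(revision)}/{quote(filename)}"


def download_gguf(repo_id, filename, models_dir="models"):
    """Download the file into models_dir (several GB for a 7B model) and return its path."""
    target = Path(models_dir) / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(gguf_download_url(repo_id, filename), target)
    return target


print(gguf_download_url("TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
                        "mistral-7b-instruct-v0.2.Q2_K.gguf"))
```

Only the URL is printed here; calling download_gguf with the same two arguments would start the actual download.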

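Running the model

We can now use llama-cli, one of the executables we downloaded. Running llama-cli --help lists every flag the command accepts for running an LLM from a GGUF file.

At the end of the llama-cli help output there are two examples, one for text generation and one for chat.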
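For scripted use, a one-shot generation can be launched from Python. The sketch below only builds and prints the command line; the helper name llama_cli_command is made up, the model path is an assumption about where you stored the file, and it uses the -m (model), -p (prompt), and -n (tokens to generate) flags.

```python
import shlex


def llama_cli_command(model_path, prompt, n_predict=128):
    """Build an argument list for a one-shot text generation with llama-cli."""
    return [
        "llama-cli",
        "-m", model_path,      # path to the GGUF model
        "-p", prompt,          # prompt text
        "-n", str(n_predict),  # maximum number of tokens to generate
    ]


cmd = llama_cli_command("models/mistral-7b-instruct-v0.2.Q2_K.gguf", "How big is the sky?")
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or quotes.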

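Interacting with Llama.cpp in Python

llama-cpp-python overview

The llama-cpp-python package provides Python bindings for Llama.cpp and lets you:

Load and run LLaMA models inside Python applications.
Perform text generation tasks with GGUF models.
Customize inference parameters such as temperature, top-k, and top-p for more controllable responses.
Run models efficiently on CPU and on GPU (when built with CUDA support).
Host a model as an API server for easy integration into applications.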
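To make the three sampling parameters in the list above concrete, here is a toy, pure-Python sketch of what they do to a next-token distribution. It is not llama.cpp's sampler, whose implementation and ordering of steps may differ; the function name filter_distribution and the example logits are invented for illustration.

```python
import math


def filter_distribution(logits, temperature=0.8, top_k=40, top_p=0.95):
    """Return {token: probability} after temperature scaling, top-k, then top-p filtering."""
    # Temperature: divide logits before the softmax; lower values sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)
    # Top-k: keep only the k most likely tokens.
    ranked = ranked[:top_k]
    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


print(filter_distribution({"sky": 2.0, "sea": 1.0, "cat": 0.0, "the": -1.0},
                          temperature=1.0, top_k=2, top_p=1.0))
```

In llama-cpp-python the corresponding knobs are passed as the temperature, top_k, and top_p keyword arguments of create_chat_completion.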

Installing the required package

Install llama-cpp-python, which provides the Python bindings for Llama.cpp:

pip install llama-cpp-python

Running inference in Python

We can now load the GGUF model file downloaded above with the llama_cpp package and call its chat completion function.

from llama_cpp import Llama

# Load the GGUF model from disk.
llm = Llama(model_path="mistral-7b-instruct-v0.2.Q2_K.gguf")

# Send a single-turn chat request.
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "how big is the sky"}
    ]
)
print(response)

The response will look similar to:

{
  'id': 'chatcmpl-e8879677-7335-464a-803b-30a15d68c015', 
  'object': 'chat.completion', 
  'created': 1739218403, 
  'model': 'mistral-7b-instruct-v0.2.Q2_K.gguf', 
  'choices': [
    {
      'index': 0, 
      'message': 
        {
          'role': 'assistant', 
          'content': ' The size of the sky is not something that can be measured in a way that 
          is meaningful to us, as it is not a physical object with defined dimensions. 
          The sky is the expanse above the Earth, and it includes the atmosphere and the outer 
          space beyond. It goes on forever in all directions, as far as our current understanding 
          of the universe extends. So, we cannot assign a specific size to the sky. 
          Instead, we can describe the size of specific parts of the universe, such as the diameter 
          of a star or the distance between two galaxies.'
        }, 
        'logprobs': None, 
        'finish_reason': 'stop'
    }
  ], 
  'usage': {
    'prompt_tokens': 13, 
    'completion_tokens': 112, 
    'total_tokens': 125
    }
}
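In practice you usually want just the answer text and the token counts rather than the whole dictionary. The sketch below pulls those fields out of a response shaped like the one above; the helper name summarize_chat_response is hypothetical, and the example dictionary is a trimmed-down stand-in for the real output.

```python
def summarize_chat_response(response):
    """Extract the assistant text, finish reason, and token usage from a chat completion dict."""
    choice = response["choices"][0]
    return {
        "text": choice["message"]["content"].strip(),
        "finish_reason": choice["finish_reason"],
        "total_tokens": response["usage"]["total_tokens"],
    }


# Trimmed-down stand-in for the response shown above.
example = {
    "choices": [{"index": 0,
                 "message": {"role": "assistant", "content": " The size of the sky is ..."},
                 "finish_reason": "stop"}],
    "usage": {"prompt_tokens": 13, "completion_tokens": 112, "total_tokens": 125},
}
print(summarize_chat_response(example))
```

A finish_reason of 'stop' means the model ended its answer on its own; other values usually indicate that generation was cut off, for example by a token limit.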

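Downloading and using GGUF models with Llama.from_pretrained

The Llama.from_pretrained method downloads a GGUF model directly from Hugging Face, so you can use it without fetching the file manually. It relies on the huggingface-hub package, which must be installed separately (pip install huggingface-hub).

Example: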

from llama_cpp import Llama

# Download the file from the Hugging Face repo (if not already cached) and load it.
llm = Llama.from_pretrained(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf"
)
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "How does a black hole work?"}
    ]
)
print(response)

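This approach simplifies the workflow by downloading the required model and loading it into memory automatically, so you no longer need to place GGUF files in a directory by hand.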

The response has the same structure as the chat completion shown in the previous section, with the model's answer in choices[0]['message']['content'].

You can use the cache_dir parameter to specify the directory where models are downloaded and cached.

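Running Llama.cpp as a server

You can run llama.cpp as a server and interact with it through API calls.

Starting the server

llama-server -m mistral-7b-instruct-v0.2.Q2_K.gguf --port 8000

Once started, the server logs its startup progress in the terminal and listens on the given port (llama-server defaults to port 8080 when --port is omitted).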
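Loading a model can take a while, so a client script may want to wait until the server is ready before sending requests. The sketch below polls the server's GET /health endpoint with the standard library; the helper name wait_until_healthy, the timeout values, and the base URL you pass in are assumptions to adapt to your setup.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(base_url, timeout_s=60.0, interval_s=0.5):
    """Poll GET {base_url}/health until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet, or still loading the model
        time.sleep(interval_s)
    return False
```

A typical call is wait_until_healthy("http://localhost:8000"), using whatever host and port the server was started with; it returns False if the server never becomes ready within the timeout.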

Sending a request with Python

import requests

# Native llama-server completion endpoint; adjust the port to match your server.
url = "http://localhost:8000/completion"

payload = {
    "prompt": "How big is the sky?",
    "temperature": 0.7,
    "n_predict": 50,  # maximum number of tokens to generate
}

try:
    response = requests.post(url, json=payload, timeout=120)
    response.raise_for_status()
    # The generated text is returned in the "content" field.
    print("Response:", response.json().get("content", ""))
except requests.RequestException as e:
    print(f"Request failed: {e}")

The response will look similar to:

Response: The sky is not a tangible object and does not have physical dimensions, so it cannot be measured or quantified in the same way that we measure and quantify objects with size or dimensions. The sky is simply the vast expanse of

Sending a request from the terminal (Linux/macOS) or PowerShell (Windows)

curl -X POST "http://localhost:8000/completion" \
     -H "Content-Type: application/json" \
     -d '{"prompt": "Tell me a fun fact.", "n_predict": 50}'
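Because quoting and line-continuation rules differ between bash and PowerShell, it can be handy to send the same request from Python using only the standard library. This is a sketch under the assumption that the server listens on localhost:8000 and that the native /completion endpoint returns the generated text in a "content" field; the helper names build_completion_request and complete are made up.

```python
import json
import urllib.request


def build_completion_request(prompt, n_predict=50, base_url="http://localhost:8000"):
    """Build a POST request for the server's /completion endpoint."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text from the "content" field."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["content"]


req = build_completion_request("Tell me a fun fact.")
print(req.get_method(), req.full_url)
```

With the server running, complete("Tell me a fun fact.") sends the request and returns the generated text.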

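Conclusion

This tutorial covered installing, running, and interacting with Llama.cpp on different platforms. You can now integrate Llama models into your applications for local inference and API-based interaction.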