Unlocking AI Potential: 5 Simple Steps to Run Large Language Models on a CPU with llama.cpp

Rifx.Online
Large Language Models, AI Applications, Programming
05 Mar, 2025

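Learn how to download models, run them interactively, use them from Python, and expose them as a REST API

Traditionally, running large language models (LLMs) on consumer hardware required a powerful GPU, which put AI-powered applications out of reach for many developers and researchers. With llama.cpp, an optimized C++ implementation originally written for Meta's LLaMA models, LLMs can now run efficiently on a CPU with minimal resources. This lightweight framework uses quantization and CPU-specific optimizations to deliver smooth inference on laptops, desktops, and even edge devices.

In this article, we will look at how llama.cpp makes CPU-based LLM deployment possible, what its main features are, and how to start running AI models locally without relying on expensive hardware.

I will demonstrate the steps on a Mac.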

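Supported Models

Although llama.cpp began as an optimized C++ implementation of Meta's LLaMA models, it can also run non-LLaMA models, provided they have been converted to the GGUF format (the optimized model format that llama.cpp uses). It now supports a variety of Transformer-based models, such as:

Mistral: efficient, high-performance open-weight models.
Gemma: Google's lightweight models, similar to LLaMA.
Phi-2: Microsoft's small LLM, optimized for reasoning tasks.
GPT models: some GPT-style models (such as GPT-J and GPT-NeoX) can be converted and run.
Starcoder: a model optimized for code generation and completion.
StableLM: open-weight language models from Stability AI.
OpenHermes: fine-tuned models designed for chat and instruction following.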

GGUF (the successor to the older GGML format) is a model file format designed for optimized execution in llama.cpp and similar CPU-based inference engines.
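To make the file format a little more concrete, here is a minimal sketch that reads the first few header fields of a GGUF file. It assumes the little-endian layout used by GGUF version 2 and later (a 4-byte "GGUF" magic, a uint32 version, then uint64 tensor and metadata counts); the function name read_gguf_header is my own and is not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) for a GGUF file.

    Assumption: GGUF v2+ little-endian header layout. Older or big-endian
    files would need different handling.
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != GGUF_MAGIC:
            raise ValueError(f"{path} does not look like a GGUF file (magic={magic!r})")
        raw = f.read(20)  # uint32 version + uint64 tensor count + uint64 KV count
        if len(raw) != 20:
            raise ValueError(f"{path} has a truncated GGUF header")
        version, tensor_count, kv_count = struct.unpack("<IQQ", raw)
    return version, tensor_count, kv_count
```

Calling read_gguf_header() on a downloaded model should report the same version, tensor count, and key-value count that llama.cpp prints when it loads the file.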

Installing llama.cpp

To install llama.cpp, first install cmake using brew:

$ brew install cmake

If brew is not installed on your Mac, see https://brew.sh/ for installation instructions.

Next, clone llama.cpp to your local machine and compile it:

$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp
$ mkdir build
$ cd build
$ cmake .. -DLLAMA_BUILD_SERVER=ON
$ cmake --build . --config Release
$ cd ..

After downloading and compiling llama.cpp, you should have the following directory structure:

llama.cpp
    |___build
         |___bin
              |___llama-cli
              |___llama-server
    |___models
         |___ggml-vocab-aquila.gguf
         |___ggml-vocab-baichuan.gguf
         |___...
    |___...
    |___...

I have included only the parts of the directory structure that are relevant to this article.

We will cover the llama-cli and llama-server tools in later sections.

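Downloading a Model Manually

Now let's look at how to download a model and run it with llama.cpp. For this example, we will use a GGUF model hosted on Hugging Face. To start, go to Hugging Face and search for "gguf" models.

You will see a list of matching GGUF models. For this example, let's use the Mistral-7B-Instruct-v0.2-GGUF model at https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF.

On the model's page, click the Files and versions tab and find the specific GGUF file you want to download. There are two ways to get the model:

Using a URL: Right-click the model file, for example mistral-7b-instruct-v0.2.Q4_K_M.gguf, and select "Copy link". You can then download the file from this URL with curl.
Direct download: Click the download icon next to the model file to download it directly.

Let's see how the first method works. When you copy the URL of the gguf file, you get the following link: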

https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/blob/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

Replace blob in the URL with resolve. The URL now looks like this:

https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

You can now download the gguf file from this URL with curl:

$ curl -L -o models/mistral-7b-instruct-v0.2.Q4_K_M.gguf "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf"

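The -L option tells curl to follow redirects, and the -o option specifies where to save the model. In this example, run from the llama.cpp folder, the model is downloaded to the models folder: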

llama.cpp
    |___build
         |___bin
              |___llama-cli
              |___llama-server
    |___models
         |___mistral-7b-instruct-v0.2.Q4_K_M.gguf
         |___ggml-vocab-aquila.gguf
         |___ggml-vocab-baichuan.gguf
         |___...

For the second method, move the model into the models folder after downloading it.
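If you download models often, the blob-to-resolve rewrite described above is easy to script. The following is a minimal sketch; the helper name to_download_url is hypothetical, and it assumes the page URL has the shape /<owner>/<repo>/blob/<revision>/<file> seen in this example.

```python
from urllib.parse import urlsplit, urlunsplit


def to_download_url(page_url):
    """Turn a Hugging Face file-page URL (.../blob/...) into a download URL."""
    parts = urlsplit(page_url)
    # Expected segments: ["", owner, repo, "blob", revision, file, ...]
    segments = parts.path.split("/")
    if len(segments) < 6 or segments[3] != "blob":
        raise ValueError(f"not a Hugging Face file page URL: {page_url}")
    segments[3] = "resolve"
    return urlunsplit(parts._replace(path="/".join(segments)))


page = ("https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
        "/blob/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf")
print(to_download_url(page))
```

Only the fourth path segment is rewritten, rather than calling str.replace() on the whole URL, so a repository or file name that happens to contain "blob" is left untouched.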

Running the Model Interactively

Now that the model has been downloaded, you can run it interactively with the llama-cli tool:

$ ./build/bin/llama-cli -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf

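The llama-cli tool is located in the build/bin folder. The -m option specifies the model to run.

You will now see the model loading. Once it has loaded, you can type a question at the > prompt (here, "Tell me a joke"):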

build: 4621 (6eecde3c) with Apple clang version 16.0.0 (clang-1600.0.26.4) for arm64-apple-darwin24.1.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Metal (Apple M1 Max) - 21845 MiB free
llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from models/mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096

...
...

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Using default system message. To change it, set a different value via -p PROMPT or -f FILE argument.

[INST] You are a helpful assistant

> Tell me a joke
 Of course, I'd be happy to share a joke with you! Here's one that always makes me laugh:

Why don't scientists trust atoms?

Because they make up everything!

I hope you found that amusing. Do you have any other requests or questions?

>

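Using the Model in Python

Most developers will want to use the model from Python. To do so, first install the llama-cpp-python library (the leading ! is for notebook cells; omit it in a terminal):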

!pip install llama-cpp-python

Use the Llama class and set its model_path parameter to point to the model you downloaded earlier:

from llama_cpp import Llama

llm = Llama(model_path="/Users/weimenglee/llama.cpp/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf")

You can now ask the LLM a question using the create_chat_completion() method:

llm.create_chat_completion(
 messages = [
  {
   "role": "user",
   "content": 'Tell me a joke'
  }
 ]
)

The method returns the following result:

{'id': 'chatcmpl-75b68f01-b66e-4277-b453-7becc4ff8b6d',
 'object': 'chat.completion',
 'created': 1738567004,
 'model': '/Users/weimenglee/llama.cpp/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': " Why don't scientists trust atoms?\n\nBecause they make up everything!"},
   'logprobs': None,
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 12, 'completion_tokens': 16, 'total_tokens': 28}}
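In practice you usually want only the assistant's text rather than the whole dictionary. Here is a minimal sketch of pulling it out; the helper name first_reply is my own, and the sample dictionary is a trimmed copy of the result shown above so the snippet runs without loading a model.

```python
def first_reply(completion):
    """Return the assistant text of the first choice, or "" if there is none."""
    choices = completion.get("choices", [])
    if not choices:
        return ""
    return choices[0]["message"]["content"].strip()


# Trimmed copy of the result shown above, used here instead of a live model call.
sample = {
    "object": "chat.completion",
    "choices": [{
        "index": 0,
        "message": {
            "role": "assistant",
            "content": " Why don't scientists trust atoms?\n\nBecause they make up everything!",
        },
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 12, "completion_tokens": 16, "total_tokens": 28},
}

print(first_reply(sample))
```

With a loaded model, you would pass the return value of llm.create_chat_completion(...) to first_reply() instead of sample.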

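Downloading a Model with the Llama Class

Earlier you saw how to download a model manually so that you can run it with llama.cpp. There is actually an easier way to download a model.

Again, for illustration, we will use the Mistral-7B-Instruct-v0.2-GGUF model at https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF.

On the right side of the page, you will see the "Use this model" button. Click it and select llama-cpp-python.

This shows how to use the model with llama.cpp from Python.

You can now use the following statements in Python to download the model automatically with the from_pretrained() method: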

from llama_cpp import Llama

llm = Llama.from_pretrained(
 repo_id = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
 filename = "mistral-7b-instruct-v0.2.Q2_K.gguf",
)

Note that when you download a model this way, it is saved in Hugging Face's default cache folder (~/.cache/huggingface).

You can now ask questions as usual:

llm.create_chat_completion(
 messages = [
  {
   "role": "user",
   "content": 'Tell me a joke'
  }
 ]
)
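Because from_pretrained() stores the file in the cache folder rather than in your models folder, it can be handy to check what is already there. The following is a minimal sketch that lists cached GGUF files; the helper name find_cached_gguf is hypothetical, and it assumes the default cache location mentioned above (setting the HF_HOME environment variable moves it).

```python
from pathlib import Path


def find_cached_gguf(cache_root=None):
    """Return a sorted list of *.gguf paths found under the cache folder."""
    # Assumption: default Hugging Face cache location on macOS and Linux.
    root = Path(cache_root) if cache_root else Path.home() / ".cache" / "huggingface"
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.gguf"))


for model_file in find_cached_gguf():
    size_gb = model_file.stat().st_size / 1e9
    print(f"{size_gb:6.2f} GB  {model_file}")
```

Any path it prints can be passed as model_path to the Llama class, just like a manually downloaded file.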

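Deploying the LLM as a REST API

Besides running the model interactively or from your Python code, you can also use llama.cpp to expose the model through a REST API.

To do this, use the llama-server tool in the build/bin folder of the llama.cpp folder: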

$ ./build/bin/llama-server -m models/mistral-7b-instruct-v0.2.Q4_K_M.gguf --host 0.0.0.0 --port 8080

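The command above serves the mistral-7b-instruct-v0.2.Q4_K_M.gguf model as a REST API on port 8080. Because of --host 0.0.0.0, the server listens on all network interfaces; use --host 127.0.0.1 if it should be reachable only from the local machine.

If you plan to build your UI with Chainlit, configure your REST API to use a port other than 8000 (for example, 8080), because 8000 is Chainlit's default port.

Once the server is up and running, you can open a web browser and navigate to http://localhost:8080. This launches the chat application that ships with llama.cpp.

You can now chat with the model.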
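A 7B model can take a while to load, so a script that starts the server and calls it straight away may fail. One way around this is to poll the server until it answers. The sketch below assumes llama-server exposes a /health endpoint that returns HTTP 200 once the model is ready; that endpoint is not covered in this article, so check it against your build. The helper names are my own.

```python
import time
import urllib.request


def http_ok(url, timeout_s=2.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and non-2xx HTTP errors.
        return False


def wait_until_ready(probe, attempts=30, delay_s=1.0):
    """Call `probe()` until it returns True; give up after `attempts` tries."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay_s)
    return False
```

A typical call would be wait_until_ready(lambda: http_ok("http://localhost:8080/health")). Passing the probe in as a function keeps the retry loop independent of the URL being checked.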

Using the REST API

To test the REST API, you can use the following curl command:

$ curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    "prompt": "What is the capital of France?",
    "temperature": 0.7,
    "max_tokens": 50
  }'

If the service is running correctly, you should see a response like this:

{
  "choices":[
    {
      "text":"\n\nParis\n\nParis is the capital city of France. It is one of the most famous cities in the world and is known for its art, culture, fashion, and cuisine. Paris is located in the northern part of France",
      "index":0,
      "logprobs":null,
      "finish_reason":"length"
    }
  ],
  "created":1738567963,
  "model":"mistral-7b-instruct-v0.2.Q4_K_M.gguf",
  "system_fingerprint":"b4621-6eecde3c",
  "object":"text_completion",
  "usage":{
    "completion_tokens":50,
    "prompt_tokens":8,
    "total_tokens":58
  },
  "id":"chatcmpl-qgf1kCkgqM81x4A5GEpjnz8Cg0TydmW2",
  "timings":{
    "prompt_n":8,
    "prompt_ms":120.665,
    "prompt_per_token_ms":15.083125,
    "prompt_per_second":66.29925827704803,
    "predicted_n":50,
    "predicted_ms":1141.117,
    "predicted_per_token_ms":22.82234,
    "predicted_per_second":43.81671642785096
  }
}
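The timings block is useful for a quick performance check, since the per-second figures follow directly from the token counts and elapsed milliseconds. Here is a minimal sketch that recomputes them from the values in the response above; the helper name tokens_per_second is my own.

```python
def tokens_per_second(token_count, elapsed_ms):
    """Convert a token count and an elapsed time in milliseconds to tokens/s."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return token_count / (elapsed_ms / 1000.0)


# Values copied from the "timings" block of the response above.
timings = {"prompt_n": 8, "prompt_ms": 120.665,
           "predicted_n": 50, "predicted_ms": 1141.117}

prompt_speed = tokens_per_second(timings["prompt_n"], timings["prompt_ms"])
generation_speed = tokens_per_second(timings["predicted_n"], timings["predicted_ms"])

print(f"prompt processing: {prompt_speed:.1f} tokens/s")
print(f"generation:        {generation_speed:.1f} tokens/s")
```

The results match the prompt_per_second and predicted_per_second fields reported by the server, roughly 66 and 44 tokens per second on the machine used here.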

To use the REST API from Python, you can use the requests library and communicate with the model using the POST method:

import requests

# Define the API endpoint
url = "http://localhost:8080/v1/completions"

# Define the payload (data to send in the request)
payload = {
    "model": "mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    "prompt": "What is the capital of France?",
    "temperature": 0.7,
    "max_tokens": 50
}

# Send a POST request to the API
headers = {"Content-Type": "application/json"}
try:
    response = requests.post(url, json=payload, headers=headers)

    # Check if the request was successful
    if response.status_code == 200:
        # Parse the response JSON
        response_data = response.json()

        # Extract the result from the response
        choices = response_data.get("choices", [])
        if choices:
            result = choices[0].get("text", "")
            print("Response:", result)
        else:
            print("No choices found in the response.")
    else:
        print(f"Request failed with status code {response.status_code}: {response.text}")
except Exception as e:
    print(f"Error occurred: {e}")

You should see a response like this:

Response:  Paris

Paris is the capital city of France. It is the most populous city in France, with a population of over 12 million people in its metropolitan area. Paris is known for its iconic landmarks such as the
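The examples above send a raw prompt to /v1/completions. For an instruction-tuned model such as this one, a chat-style request is often a better fit. The sketch below assumes llama-server also exposes an OpenAI-compatible /v1/chat/completions endpoint, which this article does not cover, so treat the path and response shape as assumptions to verify against your build. It uses only the standard library instead of requests, and the helper names are my own.

```python
import json
import urllib.request


def build_chat_request(question, base_url="http://localhost:8080", max_tokens=50):
    """Build a POST request carrying a single user message."""
    payload = {
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.7,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question):
    """Send the question to a running server and return the assistant text."""
    with urllib.request.urlopen(build_chat_request(question), timeout=120) as resp:
        body = json.load(resp)
    # Assumed response shape: same as the create_chat_completion() result.
    return body["choices"][0]["message"]["content"]
```

With the server from the previous section running, print(ask("What is the capital of France?")) should print the model's answer.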

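Summary

I hope this article has given you a solid introduction to llama.cpp and helped you get started quickly with running models efficiently on your CPU. By following the steps outlined here, you should now be able to download models and run them locally, either interactively or as a REST API. With this knowledge, you can use llama.cpp to deploy large language models in resource-constrained environments and build smooth, cost-effective AI-powered applications.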