Fine-tuning the llama3 Model with Synthetic Data in 1 Hour: A Practical Guide to Rapidly Improve Code Generation Quality

Rifx.Online
Large Language Models , AI Applications , MLOps
05 Mar, 2025

我经常与大型语言模型 (LLM) 讨论代码生成，并分享了大量经验，包括来自 60 多个 AI 项目的陷阱和最佳实践，包括缺乏模块化和错误处理。

有许多用于质量控制的工具，但有时创建一个更小的“大脑”来根据一组人类语言策略检查代码，而无需编写任何代码，会更具吸引力。目标是微调一个小模型（例如 llama3），以仔细检查由其他模型生成的任何代码（包括可能在长时间对话后变得疲倦并开始生成低质量代码的较大模型）。

备注标题承诺在一小时内完成这个项目，这是可行的。不要被这篇博文的长度所迷惑，很多事情都是可选的（比如基准测试等），但核心几乎是零代码，几乎任何玩过 Chatgpt 或 Claude 的人都可以做到。

— -

详细说明

微调一个小型且廉价的模型来审查由其他模型（无论大小）生成的代码。这是根据我们定义的一组策略完成的。我们将向模型提供一个代码片段，它将针对每个标准回复一个分数。根据模型的响应，我们可以接受或修改正在审查的代码。
微调和推理费用低于 3 美元。 确切的定价可以在 fireworks.ai 上找到。或者，您可以使用任何其他框架，如 Hugging Face 及其自动训练功能。但是，我个人喜欢 Fireworks，因为它易于训练和部署，而且我发现推理速度很快。当然，您可以选择适合您用例的任何内容。

一个示例响应应如下所示（以便能够被另一个工具（如 CI/CDI 管道）处理）。

`{
  "modularity_score": "7",
  "modularity_description": "The code is well-organized into independent sections, such as the HTML structure, CSS styles, and content.....",
  "error_handling_score": "5",
  "error_handling_description": "The code does not include any error handling mechanisms. ...",
  "logging_score": "6",
  "logging_description": "The code does not include any logging statements. While it's effectively.",
  "explanation": "The code is well-structured and follows standard HTML and CSS practices. However, it lacks modularity in the CS.... code."
}

— -

我们需要什么

kiln frameowkr 用于生成合成数据和微调
一个 Fireworks API 密钥，用于 AI 帐户以微调模型并托管微调后的模型
（可选）OpenAI 或 Open Router 的 API 密钥（稍后查看）

— -

它如何运作的简要说明

我们通过定义一组代码审查要求（例如，模块化、错误处理等）来创建合成数据。
这些要求作为生成合成数据提示的基础。
这些提示由一个强大的模型（如 o4 或 Claude sonnet）完成。
然后，此数据用于微调一个更小、更快、更便宜的模型，例如 llama3。
然后，我们可以使用这个较小的模型在 VSCode、Claude AI 或任何我们喜欢的地方进行代码审查，例如使用 MCP。

— -

1. 配置 LLM 提供商

我们将使用两个不同的提供商：Fireworks 用于微调和托管微调后的模型，OpenAI 用于生成合成数据。或者，您也可以使用 Groq 通过一个强大的模型（如 llama 70B）来生成合成数据，这可能比 OpenAI 快得多。

http://localhost:8757/settings/providers

设置提供商后，创建一个新项目和任务：

创建新任务

功能需求

模块化
适当的错误处理
一致的日志记录

输出模式

{
  "type": "object",
  "properties": {"modularity_score": {
  "title": "modularity_score",
  "type": "string",
  "description": "Integer value from 1-10 that quantifies code modularity. 1=no separation of concerns; 10=perfectly modular with independent components."
},
"modularity_description": {
  "title": "modularity_description",
  "type": "string",
  "description": "Detailed explanation of modularity issues found in the code, highlighting specific areas for improvement."
},
"error_handling_score": {
  "title": "error_handling_score",
  "type": "string",
  "description": "Integer value from 1-10 that rates error handling implementation. 1=no error handling; 10=comprehensive handling of all edge cases."
},
"error_handling_description": {
  "title": "error_handling_description",
  "type": "string",
  "description": "Detailed explanation of error handling issues found in the code, with examples of missing or inadequate error handling."
},
"logging_score": {
  "title": "logging_score",
  "type": "string",
  "description": "Integer value from 1-10 that measures logging quality. 1=no logging; 10=complete context-aware logging with appropriate levels."
},
"logging_description": {
  "title": "logging_description",
  "type": "string",
  "description": "Detailed explanation of logging issues found in the code, noting inconsistencies and opportunities for better log practices."
},
"explanation": {
  "title": "explanation",
  "type": "string",
  "description": "Overall summary of the code review findings, highlighting the most critical issues and providing general recommendations."
}  },
  "required": ["modularity_score",
	"modularity_description",
	"error_handling_score",
	"error_handling_description",
	"logging_score",
	"logging_description",
	"explanation"]
}