How to Fine-Tune an LLM Using OpenAI

Mon Mar 11 2024

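Welcome back to our series on fine-tuning large language models (LLMs)! In our previous post, we explored fine-tuning LLMs with Hugging Face. Today we turn to the OpenAI platform. Many people associate OpenAI mainly with ChatGPT and with API-key access for adding AI features to applications, but it offers another powerful capability: fine-tuning models for your specific needs. Fine-tuning lets you build on a pre-trained model's broad knowledge while adapting its behavior to your own dataset.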

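In this post, we walk through fine-tuning a model with the OpenAI API. Whether you want to improve a chatbot, generate new kinds of stories, or build a question-answering system, this guide shows how to get the most out of OpenAI's fine-tuning features.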

# Prerequisites

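Before you begin, make sure the required packages are installed. You need the datasets package to handle the data and the openai package to interact with the OpenAI API. Open a terminal and run the following command.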

pip install datasets openai

The datasets library is a versatile tool for loading and manipulating datasets, and it is especially well suited to machine learning workflows.

# Loading the Dataset

We start by loading the dataset. For demonstration purposes, we use a dataset available on Hugging Face. Here is how to load it:

from datasets import load_dataset

# Load the dataset from Hugging Face
dataset = load_dataset("lamini/lamini_docs")

This code loads the dataset named "lamini/lamini_docs". If you use a different dataset, change the name accordingly.

# Exploring the Dataset

Before fine-tuning, it is important to understand how the dataset is structured. Let's take a look:

dataset

This prints the following:

DatasetDict({
    train: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 1260
    })
    test: Dataset({
        features: ['question', 'answer', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 140
    })
})

The next step is to extract the data we need from the dataset and prepare it for training.

# Formatting the Data for Fine-Tuning

The dataset is split into a training set and a test set. We use only the training data, so let's extract it.

import pandas as pd
train_dataset = dataset['train']
train_df = pd.DataFrame(train_dataset)
questions_answers = train_df[['question', 'answer']]

In this step we keep only the question and answer columns from the DataFrame, because these are the two fields that fine-tuning needs.

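OpenAI requires fine-tuning data in a specific JSONL format: each line must be a JSON object that represents a single training example. Here is how to format the data: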

import json

with open('finetune_data_chat_format.jsonl', 'w') as jsonl_file:
    for _, example in questions_answers.iterrows():
        # One JSON object per line: a system, user, and assistant message per example
        formatted_data = {
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": example['question']},
                {"role": "assistant", "content": example['answer']}
            ]
        }
        jsonl_file.write(json.dumps(formatted_data) + '\n')

Note: Our goal is to build a chatbot and fine-tune gpt-3.5-turbo, which is why we use the conversational chat format. You can find other formats on OpenAI's chat format page.
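Because a malformed line can cause the upload or the training job to be rejected, it can help to sanity-check the file before uploading it. The following is a minimal sketch, not an official validator: the function name validate_chat_jsonl and the EXPECTED_ROLES set are assumptions made for this example, and the checks cover only the three roles used in this article.

```python
import json

# Roles used in this article's training file (an assumption of this sketch;
# adjust the set if your data uses other roles).
EXPECTED_ROLES = {"system", "user", "assistant"}


def validate_chat_jsonl(lines):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue

        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list) or not messages:
            problems.append((number, "missing or empty 'messages' list"))
            continue

        for message in messages:
            if not isinstance(message, dict):
                problems.append((number, "message is not a JSON object"))
                continue
            if message.get("role") not in EXPECTED_ROLES:
                problems.append((number, f"unexpected role: {message.get('role')!r}"))
            if not isinstance(message.get("content"), str):
                problems.append((number, "message content is not a string"))

        # Each example needs an assistant reply for the model to learn from.
        if not any(isinstance(m, dict) and m.get("role") == "assistant" for m in messages):
            problems.append((number, "no assistant message"))
    return problems
```

To check the file written above, open it and pass the file object directly, for example `validate_chat_jsonl(open('finetune_data_chat_format.jsonl'))`, and inspect any reported line numbers before uploading.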

# Uploading the Dataset to OpenAI

Before fine-tuning, you need to upload the formatted dataset to OpenAI:

from openai import OpenAI
from pathlib import Path

client = OpenAI(api_key="your_api_key")
response = client.files.create(
  file=Path('finetune_data_chat_format.jsonl'),
  purpose='fine-tune'
)

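Note: Store your API key securely and never expose it in shared or public repositories. The purpose='fine-tune' argument indicates that the uploaded file is intended for model training.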
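One common way to keep the key out of source code is to read it from an environment variable. The sketch below assumes the key is stored in OPENAI_API_KEY; the helper name load_api_key is hypothetical and exists only for this example.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from the environment instead of hard-coding it in the script."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this script.")
    return key
```

You could then create the client with `OpenAI(api_key=load_api_key())`. The openai Python client also reads OPENAI_API_KEY on its own when no api_key argument is passed, so the helper is mainly useful for failing early with a clear message.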

# Starting the Fine-Tuning Job

With the data uploaded, you can now start fine-tuning:

fine_tune_response = client.fine_tuning.jobs.create(
  training_file=response.id,  # ID of the uploaded file
  model="gpt-3.5-turbo"       # model to fine-tune
)

print("Fine-tuning job started with ID:", fine_tune_response.id)

This starts a fine-tuning job on the selected model. Use the job ID to track the job's progress.

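Note: When training finishes, you will receive an email containing the name of the fine-tuned model. You will use that model name in the testing section.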

# Monitoring Fine-Tuning Progress

You can check the status of the fine-tuning job as follows:

client.fine_tuning.jobs.retrieve("your_fine_tune_job_id")

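Replace "your_fine_tune_job_id" with the ID returned when you created the job. The call returns details about the job, including its current status. Once the job has succeeded, the fine_tuned_model field of the returned object holds the name of the new model.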
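Since a fine-tuning job can take a while, you may prefer to poll for its status rather than check by hand. The following is a minimal sketch under stated assumptions: wait_for_job and its parameters are hypothetical names for this example, and the status source is passed in as a callable so the loop itself does not depend on the network. The terminal statuses listed are the ones the fine-tuning API reports for finished jobs.

```python
import time

# Statuses after which the job will not change any further.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(fetch_status, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status, then return that status.

    Raises TimeoutError if the job has not finished after max_polls checks.
    """
    status = None
    for attempt in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        # Wait before the next check, but not after the final one.
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f"job still {status!r} after {max_polls} polls")
```

With the client from the upload step, a call might look like `wait_for_job(lambda: client.fine_tuning.jobs.retrieve("your_fine_tune_job_id").status)`.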

# Testing the Fine-Tuned Model

Once fine-tuning is complete, it is time to test the model. Here is how to generate a completion with your fine-tuned model:

completion = client.chat.completions.create(
  model="your_fine_tuned_model_name",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Your message here"}
  ],
  max_tokens=50
)
print(completion.choices[0].message.content)

Replace "your_fine_tuned_model_name" and "Your message here" with your model name and your test prompt, respectively.
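When testing, it is generally a good idea to send the same system message that the training examples used, so that the model sees prompts shaped like its training data. The sketch below wraps that in a small helper; build_messages and SYSTEM_PROMPT are hypothetical names for this example, and SYSTEM_PROMPT should be set to whatever string your training file contains.

```python
# Assumption: set this to the exact system message used in your training file.
SYSTEM_PROMPT = "You are a helpful assistant."


def build_messages(question, system_prompt=SYSTEM_PROMPT):
    """Build a chat message list with the same shape as the training examples."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]
```

The result can be passed as the messages argument of `client.chat.completions.create`, which keeps the system message consistent across every test prompt you try.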

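At this point you should have a model fine-tuned to your needs, but this is only the beginning of what you can do with OpenAI. The same platform can be used to fine-tune models for more advanced and complex tasks.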

# Comparing Fine-Tuning LLMs with Hugging Face and OpenAI

Now that we have covered how both approaches work, let's compare them.

| Factor | Hugging Face | OpenAI |
| --- | --- | --- |
| Ease of Use | User-friendly interface. Comprehensive documentation, but requires a strong machine learning background. | Straightforward; requires some machine learning familiarity. |
| Model Availability | Wide range of pre-trained models (BERT, GPT, etc.). | Mainly focuses on GPT variants, with high optimization. Also offers Codex (code generation). |
| Customization | Extensive customization options for fine-tuning. | Simplified customization process, less granular than Hugging Face. |
| Data Privacy | Strong options: allows local or private cloud processing. | Strong, primarily cloud-based. May not suit all data sensitivity needs. |
| Performance | Varies by model and settings; scalable with proper hardware. | High performance, especially in language understanding and generation. |
| Scalability | User-managed scalability, depending on hardware and dataset size. | Managed by OpenAI; less user concern for infrastructure. |
| Cost | Free and paid tiers; cost-effective with good management (especially local processing). | Usage-based pricing; can be expensive at scale. |
| Community & Support | Large, active community with forums, tutorials, and shared projects. | Strong official channels and documentation; less community-driven. |
| Additional Features | TRL library simplifies fine-tuning (SFT, RLHF). | User-friendly API for application integration. |

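Each approach has its strengths and weaknesses, and the right choice depends mainly on your use case. If you need data privacy and have some technical expertise, choose Hugging Face; otherwise, OpenAI is a good option.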

# Conclusion

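Fine-tuning LLMs with the OpenAI API gives you a streamlined, powerful way to customize language models for your specific needs. By following the steps in this article, you can fine-tune a model efficiently and get tailored, high-quality results. Keep in mind that the effectiveness of fine-tuning depends heavily on the quality and relevance of the training data. Invest time in curating and structuring your dataset to get the best results from your fine-tuning effort.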

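Fine-tuning workflows sometimes involve generating embeddings, or vectors, for the input data. In some cases these embeddings can be stored in a vector database for efficient retrieval or similarity search, for example when you have fine-tuned a language model for a specific application such as document classification. MyScale is a SQL vector database designed for AI applications that delivers fast retrieval and similarity search. It is easy for developers to adopt because it is queried with plain SQL.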

For feedback or suggestions, reach out to us on MyScale's Discord.
