Promptfoo: A Prompt Regression Testing and Multi-Model Evaluation Framework

GitHub: promptfoo/promptfoo
Stars: 22,200+ | License: MIT | Now part of OpenAI, still open source
Website: promptfoo.dev

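Project Overview

Promptfoo is a developer-oriented framework for evaluating and regression-testing LLM prompts. Instead of tweaking a prompt by hand in a playground, Promptfoo turns prompt testing into an engineering workflow: a declarative YAML file defines the test cases, model providers, and assertion rules, a single command runs the whole evaluation suite, and a local web UI shows the results of multiple models side by side.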

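Promptfoo's core idea is "prompt as code": prompts should be version-controlled, regression-tested, and wired into CI just like code. After you change a prompt template, rerunning the suite on every test case shows whether the change caused a regression, and listing the old and new versions together under prompts puts their outputs side by side. It also supports A/B comparison across models: how the same prompt performs on GPT, Claude, and Gemini is visible on a single page of the web viewer.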

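The project was created by Ian Webster and initially focused on multi-model prompt evaluation from the CLI. It was acquired by OpenAI in late 2025 and remains under the MIT license. It has since grown into a platform covering prompt evaluation, red teaming, RAG quality evaluation, and code scanning in CI/CD. According to the project documentation, Promptfoo has been battle-tested in production applications serving more than 10 million users and is used internally at both OpenAI and Anthropic. As of June 2026, the repository has more than 22,200 GitHub stars.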

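Feature Overview

Prompt Regression Testing

Automated prompt regression testing is Promptfoo's core feature. You define test cases (tests) and assertions (assert) in a YAML configuration file. Promptfoo runs the prompt against each test case, checks the assertions, and produces a detailed evaluation report.

A minimal configuration, here a regression test for a translation prompt: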

prompts:
  - 'Convert the following English text to {{language}}: {{input}}'

providers:
  - openai:chat:gpt-5.4
  - openai:chat:gpt-5.4-mini
  - anthropic:messages:claude-opus-4-6
  - google:gemini-3.1-pro-preview

tests:
  - vars:
      language: French
      input: Hello world
    assert:
      - type: contains
        value: 'Bonjour le monde'
  - vars:
      language: Spanish
      input: Where is the library?
    assert:
      - type: icontains
        value: 'Dónde está la biblioteca'

Prompts use Nunjucks template syntax, with {{variable}} as the placeholder for a variable. After each change to a prompt template, run promptfoo eval to check that all test cases still pass.
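To make the placeholder and assertion mechanics concrete, here is a minimal Python sketch. It is not Promptfoo's implementation: the render, contains, and icontains helpers are hypothetical names, and the regex substitution is a simplified stand-in for Nunjucks that only handles plain {{name}} placeholders.

```python
import re


def render(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with its value (simplified stand-in for Nunjucks)."""
    def substitute(match: re.Match) -> str:
        return str(variables[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


def contains(output: str, value: str) -> bool:
    """Case-sensitive substring check, mirroring the idea behind `contains`."""
    return value in output


def icontains(output: str, value: str) -> bool:
    """Case-insensitive substring check, mirroring the idea behind `icontains`."""
    return value.lower() in output.lower()


template = "Convert the following English text to {{language}}: {{input}}"
prompt = render(template, {"language": "French", "input": "Hello world"})
print(prompt)  # Convert the following English text to French: Hello world
```

The case-sensitive variant explains why the first test case in the configuration above would fail on an output such as "bonjour le monde", while the icontains test tolerates differences in capitalization.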

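Multi-Model A/B Comparison

Promptfoo supports 60+ model providers, including OpenAI, Anthropic, Google, Azure, AWS Bedrock, Ollama, DeepSeek, and Groq. Declaring several models in the providers list is enough to evaluate them side by side: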

prompts:
  - 'Solve this riddle: {{riddle}}'

providers:
  - openai:chat:gpt-5.4
  - openai:chat:gpt-5.4-mini

tests:
  - vars:
      riddle: 'I speak without a mouth and hear without ears. I have no body, but I come alive with wind. What am I?'
    assert:
      - type: contains
        value: echo
      - type: llm-rubric
        value: Do not apologize
  - vars:
      riddle: 'The more of this there is, the less you see. What is it?'
    assert:
      - type: contains
        value: darkness

Providers can also be overridden on the command line, without editing the configuration file:

npx promptfoo@latest eval -r google:gemini-3.1-pro-preview google:gemini-2.5-pro

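The promptfoo view command opens a local web interface that shows, as a matrix, the latency, cost, and pass/fail status of every test case on every model, which makes problems quick to locate.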
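The matrix view can be pictured as a pivot of flat result rows into a test-by-provider grid. The sketch below is only an illustration of that idea: build_matrix, the row fields, and the sample values are invented for this example and are not Promptfoo's output format.

```python
def build_matrix(rows: list) -> dict:
    """Group flat result rows into {test: {provider: cell}}."""
    matrix = {}
    for row in rows:
        status = "PASS" if row["passed"] else "FAIL"
        cell = f'{status} {row["latency_ms"]}ms ${row["cost"]:.4f}'
        matrix.setdefault(row["test"], {})[row["provider"]] = cell
    return matrix


# Invented rows, for illustration only (not real measurements).
rows = [
    {"test": "riddle-1", "provider": "provider-a", "passed": True, "latency_ms": 120, "cost": 0.0012},
    {"test": "riddle-1", "provider": "provider-b", "passed": False, "latency_ms": 340, "cost": 0.002},
]
matrix = build_matrix(rows)
for test, cells in matrix.items():
    print(test, cells)
```

Each row of the grid is a test case and each column a provider, so a single failing cell points directly at the prompt, input, and model combination to investigate.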

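Assertion Types

Promptfoo provides assertion types ranging from simple text matching to LLM-based grading:

Assertion type	Purpose	Example
`contains`	Output contains the given text	`value: 'echo'`
`icontains`	Case-insensitive containment match	`value: 'Dónde está la biblioteca'`
`regex`	Regular expression match	`value: '\d{3}-\d{4}'`
`llm-rubric`	An LLM grades the output against free-form instructions	`value: 'Do not mention that you are an AI'`
`model-graded-closedqa`	An LLM grades the correctness of a closed-ended answer	Graded against a reference answer
`javascript`	Custom JS scoring function (returns 0-1)	`value: 'Math.max(0, Math.min(1, 1 - (output.length - 100) / 900))'`
`cost`	Inference cost threshold (USD)	`threshold: 0.002`
`latency`	Response latency threshold (ms)	`threshold: 3000`
`is-json`	Checks that the output is valid JSON	No value needed
`factuality`	RAG factuality evaluation	Graded against the provided facts
`answer-relevance`	RAG answer relevance	Graded against the question
`context-recall`	RAG context recall	Graded against the ground truth
`context-relevance`	RAG context relevance	Measures retrieval quality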
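The deterministic assertions in the table reduce to small predicates. The following Python sketch shows the idea behind regex, is-json, and the threshold-style cost and latency checks. The function names are hypothetical and this is not how Promptfoo implements them internally.

```python
import json
import re


def regex_matches(output: str, pattern: str) -> bool:
    """True if the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


def is_json(output: str) -> bool:
    """True if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def within_threshold(measured: float, threshold: float) -> bool:
    """Shared shape of cost (USD) and latency (ms) checks: pass at or below the threshold."""
    return measured <= threshold


print(regex_matches("Call 555-1234", r"\d{3}-\d{4}"))  # True
print(is_json('{"status": "ok"}'))                     # True
print(within_threshold(3500, 3000))                    # False
```

Checks like these are cheap and reproducible, so they are a reasonable first line of defense before spending tokens on LLM-graded assertions such as llm-rubric.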

Global Assertions with defaultTest

defaultTest sets assertions that apply to every test case, so they do not need to be repeated in each one:

defaultTest:
  assert:
    - type: cost
      threshold: 0.002
    - type: latency
      threshold: 3000

Individual test cases can still declare their own assert block, which applies in addition to the global assertions.
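One way to picture how global and per-test assertions combine is plain list concatenation. This is a sketch under that assumption: effective_assertions is a hypothetical helper, and the exact merge behavior should be confirmed against the Promptfoo configuration reference.

```python
def effective_assertions(default_test: dict, test: dict) -> list:
    """Return the default assertions followed by the test's own assertions (assumed merge rule)."""
    return list(default_test.get("assert", [])) + list(test.get("assert", []))


default_test = {
    "assert": [
        {"type": "cost", "threshold": 0.002},
        {"type": "latency", "threshold": 3000},
    ]
}
test_case = {
    "vars": {"language": "French", "input": "Hello world"},
    "assert": [{"type": "contains", "value": "Bonjour le monde"}],
}

merged = effective_assertions(default_test, test_case)
print([a["type"] for a in merged])  # ['cost', 'latency', 'contains']
```

Under this reading, a test case with no assert block of its own is still checked against the cost and latency thresholds.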

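CI/CD Integration

Promptfoo integrates with CI/CD so that prompt evaluations run automatically on pull requests. When a PR modifies a prompt template file, Promptfoo can run the regression tests and report the results in a PR comment. A Code Scanning Action is also available for reviewing LLM-related security and compliance issues in a PR.

By default, the process exits with code 0 if every test case passes and with a non-zero code if any assertion fails, which fails the CI pipeline.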
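The exit-code contract is what makes the tool usable as a gate from any script, not only from a CI service. The sketch below wraps a command with Python's subprocess module. run_gate and main are hypothetical helper names, and the only Promptfoo-specific assumption is the npx promptfoo@latest eval command used elsewhere in this article.

```python
import subprocess
import sys


def run_gate(command: list) -> bool:
    """Run a command and report whether it exited with code 0."""
    completed = subprocess.run(command)
    return completed.returncode == 0


def main() -> None:
    """Propagate the evaluation result as this script's own exit status."""
    ok = run_gate(["npx", "promptfoo@latest", "eval"])
    sys.exit(0 if ok else 1)
```

Calling main() from a pre-merge script gives the same pass/fail behavior as the CI job: a single failed assertion makes the whole step fail.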

RAG Quality Evaluation

Promptfoo has built-in RAG metrics for scoring the output of a retrieval-augmented generation system:

npx promptfoo@latest init --example eval-rag

The evaluation dimensions are factuality, answer-relevance, context-recall, context-relevance, and context-faithfulness.

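Red Teaming

Promptfoo includes red teaming: it scans LLM applications for vulnerabilities and generates security reports. It follows security frameworks such as the NIST AI RMF and provides a configurable system of plugins and strategies. Red teaming is also declared in YAML, with support for custom attack strategies and for verifying defenses.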

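Privacy and Local Execution

All LLM evaluation runs locally on your machine. Promptfoo only orchestrates the API calls: prompts and test data go to the model providers you configure and are not uploaded to any other third-party server. This matters most for enterprise users who handle sensitive business data and proprietary prompts.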

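Use Cases

Prompt template version management: run regression tests after every template change so that edits do not introduce bugs
Model selection: A/B-compare the same prompt across models and pick the best one on cost, latency, and quality
Model upgrade validation: when moving from GPT-4o to GPT-5.4, run the evaluation in bulk to confirm that quality does not regress
RAG tuning: quantify retrieval and generation quality to back the choice of embedding model, chunking strategy, and top-k
Security and compliance audits: combine red teaming scans with code review so that LLM applications meet security requirements
CI/CD quality gate: PR-level automatic evaluation keeps non-compliant prompt changes out of the main branch
Multilingual validation: test the consistency and accuracy of a translation prompt across languages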

Quick Start

Installation

# Node.js (recommended)
npm install -g promptfoo

# Or with Homebrew
brew install promptfoo

# Or with pip
pip install promptfoo

# Or run without installing, via npx
npx promptfoo@latest

Prerequisite: Node.js ^20.20.0 or >=22.22.0. Set your API keys as environment variables:

export OPENAI_API_KEY=sk-abc123
export ANTHROPIC_API_KEY=sk-ant-xyz789

Initialize an Example Project

promptfoo init --example getting-started
cd getting-started

This creates an example project directory containing a complete YAML configuration.

Run the Evaluation

promptfoo eval

When the evaluation finishes, the terminal shows the pass/fail status of each test case.

View the Results

promptfoo view

This starts the web viewer locally and shows the side-by-side multi-model results as a matrix in the browser.

Command Cheat Sheet

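Command	Purpose
`promptfoo init`	Initialize an evaluation project
`promptfoo init --example getting-started`	Create the example project
`promptfoo eval`	Run the evaluation suite
`promptfoo view`	Start the web results viewer
`promptfoo eval -r provider1 provider2`	Run with providers given on the command line
`promptfoo feedback`	Send feedback to the Promptfoo developers
`promptfoo redteam`	Run red teaming scans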

Source Architecture

Promptfoo is built on Node.js and TypeScript. The core repository is laid out as follows:

promptfoo/
├── src/
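│   ├── evaluators/     # Assertion evaluators (contains, llm-rubric, javascript, etc.)
│   ├── providers/      # Model provider adapters (60+ providers)
│   ├── redteam/        # Red teaming module (plugins, strategies, scanner)
│   ├── assertions/     # Assertion engine
│   ├── prompts/        # Prompt loading and template rendering (Nunjucks)
│   ├── generators/     # Test case generators
│   ├── database/       # Persistence of evaluation results
│   ├── web/            # Web viewer frontend
│   └── cli/            # Command-line interface
├── examples/           # Example configurations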
│   ├── getting-started/
│   ├── eval-rag/
│   ├── compare-openai-models/
│   ├── eval-self-grading/
│   └── openai-agents-basic/
├── site/               # Documentation site source
└── test/               # Test suite

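Core design patterns:

Provider Adapter: each LLM provider plugs in through a standardized adapter interface, so adding a provider only requires implementing the callApi() and id() methods
Pipeline Architecture: execution follows the pipeline load prompts -> render with vars -> call providers -> run assertions -> generate report
Declarative Config: all evaluation settings are declared in YAML, and a $schema reference enables IDE autocompletion
Caching Layer: a built-in cache avoids repeated calls for prompt + provider combinations that were already evaluated, which saves API cost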
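The adapter, pipeline, and caching patterns above can be sketched together in a few lines of Python. Everything here is illustrative rather than taken from the Promptfoo source: EchoProvider, CachedProvider, and evaluate are hypothetical names, call_api is a synchronous Python-style rendering of callApi(), and template rendering is reduced to string replacement.

```python
class EchoProvider:
    """Hypothetical offline provider: exposes id() and call_api(), the two adapter methods."""

    def __init__(self) -> None:
        self.calls = 0

    def id(self) -> str:
        return "echo:upper"

    def call_api(self, prompt: str) -> dict:
        self.calls += 1
        return {"output": prompt.upper()}


class CachedProvider:
    """Caching layer: reuse the response for a (provider id, prompt) pair already seen."""

    def __init__(self, inner) -> None:
        self.inner = inner
        self.cache = {}

    def id(self) -> str:
        return self.inner.id()

    def call_api(self, prompt: str) -> dict:
        key = (self.inner.id(), prompt)
        if key not in self.cache:
            self.cache[key] = self.inner.call_api(prompt)
        return self.cache[key]


def evaluate(template: str, providers: list, tests: list) -> list:
    """Pipeline: render with vars -> call providers -> run assertions -> collect report rows."""
    report = []
    for provider in providers:
        for test in tests:
            prompt = template
            for name, value in test["vars"].items():
                prompt = prompt.replace("{{" + name + "}}", str(value))
            output = provider.call_api(prompt)["output"]
            passed = all(check(output) for check in test["assert"])
            report.append({"provider": provider.id(), "prompt": prompt, "passed": passed})
    return report


echo = EchoProvider()
tests = [{"vars": {"word": "hi"}, "assert": [lambda out: "HI" in out]}] * 2
report = evaluate("Say {{word}}", [CachedProvider(echo)], tests)
print(report[0], echo.calls)
```

Because both test cases render to the same prompt, the wrapped provider is called only once, which is the cost saving the caching layer is meant to deliver.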

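Hands-On Demos

The following demos cover three scenarios: multi-model A/B comparison, automated prompt quality scoring, and CI/CD integration.

Demo 1: Multi-Model A/B Comparison of a Customer Support Bot Prompt

Step 1: Create the YAML configuration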

# promptfooconfig.yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: Customer support bot prompt comparison across models

prompts:
  - |
    You are a helpful customer support agent for an e-commerce platform.
    Be concise, polite, and do not make up information.

    Customer: {{question}}
    Response:

providers:
  - openai:chat:gpt-5.4
  - openai:chat:gpt-5.4-mini
  - anthropic:messages:claude-opus-4-6
  - google:gemini-3.1-pro-preview

defaultTest:
  assert:
    - type: cost
      threshold: 0.005
    - type: latency
      threshold: 5000
    - type: llm-rubric
      value: Do not mention that you are an AI or language model
    - type: llm-rubric
      value: Do not make up information that is not provided

tests:
  - vars:
      question: I received a damaged item, what should I do?
    assert:
      - type: contains
        value: return
      - type: llm-rubric
        value: The response is empathetic and acknowledges the customer's frustration

  - vars:
      question: Do you ship to Canada?
    assert:
      - type: llm-rubric
        value: Does not definitively say yes or no unless it is certain

  - vars:
      question: What is the meaning of life?
    assert:
      - type: llm-rubric
        value: Politely redirects to e-commerce related topics
      - type: llm-rubric
        value: Does not attempt to answer the philosophical question

  - vars:
      question: Can I cancel my order after it has been shipped?
    assert:
      - type: llm-rubric
        value: Explains the cancellation policy clearly without overpromising

  - vars:
      question: What payment methods do you accept?
    assert:
      - type: icontains
        value: credit card
      - type: llm-rubric
        value: Lists specific payment methods

Step 2: Run the evaluation

promptfoo eval

Step 3: View the comparison

promptfoo view

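In the web viewer, you can see the following for each of the four models on every test case:

Generated text, side by side
Pass/fail status of each assertion
API call cost ($)
Response latency (ms)
The LLM grader's comments and scores

Demo 2: Automated Prompt Quality Scoring and Iteration

Use llm-rubric and javascript assertions to score prompt quality automatically: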

# promptfooconfig-quality.yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: Prompt quality evaluation with LLM rubric and custom scoring

prompts:
  - file://prompts.txt

providers:
  - openai:chat:gpt-5.4

defaultTest:
  assert:
    - type: llm-rubric
      value: Do not mention that you are an AI or chat assistant
    - type: llm-rubric
      value: Response is helpful and directly addresses the question
    - type: javascript
      value: Math.max(0, Math.min(1, 1 - (output.length - 100) / 900))

tests:
  - vars:
      name: Bob
      question: Can you help me find a specific product on your website?
  - vars:
      name: Jane
      question: Do you have any promotions or discounts currently available?
  - vars:
      name: Alex
      question: How long does shipping usually take?
  - vars:
      name: Sam
      question: Can I return items purchased on sale?
  - vars:
      name: Priya
      question: Do you offer gift wrapping services?

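The javascript assertion here defines a length-based score: outputs of up to 100 characters get the full score of 1, and the score then falls linearly to 0 at 1,000 characters. Combined with the llm-rubric quality checks, this measures prompt quality along several dimensions.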
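To see how the length score behaves, here is the same formula transcribed into Python with a few worked values. length_score is a hypothetical name for this illustration, and Python's len counts code points while JavaScript's output.length counts UTF-16 code units, so the two agree for ASCII text but can differ for some other characters.

```python
def length_score(output: str) -> float:
    """Python transcription of Math.max(0, Math.min(1, 1 - (output.length - 100) / 900))."""
    return max(0.0, min(1.0, 1 - (len(output) - 100) / 900))


for length in (50, 100, 550, 1000, 1900):
    print(length, length_score("x" * length))
# 50 and 100 characters score 1.0, 550 scores 0.5, 1000 and 1900 score 0.0
```

The clamp to the range 0 to 1 is what keeps very short answers from scoring above 1 and very long answers from scoring below 0.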

Demo 3: CI/CD Integration

Integrating a Promptfoo evaluation into GitHub Actions:

# .github/workflows/prompt-eval.yml
name: Prompt Evaluation

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'promptfooconfig.yaml'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '22'
      - run: npx promptfoo@latest eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

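When a PR changes a prompt file under prompts/ or the promptfooconfig.yaml configuration, the evaluation runs automatically. Any failed assertion fails the CI run and blocks the non-compliant prompt change from being merged.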

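Comparison with LangSmith and Braintrust

Dimension	Promptfoo	LangSmith	Braintrust
Core idea	Developer-first, driven by CLI + YAML configuration	Observability platform for the full LLM application lifecycle	AI evaluation and experiment management platform
Deployment	Runs 100% locally	SaaS platform + self-hosted	SaaS platform
Privacy	Data stays on your machine (apart from calls to the configured providers)	Data is uploaded to LangSmith servers	Data is uploaded to Braintrust servers
Multi-model comparison	Native, 60+ providers out of the box	Through LangChain integration	Custom integration
CI/CD integration	Native, driven by exit codes	Through SDK + webhooks	Through SDK + API
Red teaming	Built-in security scanning and NIST AI RMF	Not built in	Not built in
RAG evaluation	Built-in multi-dimensional RAG metrics	Through LangSmith datasets	Custom implementation
Learning curve	Low, declarative YAML configuration	Medium, requires LangChain integration	Medium
Open source	Yes (MIT)	Partially	Yes
Price	Free	Free tier + paid	Free tier + paid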

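References

Official documentation: https://promptfoo.dev
GitHub repository: https://github.com/promptfoo/promptfoo
Config schema: https://promptfoo.dev/config-schema.json (reference it in YAML to enable IDE autocompletion)
Getting started: https://promptfoo.dev/docs/getting-started/
Red teaming guide: https://promptfoo.dev/docs/red-team/quickstart/
RAG evaluation: promptfoo init --example eval-rag (built-in RAG evaluation example)
Related reading: SKILL-DSPy, SKILL-Guidance, and this article cover prompt optimization, controlled generation, and regression testing respectively, three core dimensions of prompt engineering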