# ZephHR QA — LangFuse + Weco Example

Optimize a QA agent that answers HR policy questions over fictional ZephHR documentation.

This example demonstrates using Weco with LangFuse as the evaluation backend. It uses LangFuse datasets, local code evaluators, and managed LLM-as-a-Judge evaluators configured in the LangFuse UI.

## Prerequisites

- Python 3.10+
- `uv pip install 'weco[langfuse]' openai langfuse`
- Environment variables:
  ```bash
  export OPENAI_API_KEY="..."
  export LANGFUSE_SECRET_KEY="sk-lf-..."
  export LANGFUSE_PUBLIC_KEY="pk-lf-..."
  export LANGFUSE_BASE_URL="https://cloud.langfuse.com"  # or https://us.cloud.langfuse.com
  ```

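Before running anything, it can help to fail fast on missing credentials. A minimal sketch (the variable names come from the list above; the check itself is illustrative, not part of this repo):

```python
import os

# The four environment variables this example expects (see the list above).
REQUIRED = [
    "OPENAI_API_KEY",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_BASE_URL",
]


def missing_env(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env()
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```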
## LangFuse UI Setup

Before running optimization, configure two **managed evaluators** (LLM-as-a-Judge) in your LangFuse project. These run server-side and score each agent response automatically.

1. Go to your project in [LangFuse](https://cloud.langfuse.com/) → **Evaluation** → **Evaluators**
2. Click **+ New Evaluator** and create two evaluators:

### Correctness evaluator

- **Name**: `Correctness`
- **Score**: 0 or 1 (binary factual accuracy)
- **Variable mappings**:
  - `{{input}}` → `$.input.question` (the user's question)
  - `{{output}}` → `$.output.answer` (the agent's answer)
  - `{{expected_output}}` → `$.expected_output.expected_answer` (the ground truth)

### Helpfulness evaluator

- **Name**: `Helpfulness`
- **Score**: 0–1 continuous scale
- **Variable mappings**:
  - `{{input}}` → `$.input.question`
  - `{{output}}` → `$.output.answer`

> **Important:** Use the **live preview** when configuring each evaluator to verify that the variable mappings pick up the correct data from your traces. The evaluator names are case-sensitive — `Correctness` and `Helpfulness` must match exactly what you pass to `--langfuse-managed-evaluators`.
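The JSONPath mappings above imply a shape for each dataset item and each trace output. A hypothetical item, sketched in Python (only the key structure follows from the mappings; the question and answer text are invented for illustration):

```python
# Hypothetical dataset item: keys mirror the JSONPath mappings above,
# values are invented examples.
item = {
    "input": {"question": "How many PTO days do ZephHR employees accrue per year?"},
    "expected_output": {"expected_answer": "Employees accrue 20 PTO days per year."},
}

# Shape the agent's trace output must have for $.output.answer to resolve:
output = {"answer": "Employees accrue 20 PTO days per year."}

# Resolution: $.input.question -> item["input"]["question"],
# $.expected_output.expected_answer -> item["expected_output"]["expected_answer"],
# $.output.answer -> output["answer"].
```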

The custom metric function `evaluators:qa_score` combines these scores locally: `Correctness * Helpfulness`.

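A sketch of what that combination could look like in `evaluators.py`. The multiplication is the documented behavior; the function signature is an assumption, since the interface Weco passes scores through may differ:

```python
def qa_score(scores: dict) -> float:
    """Combine managed evaluator scores as Correctness * Helpfulness.

    Binary Correctness (0 or 1) gates the continuous Helpfulness score,
    so a factually wrong answer scores 0 no matter how helpful it reads.
    Assumes `scores` maps evaluator names to numeric values (hypothetical
    interface; the real evaluators.py may receive scores differently).
    """
    return scores.get("Correctness", 0.0) * scores.get("Helpfulness", 0.0)
```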
| 47 | +## Setup |
| 48 | + |
| 49 | +Create the LangFuse datasets: |
| 50 | + |
| 51 | +```bash |
| 52 | +cd examples/langfuse-zeph-hr-qa |
| 53 | +python setup_dataset.py |
| 54 | +``` |
| 55 | + |
| 56 | +This creates two datasets: `zephhr-qa-opt` (optimization) and `zephhr-qa-holdout` (validation). |
| 57 | + |
| 58 | +## Optimize |
| 59 | + |
| 60 | +```bash |
| 61 | +weco run --source agent.py \ |
| 62 | + --eval-backend langfuse \ |
| 63 | + --langfuse-dataset zephhr-qa-opt \ |
| 64 | + --langfuse-target agent:answer_hr_question \ |
| 65 | + --langfuse-evaluators evaluators:json_schema_validity evaluators:conciseness \ |
| 66 | + --langfuse-managed-evaluators Correctness Helpfulness \ |
| 67 | + --langfuse-metric-function evaluators:qa_score \ |
| 68 | + --additional-instructions optimizer_exemplars.md \ |
| 69 | + --metric qa_score --goal maximize --steps 30 |
| 70 | +``` |
| 71 | + |
| 72 | +## Holdout Validation |
| 73 | + |
| 74 | +```bash |
| 75 | +weco run --source agent.py \ |
| 76 | + --eval-backend langfuse \ |
| 77 | + --langfuse-dataset zephhr-qa-holdout \ |
| 78 | + --langfuse-target agent:answer_hr_question \ |
| 79 | + --langfuse-evaluators evaluators:json_schema_validity evaluators:conciseness \ |
| 80 | + --langfuse-managed-evaluators Correctness Helpfulness \ |
| 81 | + --langfuse-metric-function evaluators:qa_score \ |
| 82 | + --metric qa_score --goal maximize --steps 1 |
| 83 | +``` |
| 84 | + |
| 85 | +## File Overview |
| 86 | + |
| 87 | +| File | Purpose | |
| 88 | +|------|---------| |
| 89 | +| `agent.py` | Baseline QA agent (gpt-4o-mini) — Weco optimizes the prompt | |
| 90 | +| `evaluators.py` | LangFuse-format evaluators + `qa_score` metric function | |
| 91 | +| `setup_dataset.py` | Idempotent LangFuse dataset creation from JSON | |
| 92 | +| `docs.md` | ZephHR product documentation (knowledge base) | |
| 93 | +| `optimizer_exemplars.md` | Few-shot Q&A examples passed via `--additional-instructions` | |
| 94 | +| `data/optimization_questions.json` | Optimization set (15 questions) | |
| 95 | +| `data/holdout_questions.json` | Held-out validation set (10 questions) | |
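
For orientation, a local evaluator along the lines of `evaluators:json_schema_validity` might look like the following. This is a hypothetical sketch: the evaluator signature Weco's LangFuse backend expects is not shown in this README, and the required output key (`answer`, matching the `$.output.answer` mapping) is assumed.

```python
import json


def json_schema_validity(output: str) -> float:
    """Score 1.0 if the agent output is valid JSON with the expected key.

    Hypothetical sketch: assumes the agent must emit a JSON object
    containing an "answer" key (per the $.output.answer mapping). The
    actual evaluator in this repo's evaluators.py may differ.
    """
    try:
        parsed = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    return 1.0 if isinstance(parsed, dict) and "answer" in parsed else 0.0
```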