Commit dc01e8b

Merge pull request #123 from WecoAI/dev
Merge Dev - Add langfuse integration and fix min/max score reporting (0.3.21)
2 parents 422d151 + c7c6c08 commit dc01e8b

21 files changed

Lines changed: 3182 additions & 11 deletions
Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
# ZephHR QA — LangFuse + Weco Example

Optimize a QA agent that answers HR policy questions over fictional ZephHR documentation.

This example demonstrates using Weco with LangFuse as the evaluation backend. It uses LangFuse datasets, local code evaluators, and managed LLM-as-a-Judge evaluators configured in the LangFuse UI.

## Prerequisites

- Python 3.10+
- `uv pip install 'weco[langfuse]' openai langfuse`
- Environment variables:
  ```bash
  export OPENAI_API_KEY="..."
  export LANGFUSE_SECRET_KEY="sk-lf-..."
  export LANGFUSE_PUBLIC_KEY="pk-lf-..."
  export LANGFUSE_BASE_URL="https://cloud.langfuse.com"  # or https://us.cloud.langfuse.com
  ```

## LangFuse UI Setup

Before running optimization, configure two **managed evaluators** (LLM-as-a-Judge) in your LangFuse project. These run server-side and score each agent response automatically.

1. Go to your project in [LangFuse](https://cloud.langfuse.com/) → **Evaluation** → **Evaluators**
2. Click **+ New Evaluator** and create two evaluators:

### Correctness evaluator

- **Name**: `Correctness`
- **Score**: 0 or 1 (binary factual accuracy)
- **Variable mappings**:
  - `{{input}}` → `$.input.question` (the user's question)
  - `{{output}}` → `$.output.answer` (the agent's answer)
  - `{{expected_output}}` → `$.expected_output.expected_answer` (the ground truth)

### Helpfulness evaluator

- **Name**: `Helpfulness`
- **Score**: 0–1 continuous scale
- **Variable mappings**:
  - `{{input}}` → `$.input.question`
  - `{{output}}` → `$.output.answer`

> **Important:** Use the **live preview** when configuring each evaluator to verify the variable mappings are picking up the correct data from your traces. The evaluator names are case-sensitive: `Correctness` and `Helpfulness` must match exactly what you pass to `--langfuse-managed-evaluators`.

The custom metric function `evaluators:qa_score` combines these scores locally: `Correctness * Helpfulness`.
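
As a rough illustration of that combination, the sketch below multiplies the two managed scores. The function name `combine_scores` and the plain-dict input are assumptions made for this example; the real `qa_score` in `evaluators.py` follows whatever signature Weco's LangFuse backend expects, which is not shown here. Treating a missing score as 0.0 is also an assumption.

```python
def combine_scores(scores: dict[str, float]) -> float:
    """Return Correctness * Helpfulness.

    Assumption: a score that is missing from the dict counts as 0.0.
    """
    correctness = float(scores.get("Correctness", 0.0))
    helpfulness = float(scores.get("Helpfulness", 0.0))
    return correctness * helpfulness


if __name__ == "__main__":
    # A correct answer keeps its helpfulness score.
    print(combine_scores({"Correctness": 1.0, "Helpfulness": 0.8}))
    # An incorrect answer scores zero regardless of helpfulness.
    print(combine_scores({"Correctness": 0.0, "Helpfulness": 0.9}))
```

Because `Correctness` is binary, the product acts as a gate: an incorrect answer contributes nothing, and a correct answer is ranked by how helpful it is.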

## Setup

Create the LangFuse datasets:

```bash
cd examples/langfuse-zeph-hr-qa
python setup_dataset.py
```

This creates two datasets: `zephhr-qa-opt` (optimization) and `zephhr-qa-holdout` (validation).
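
The evaluator variable mappings read `$.input.question` and `$.expected_output.expected_answer`, so each dataset item presumably carries the question and the ground truth under those keys. The sketch below shows one way a record from the JSON files could be reshaped to fit. `to_dataset_item` is a hypothetical helper, not the code in `setup_dataset.py`, and it makes no LangFuse API calls.

```python
import json


def to_dataset_item(record: dict) -> dict:
    """Reshape one question record into an input/expected_output pair.

    The nesting mirrors the JSONPath mappings used by the managed evaluators.
    """
    return {
        "id": record["id"],
        "input": {"question": record["question"]},
        "expected_output": {"expected_answer": record["expected_answer"]},
    }


if __name__ == "__main__":
    sample = {
        "id": "opt-009",
        "question": "We're on the Professional plan. What's our API rate limit?",
        "expected_answer": "On the Professional plan, the API rate limit is 100 requests per minute.",
    }
    print(json.dumps(to_dataset_item(sample), indent=2))
```

Carrying the record's `id` through gives each item a stable identifier, which is one common way to keep repeated runs from creating duplicates; whether `setup_dataset.py` achieves idempotency this way is not shown in this diff.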

## Optimize

```bash
weco run --source agent.py \
  --eval-backend langfuse \
  --langfuse-dataset zephhr-qa-opt \
  --langfuse-target agent:answer_hr_question \
  --langfuse-evaluators evaluators:json_schema_validity evaluators:conciseness \
  --langfuse-managed-evaluators Correctness Helpfulness \
  --langfuse-metric-function evaluators:qa_score \
  --additional-instructions optimizer_exemplars.md \
  --metric qa_score --goal maximize --steps 30
```

## Holdout Validation

```bash
weco run --source agent.py \
  --eval-backend langfuse \
  --langfuse-dataset zephhr-qa-holdout \
  --langfuse-target agent:answer_hr_question \
  --langfuse-evaluators evaluators:json_schema_validity evaluators:conciseness \
  --langfuse-managed-evaluators Correctness Helpfulness \
  --langfuse-metric-function evaluators:qa_score \
  --metric qa_score --goal maximize --steps 1
```

## File Overview

| File | Purpose |
|------|---------|
| `agent.py` | Baseline QA agent (gpt-4o-mini); Weco optimizes the prompt |
| `evaluators.py` | LangFuse-format evaluators + `qa_score` metric function |
| `setup_dataset.py` | Idempotent LangFuse dataset creation from JSON |
| `docs.md` | ZephHR product documentation (knowledge base) |
| `optimizer_exemplars.md` | Few-shot Q&A examples passed via `--additional-instructions` |
| `data/optimization_questions.json` | Optimization set (15 questions) |
| `data/holdout_questions.json` | Held-out validation set (10 questions) |
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
"""Baseline QA agent for ZephHR documentation.

Answers HR policy questions using only the docs.md knowledge base.
Weco optimizes this file, specifically the SYSTEM_PROMPT and USER_TEMPLATE.
"""

import json
from pathlib import Path

from openai import OpenAI

client = OpenAI()

DOCS = Path(__file__).with_name("docs.md").read_text()

SYSTEM_PROMPT = """You are a ZephHR support assistant. Answer the user's question
using ONLY the provided documentation. Do not guess or invent policy details.

If the documentation does not contain enough information to fully answer,
say so clearly and state what IS covered.

Return your answer as JSON with exactly these fields:
- answer: your response to the question (string)
- confidence: how confident you are the answer is fully supported by the docs (high/medium/low)
- relevant_sections: list of section names from the docs you referenced"""

USER_TEMPLATE = """Documentation:
{docs}

Question: {question}

Return only JSON."""


def answer_hr_question(inputs: dict) -> dict:
    """Answer an HR policy question from the ZephHR docs."""
    question = inputs.get("question", "")

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_TEMPLATE.format(docs=DOCS, question=question)},
        ],
        temperature=0.0,
        response_format={"type": "json_object"},
    )

    try:
        parsed = json.loads(response.choices[0].message.content)
    except (TypeError, json.JSONDecodeError):
        parsed = {}

    confidence = parsed.get("confidence", "low")
    if confidence not in ("high", "medium", "low"):
        confidence = "low"

    relevant_sections = parsed.get("relevant_sections", [])
    if not isinstance(relevant_sections, list):
        relevant_sections = []

    return {"answer": parsed.get("answer", ""), "confidence": confidence, "relevant_sections": relevant_sections}
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
[
  {
    "id": "hold-001",
    "question": "A new employee started 45 days ago and works full-time. Can they enroll in medical benefits yet?",
    "expected_answer": "Not yet. Full-time employees become eligible for benefits on the first day of the month following 60 days of employment. At 45 days, the employee has not yet reached the 60-day threshold, so they must wait."
  },
  {
    "id": "hold-002",
    "question": "Can the support team's geofencing feature be used on the Professional plan?",
    "expected_answer": "No. Geofencing for time and attendance is available only on the Enterprise plan."
  },
  {
    "id": "hold-003",
    "question": "An employee got married last month. Can they add their spouse to benefits outside of open enrollment?",
    "expected_answer": "Yes. Marriage is a qualifying life event that allows benefits changes outside of open enrollment. The employee must submit a qualifying life event request within 30 days of the marriage."
  },
  {
    "id": "hold-004",
    "question": "What's the maximum number of PTO days a 7-year employee can accumulate before accrual stops?",
    "expected_answer": "An employee with 6+ years of tenure earns 25 days per year. PTO accrual is capped at 1.5x the annual entitlement, which means the cap is 37.5 days. Once this balance is reached, accrual pauses until the balance drops below the cap."
  },
  {
    "id": "hold-005",
    "question": "Can a System Admin grant themselves the System Admin role?",
    "expected_answer": "No. System admins cannot grant the System Admin role to themselves. Another System Admin must make this change. All permission changes are recorded in the audit log and retained for 7 years."
  },
  {
    "id": "hold-006",
    "question": "When is open enrollment and when do changes take effect?",
    "expected_answer": "Open enrollment runs annually from November 1\u201315. Changes made during open enrollment take effect on January 1 of the following year."
  },
  {
    "id": "hold-007",
    "question": "We're on Enterprise. What integrations do we get that Professional doesn't have?",
    "expected_answer": "Enterprise includes everything in Professional plus REST API access, custom webhooks, and HRIS data sync with Workday/SAP SuccessFactors. Enterprise also has a higher API rate limit of 500 requests per minute compared to Professional's 100 requests per minute."
  },
  {
    "id": "hold-008",
    "question": "If payroll is due on Saturday the 15th, when does the payment actually go out?",
    "expected_answer": "If a payday falls on a weekend or public holiday, payment is issued on the preceding business day. So if the 15th is a Saturday, payment would be issued on Friday the 14th."
  },
  {
    "id": "hold-009",
    "question": "How many sick days does a part-time employee get and do they carry over?",
    "expected_answer": "Part-time employees receive 5 sick days per year. Sick days reset on January 1 and do not roll over to the next year."
  },
  {
    "id": "hold-010",
    "question": "An employee has been on sick leave for 4 consecutive days. Is there anything extra they need to do?",
    "expected_answer": "Yes. Sick leave beyond 3 consecutive days requires a doctor's note uploaded through the portal."
  }
]
Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
[
  {
    "id": "opt-001",
    "question": "An employee was rehired 10 months after leaving. Do they need to redo full onboarding?",
    "expected_answer": "No. Employees rehired within 12 months of their termination date may use a streamlined re-hire flow that preserves their previous tax elections. Since 10 months is within the 12-month window, they qualify for the streamlined process."
  },
  {
    "id": "opt-002",
    "question": "Our company is on the Starter plan. Can we set up SSO?",
    "expected_answer": "SSO/SAML is not available on the Starter plan. It is available on the Professional plan as an add-on at $2 per user per month, and it is included with the Enterprise plan."
  },
  {
    "id": "opt-003",
    "question": "I'm a manager. Can I see how much my direct report makes?",
    "expected_answer": "No. Managers cannot view or edit compensation details for their reports. Only HR admins have access to compensation information."
  },
  {
    "id": "opt-004",
    "question": "We need to issue a bonus to an employee. What's the process?",
    "expected_answer": "Bonuses are processed as off-cycle payments. An HR admin must request the off-cycle payment, and it requires VP-level approval. Once approved, the off-cycle run is processed within 3 business days."
  },
  {
    "id": "opt-005",
    "question": "When does a new full-time employee become eligible for medical benefits?",
    "expected_answer": "Full-time employees (30+ hours/week) become eligible for benefits on the first day of the month following 60 days of employment."
  },
  {
    "id": "opt-006",
    "question": "What is the SLA for a broken SSO integration?",
    "expected_answer": "SSO failures are classified as P2 \u2013 High priority. The response time SLA is 4 hours with a resolution target of 1 business day."
  },
  {
    "id": "opt-007",
    "question": "Can an HR admin change their own salary in the system?",
    "expected_answer": "No. HR admins cannot modify their own compensation or benefits elections. Another HR admin must make the change. All permission changes are recorded in the audit log."
  },
  {
    "id": "opt-008",
    "question": "An hourly employee forgot to submit their timesheet by Monday 5 PM. When will they get paid?",
    "expected_answer": "Timesheet submissions for hourly employees must be approved by the direct manager by 5:00 PM local time on Monday of the pay week. Late submissions are processed in the next pay cycle with no exceptions."
  },
  {
    "id": "opt-009",
    "question": "We're on the Professional plan. What's our API rate limit?",
    "expected_answer": "On the Professional plan, the API rate limit is 100 requests per minute."
  },
  {
    "id": "opt-010",
    "question": "An employee who worked 32 hours/week just dropped to 28 hours/week for 2 months. Do they lose medical?",
    "expected_answer": "Not yet. Employees who drop below 30 hours/week lose medical benefits only after 3 consecutive months below the threshold. After 2 months, they still retain medical benefits. If they remain below 30 hours for a third consecutive month, they will be reclassified as part-time and lose medical benefits at the end of the third month, but they will retain dental and vision."
  },
  {
    "id": "opt-011",
    "question": "Can a department block PTO for 6 weeks during a product launch?",
    "expected_answer": "No. Departments may declare blackout periods of a maximum of 4 weeks per year. A 6-week blackout would exceed the maximum allowed duration. Blackout periods must also be announced at least 30 days in advance."
  },
  {
    "id": "opt-012",
    "question": "Does ZephHR handle payroll taxes for our employees in the UK?",
    "expected_answer": "ZephHR automatically calculates payroll taxes only for US and Canadian employees. For employees in other jurisdictions like the UK, payroll tax calculations must be configured manually by an HR admin with the Payroll Configuration permission."
  },
  {
    "id": "opt-013",
    "question": "How long is COBRA coverage and who manages it?",
    "expected_answer": "COBRA coverage extends for up to 18 months. COBRA administration is handled by the third-party vendor HealthBridge, not directly through ZephHR. COBRA continuation is offered within 14 days of the employee's termination date."
  },
  {
    "id": "opt-014",
    "question": "I want to request 5 days of PTO starting next week. What approvals do I need?",
    "expected_answer": "Requests of 4 or more days require approval from both your direct manager and the department head. Additionally, planned absences must be submitted at least 5 business days in advance."
  },
  {
    "id": "opt-015",
    "question": "We want to downgrade from Professional to Starter mid-contract. Will we get a refund?",
    "expected_answer": "Downgrading from Professional to Starter mid-contract forfeits access to Professional features immediately with no prorated refund."
  }
]
