
chore(evals): Update model evaluations 2026-05-26 #135

Merged
janisz merged 1 commit into main from chore/update-model-evaluation-2026-05-26
May 26, 2026

Conversation

@rhacs-bot
Contributor

Automated weekly model evaluation update.

Models evaluated: gpt-5-mini
Date: 2026-05-26

This PR was automatically generated by the Model Evaluation workflow.

@rhacs-bot rhacs-bot requested a review from janisz as a code owner May 26, 2026 07:38
@codecov-commenter

codecov-commenter commented May 26, 2026

❌ 2 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|---|---|---|---|
| 380 | 2 | 378 | 12 |
View the full list of 2 ❄️ flaky test(s)
::policy 1

Flake rate in main: 100.00% (Passed 0 times, Failed 44 times)

Stack Traces | 0s run time
- test violation 1
- test violation 2
- test violation 3
::policy 4

Flake rate in main: 100.00% (Passed 0 times, Failed 44 times)

Stack Traces | 0s run time
- testing multiple alert violation messages 1
- testing multiple alert violation messages 2
- testing multiple alert violation messages 3
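
The flake rate reported for both tests can be reproduced with a one-line calculation. This is a minimal sketch that assumes the rate is simply failed runs divided by total runs on main; the function name `flake_rate` is hypothetical and the exact definition used by the analytics tool may differ.

```python
def flake_rate(passed: int, failed: int) -> float:
    """Percentage of recorded runs that failed (assumed definition)."""
    runs = passed + failed
    if runs == 0:
        raise ValueError("no runs recorded")
    return 100 * failed / runs


# Figures from the report above: passed 0 times, failed 44 times.
print(f"Flake rate: {flake_rate(0, 44):.2f}%")
```

Under that assumption, 0 passes and 44 failures give the 100.00% shown for both `::policy` tests.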

To view more test analytics, go to the Test Analytics Dashboard

@github-actions

E2E Test Results

Commit: fa76bca
Workflow Run: View Details
Artifacts: Download test results & logs

=== Evaluation Summary ===

  ✓ cve-clusters-general (assertions: 3/3)
  ✓ list-clusters (assertions: 3/3)
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ~ rhsa-not-supported (assertions: 1/2)
      - MaxToolCalls: Too many tool calls: expected <= 4, got 7
  ✓ cve-multiple (assertions: 3/3)
  ✓ cve-nonexistent (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)

Tasks:      11/11 passed (100.00%)
Assertions: 31/32 passed (96.88%)
Tokens:     ~67974 (estimate - excludes system prompt & cache)
MCP schemas: ~12562 (included in token total)
Agent used tokens:
  Input:  16455 tokens
  Output: 27327 tokens
Judge used tokens:
  Input:  60248 tokens
  Output: 47068 tokens
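
The task and assertion totals in the summary can be cross-checked by re-tallying the per-task lines. The sketch below is not part of the evaluation workflow; the regular expression and the `tally` helper are assumptions made for illustration, and the input is the summary text copied from this comment.

```python
import re

# Per-task lines copied from the Evaluation Summary above.
SUMMARY = """\
  ✓ cve-clusters-general (assertions: 3/3)
  ✓ list-clusters (assertions: 3/3)
  ✓ cve-detected-workloads (assertions: 3/3)
  ✓ cve-cluster-does-exist (assertions: 3/3)
  ✓ cve-cluster-does-not-exist (assertions: 3/3)
  ✓ cve-detected-clusters (assertions: 3/3)
  ✓ cve-log4shell (assertions: 3/3)
  ~ rhsa-not-supported (assertions: 1/2)
      - MaxToolCalls: Too many tool calls: expected <= 4, got 7
  ✓ cve-multiple (assertions: 3/3)
  ✓ cve-nonexistent (assertions: 3/3)
  ✓ cve-cluster-list (assertions: 3/3)
"""

# Hypothetical pattern: status marker, task name, passed/total assertions.
LINE_RE = re.compile(r"^\s*(\S)\s+(\S+)\s+\(assertions:\s*(\d+)/(\d+)\)\s*$")


def tally(summary: str) -> tuple[int, int, int]:
    """Return (task_count, assertions_passed, assertions_total)."""
    tasks = passed = total = 0
    for line in summary.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # detail lines such as the MaxToolCalls message
        tasks += 1
        passed += int(match.group(3))
        total += int(match.group(4))
    return tasks, passed, total


tasks, passed, total = tally(SUMMARY)
print(f"Tasks: {tasks}")
print(f"Assertions: {passed}/{total} passed ({100 * passed / total:.2f}%)")
```

Run against the lines above, this counts 11 tasks and 31 of 32 assertions (96.88%), matching the reported totals. The single missing assertion is the `MaxToolCalls` check on `rhsa-not-supported`.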

@janisz janisz merged commit 81ce9af into main May 26, 2026
10 checks passed
@janisz janisz deleted the chore/update-model-evaluation-2026-05-26 branch May 26, 2026 09:28

4 participants