Prompt Engineering

This example shows how Weco can iteratively improve a prompt for solving American Invitational Mathematics Examination (AIME) problems. The experiment runs locally, requires only two short Python files and a prompt guide, and aims to improve the accuracy metric.

Setup

Clone the repository:

git clone https://github.com/WecoAI/weco-cli.git
cd examples/prompt

Install the CLI and dependencies for the example:
```
pip install weco openai datasets
```
Create your OpenAI API key here, then run:
```
export OPENAI_API_KEY="your_key_here"
```

Files in this folder

File	Purpose
`optimize.py`	Holds the prompt template (instructing the LLM to reason step-by-step and use `\\boxed{}` for the final answer) and the mutable `EXTRA_INSTRUCTIONS` string. Weco edits only this file during the search.
`eval.py`	Downloads a small slice of the 2024 AIME dataset, calls `optimize.solve` in parallel, parses the LLM output (looking for `\\boxed{}`), compares it to the ground truth, prints progress logs, and finally prints an `accuracy:` line that Weco reads.

Now run Weco to optimize your prompt:

weco run --source optimize.py \
     --eval-command "python eval.py" \
     --metric score \
     --goal maximize \
     --steps 20 \
     --model o4-mini \
     --additional-instructions "Improve the prompt to get better scores. Focus on clarity, specificity, and effective prompt engineering techniques."

During each evaluation round, you will see log lines similar to the following:

[setup] loading 20 problems from AIME 2024 …
[progress] 5/20 completed, accuracy: 0.0000, elapsed 7.3 s
[progress] 10/20 completed, accuracy: 0.1000, elapsed 14.6 s
[progress] 15/20 completed, accuracy: 0.0667, elapsed 21.8 s
[progress] 20/20 completed, accuracy: 0.0500, elapsed 28.9 s
accuracy: 0.0500

Weco then mutates the prompt instructions in optimize.py, tries again, and gradually pushes the accuracy higher.

How it works

eval.py slices the Maxwell-Jia/AIME_2024 dataset to twenty problems for fast feedback. You can change the slice in one line within the script.
The script sends model calls in parallel via ThreadPoolExecutor, so network latency is hidden.
Every five completed items, the script logs progress and elapsed time.
The final line accuracy: value is the only part Weco needs for guidance.

Next Steps

Now that you've automated prompt engineering for yourself, check out our guide on Model Development or CUDA Kernel Engineering.

You can check out our CLI Reference to learn more about what you can do with the tool.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prompt Engineering

Setup

Files in this folder

How it works

Next Steps

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Prompt Engineering

Setup

Files in this folder

How it works

Next Steps