29 changes: 28 additions & 1 deletion examples/README.md
@@ -16,6 +16,7 @@ Explore runnable examples that show how to use Weco to optimize ML models, promp
- [🧠 Prompt Engineering](#-prompt-engineering)
- [📊 Extract Line Plot — Chart to CSV](#-extract-line-plot--chart-to-csv)
- [🛰️ Model Development — Spaceship Titanic](#️-model-development--spaceship-titanic)
- [🕵️ Fraud Detection — IEEE-CIS](#️-fraud-detection--ieee-cis)

### Prerequisites

@@ -35,6 +36,7 @@ pip install weco
| 🧠 Prompt Engineering | Iteratively refine LLM prompts to improve accuracy | `openai`, `datasets`, OpenAI API key | [README](prompt/README.md) |
| 📊 Agentic Scaffolding | Optimize agentic scaffolding for chart-to-CSV extraction | `openai`, `huggingface_hub`, `uv`, OpenAI API key | [README](extract-line-plot/README.md) |
| 🛰️ Spaceship Titanic | Improve a Kaggle model training pipeline | `pandas`, `numpy`, `scikit-learn`, `torch`, `xgboost`, `lightgbm`, `catboost` | [README](spaceship-titanic/README.md) |
| 🕵️ Fraud Detection | Optimize a fraud pipeline on IEEE-CIS (real Vesta transactions) | `pandas`, `numpy`, `scikit-learn`, `lightgbm`, `pyarrow`, `kaggle` | [README](fraud-detection/README.md) |

---

@@ -162,8 +164,33 @@ weco run --source train.py \
--log-dir .runs/spaceship-titanic
```

### 🕵️ Fraud Detection — IEEE-CIS

Optimize a tabular fraud-detection pipeline on real Vesta payment data.
Reproduces Weco's
[fraud-detection case study](https://weco.ai/blog/framing-the-problem)
(baseline AUC 0.914 → pooled 6-seed mean 0.9305 ± 0.0035 with full
instructions at 200 steps).

- **Prereqs**: Kaggle API token + [join the competition](https://www.kaggle.com/c/ieee-fraud-detection)
- **Install Dependencies**: `pip install -r requirements.txt`
- **Prepare data** (once, ~2-3 min): `python prepare_data.py`
- **Run**:
```bash
cd examples/fraud-detection
weco run --source train.py \
--eval-command "python evaluate.py" \
--metric auc_roc \
--goal maximize \
--steps 50 \
--model gemini-3.1-pro-preview \
--additional-instructions instructions.md \
--eval-timeout 300 \
--log-dir .runs/fraud-detection
```

---

If you're new to Weco, start with **Hello World**, then try **LangSmith ZephHR QA** for a realistic LangSmith optimization workflow, explore **Triton** and **CUDA** for kernel engineering, **Prompt Engineering** for optimizing an LLM's prompt, **Extract Line Plot** for optimizing agentic scaffolds, **Spaceship Titanic** for model development, or **Fraud Detection** for a production-scale tabular ML case study.


4 changes: 4 additions & 0 deletions examples/fraud-detection/.gitignore
@@ -0,0 +1,4 @@
data/
.runs/
__pycache__/
*.pyc
157 changes: 157 additions & 0 deletions examples/fraud-detection/README.md
@@ -0,0 +1,157 @@
# Fraud Detection (IEEE-CIS)

Optimize a tabular fraud-detection pipeline on the
[IEEE-CIS Fraud Detection](https://www.kaggle.com/c/ieee-fraud-detection) Kaggle
dataset (real Vesta payment transactions). Weco rewrites `train.py` — both
feature engineering and the LightGBM configuration — to maximize AUC-ROC on a
held-out, time-based validation split.

This example reproduces the setup from Weco's fraud-detection case study
([blog post](https://weco.ai/blog/framing-the-problem),
[code](https://github.com/WecoAI/fraud-detection-case-study)). Expected
improvement: **baseline ≈ 0.914 → full-pipeline pooled mean 0.9305 ± 0.0035**
after 200 steps with `gemini-3.1-pro-preview` and the instructions in
`instructions.md`.

## Prerequisites

1. **Kaggle API token**. Put a valid `kaggle.json` at `~/.kaggle/kaggle.json`
(see [Kaggle API credentials](https://github.com/Kaggle/kaggle-api#api-credentials)),
then `chmod 600 ~/.kaggle/kaggle.json` to silence the permissions warning.
2. **You must join the competition.** Visit
<https://www.kaggle.com/c/ieee-fraud-detection> and click "Late Submission" /
   "Join Competition" to accept the rules. Without this, `prepare_data.py`
   fails with a `403 Forbidden` from the Kaggle API — the most common
   first-time stumbling block.
3. **Weco API key** (free tier is fine). See the
[Weco docs](https://docs.weco.ai).

## Setup

```bash
cd examples/fraud-detection

# Virtualenv is strongly recommended — modern Python installs (Debian/Ubuntu,
# recent Homebrew) refuse `pip install` to the system site-packages under
# PEP 668. If you skip this step you'll hit
# `error: externally-managed-environment`.
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# After activation, `python` resolves to the venv's interpreter.

pip install -r requirements.txt

# Downloads ~120MB of CSVs, builds a small 100K/25K parquet split.
# Time-based split: last 20% of transactions by TransactionDT = validation.
# ~2-3 minutes on a modern laptop.
python prepare_data.py
```
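The time-based split that `prepare_data.py` produces can be sketched as follows. This is an illustrative sketch, not the script's actual code; only `TransactionDT` is assumed from the Kaggle schema:

```python
import pandas as pd

def time_split(df: pd.DataFrame, val_frac: float = 0.2):
    """Order by TransactionDT, then hold out the last val_frac as validation."""
    df = df.sort_values("TransactionDT").reset_index(drop=True)
    cut = int(len(df) * (1 - val_frac))
    return df.iloc[:cut], df.iloc[cut:]

# Tiny demo frame: six transactions out of time order.
demo = pd.DataFrame({"TransactionDT": [5, 1, 3, 6, 2, 4],
                     "TransactionAmt": range(6)})
train_df, val_df = time_split(demo, val_frac=0.5)
# Every validation row is strictly later in time than every training row.
```

Sorting before cutting is what makes the split time-based rather than random, which matters for fraud data where patterns drift over time.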

After this you should have:

```
data/
train_transaction.csv, train_identity.csv, test_*.csv # raw
base_train_small.parquet # 100K rows, time-ordered
base_val_small.parquet # 25K rows, later in time
```

## Quick sanity check

Run the baseline once to confirm everything loads:

```bash
python evaluate.py
# → auc_roc: 0.914xxx (takes ~30s)
```

If you see an AUC in the 0.91-0.92 range, you're ready.

## Run Weco

The "default" run uses the full EDA + techniques instructions (recommended —
they contain the column semantics and known-good techniques for this dataset):

```bash
weco run --source train.py \
--eval-command "python evaluate.py" \
--metric auc_roc \
--goal maximize \
--steps 50 \
--model gemini-3.1-pro-preview \
--additional-instructions instructions.md \
--eval-timeout 300 \
--log-dir .runs/fraud-detection
```

Expected trajectory:

- Steps 1–10: Weco explores — tries log-amount, simple aggregations, category
encodings. AUC moves into 0.918-0.925.
- Steps 10–50: builds UID-style features (card1 + addr1 + account-creation
estimate via `D1`), target encoding with out-of-fold protection, velocity
features. AUC climbs to 0.928-0.933.
- Beyond step 50: diminishing returns; the pooled mean across 6 seeds in our
case study was 0.9305 ± 0.0035.

## Explanation

- `--source train.py` — the file Weco rewrites. Both `build_features` and
`train_and_evaluate` are fair game.
- `--eval-command "python evaluate.py"` — called after every proposed edit;
reimports `train.py`, runs the pipeline, prints `auc_roc: 0.xxxxxx`. Weco
parses the last line matching `--metric`.
- `--metric auc_roc --goal maximize` — Weco optimizes the metric printed by
the evaluator.
- `--additional-instructions instructions.md` — injects domain context into
  every optimization step. **This is what matters most.** See the
case study: EDA-level instructions (what each column means in this
specific dataset) drive most of the gain. Kaggle-classic techniques are
typically already in the LLM's pretraining distribution. Feed the optimizer
what it couldn't already know — dataset-specific semantics, proprietary
heuristics, internal constraints.
- `--eval-timeout 300` — one eval takes ~30-60s; 300s gives headroom for
feature-heavy proposals.

## Things to try

1. **No instructions baseline**: remove `--additional-instructions` and watch
variance across seeds balloon (std ~0.008 vs ~0.002 with instructions).
Also watch for silently-leaky proposals (see below).
2. **EDA only**: keep only the column-meaning section of `instructions.md` —
the case study found this accounts for most of the mean gain.
3. **Scope restriction**: point Weco at `train.py`'s `build_features` only by
editing the file to expose just that function (or split the pipeline into
`features.py` + `model.py`). In our case study, features-only delivered
most of the improvement that full-pipeline did.

## Watch out for silent target leakage

IEEE-CIS is a known trap for automated optimizers. A plausible idea like
"count how many columns are zero per row" becomes leaky if the dataframe
still contains `isFraud`, because fraud rows contribute a different count
than non-fraud rows. The `build_features` in `train.py` drops `isFraud` and
`TransactionID` before any cross-column aggregation — don't let proposals
reintroduce aggregations on a dataframe that still contains the label.
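The safe pattern looks like this. A minimal sketch with made-up demo rows; only `isFraud` and `TransactionID` come from the real schema:

```python
import pandas as pd

def rowwise_zero_count(df: pd.DataFrame) -> pd.Series:
    """Count zeros per row AFTER dropping the label and ID columns."""
    feats = df.drop(columns=["isFraud", "TransactionID"], errors="ignore")
    return (feats == 0).sum(axis=1)

demo = pd.DataFrame({
    "TransactionID": [1, 2],
    "isFraud": [0, 1],   # would leak into the count if left in
    "C1": [0, 3],
    "C2": [0, 0],
})
safe = rowwise_zero_count(demo)  # counts only C1/C2 zeros per row
```

Without the `drop`, the non-fraud row would pick up an extra zero from `isFraud` itself, which is exactly the label bleed described above.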

Red flags to look for when a run reports a surprisingly high AUC (> 0.95 on
this subsample):

- Any `df.sum`/`df.mean`/`(df == x)` across all columns before the label is
dropped.
- Target encoding without out-of-fold protection (encoder fit on train + val
concat).
- Features computed using validation data (time leakage: using `val_df` in
  `train.py`'s feature-engineering step).

The case study walks through a real instance where an uninstructed run
reported AUC 0.9591 that dropped to 0.9154 after a one-line fix. See
<https://weco.ai/blog/framing-the-problem>.

## Citing the case study

If you use this example, the underlying numbers come from
<https://github.com/WecoAI/fraud-detection-case-study>. Setup: 200 steps,
3 seeds per condition (6 for the Full pipeline + Full-instructions condition,
pooled since the two ablations share that configuration),
`gemini-3.1-pro-preview`.
35 changes: 35 additions & 0 deletions examples/fraud-detection/evaluate.py
@@ -0,0 +1,35 @@
"""Evaluator Weco calls after each proposed edit.

Loads train.py fresh each run (Weco rewrites it in place), executes the
pipeline, and prints a single `auc_roc: 0.xxxxxx` line that Weco parses as
the metric.
"""

from __future__ import annotations

import importlib.util
import sys
from pathlib import Path


def load_module(path: str):
spec = importlib.util.spec_from_file_location("train_under_test", path)
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)
return mod


def main() -> int:
train = load_module(str(Path(__file__).parent / "train.py"))
auc = train.run_pipeline()

if not (0.0 <= auc <= 1.0):
print(f"Constraint violated: AUC-ROC out of range ({auc})")
return 1

print(f"auc_roc: {auc:.6f}")
return 0


if __name__ == "__main__":
sys.exit(main())
99 changes: 99 additions & 0 deletions examples/fraud-detection/instructions.md
@@ -0,0 +1,99 @@
# Fraud Detection Optimization Instructions

## Task
Optimize `train.py` to maximize AUC-ROC for fraud detection on the IEEE-CIS dataset. You may modify both `build_features` (feature engineering) and `train_and_evaluate` (model config). Keep `run_pipeline`'s interface and the `auc_roc: 0.xxxxxx` print format unchanged so the evaluator can parse the metric.

## Dataset Details
- 100K train / 25K val, 3.5% fraud rate, time-based split
- Base data has 297 columns after V-feature correlation pruning
- Categoricals are already label-encoded as integers
- TransactionDT is in seconds (timedelta from reference date, NOT a timestamp)

## Column Meanings (from Kaggle community reverse-engineering)

### Raw columns
- **TransactionAmt**: USD amount. Heavy-tailed (median $68, max $4578). Log transform essential.
- **ProductCD**: Product type (5 categories: C, H, R, S, W). Each has a distinct V-feature NaN pattern and fraud rate (C=11%, W=2.1%).
- **card1**: Bank Identification Number (BIN) — first 6 digits of card. Top-3 importance.
- **card2**: Additional card info. 1.5% NaN. Top-3 importance.
- **card3/card5**: Card country/product type codes.
- **card4**: Card network (visa, mastercard, etc).
- **card6**: Card type (credit, debit).
- **addr1**: Billing zip code (anonymized). 11.5% NaN.
- **addr2**: Billing country.
- **P_emaildomain**: Purchaser email domain (gmail.com, yahoo.com, etc).
- **R_emaildomain**: Recipient email domain. Mismatch between P and R = fraud signal.
- **dist1/dist2**: Distance features.

### C-features (C1-C14): Entity occurrence COUNTS, no NaN
- **C1** (importance rank #2): Count of addresses associated with the payment card
- **C2**: Count of cards at the billing address
- **C5**: Count of email addresses seen with this card
- **C11**: Count of cards associated with a user identity
- **C12**: Count of addresses associated with a user identity
- **C13** (importance rank #4): Count of distinct email domains per entity — **one of the single most predictive raw features**. High values = fraud ring.
- **C14** (importance rank #3): Related count feature

### D-features (D1-D15): TIMEDELTA in days between events
- **D1** (0.2% NaN, median 1 day): Days since last transaction. Most important D-feature. `TransactionDT/86400 - D1` estimates the **account creation date** — this is the key insight for UID construction.
- **D2** (49% NaN, median 97 days): Days since card was first associated with the identity
- **D3** (46% NaN): Days since last similar transaction
- **D4** (29.5% NaN): Days since email association
- **D10** (14% NaN): Days since last device-linked transaction
- **D11** (52% NaN): Days since account was opened / account age
- **D15** (16.5% NaN, median 46 days): Days since last transaction (alternative)
- D-feature NaN rates themselves are informative — missingness patterns encode transaction type

### M-features (M1-M9): Binary MATCH indicators
Whether certain attributes match each other (name↔address, card↔billing, device↔historical, etc). Sum of True values, count of NaN, and the M-vector signature are all useful.

### V-features (V1-V339, ~202 after pruning): Vesta-engineered risk signals
Grouped by ProductCD — each product type uses a different subset of V-features (others are NaN). V258 is the #1 most important feature overall (gain=16703). Other important V-features: V283, V69, V130, V307, V294, V201.

## Top Winning Techniques (from 1st-3rd place solutions)

### 1. UID Construction (THE most impactful single technique)
```python
D1_start = floor(TransactionDT / 86400 - D1) # estimated account creation day
uid = card1 + "_" + addr1 + "_" + D1_start
```
This creates a stable user fingerprint. All aggregation features should be computed on this UID.
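The formula above, in runnable pandas form. A sketch: the NaN handling and demo rows are illustrative choices, while the column names follow the Kaggle schema:

```python
import numpy as np
import pandas as pd

def add_uid(df: pd.DataFrame) -> pd.DataFrame:
    """Build the card1 + addr1 + estimated-account-creation-day fingerprint."""
    df = df.copy()
    # TransactionDT is in seconds; dividing by 86400 gives the day index.
    d1_start = np.floor(df["TransactionDT"] / 86400 - df["D1"].fillna(0))
    df["uid"] = (
        df["card1"].astype(str) + "_"
        + df["addr1"].fillna(-1).astype(int).astype(str) + "_"
        + d1_start.astype(int).astype(str)
    )
    return df

demo = pd.DataFrame({
    "TransactionDT": [86400 * 10, 86400 * 11],  # days 10 and 11
    "D1": [3.0, 4.0],                           # both imply creation on day 7
    "card1": [1234, 1234],
    "addr1": [100.0, 100.0],
})
out = add_uid(demo)  # both rows get the same uid, "1234_100_7"
```

Two transactions on different days still collapse to one user, because subtracting `D1` recovers the same estimated account-creation day.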

### 2. UID-level aggregation features
For each UID, compute: mean, std, count of TransactionAmt. Then z-score and ratio for each transaction relative to user's history. This captures "is this transaction unusual for this user?"
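A minimal pandas sketch of these per-UID statistics; the epsilon and demo data are illustrative:

```python
import pandas as pd

def uid_amount_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Per-UID amount mean/std, plus a z-score of each transaction vs its user."""
    df = df.copy()
    g = df.groupby("uid")["TransactionAmt"]
    df["uid_amt_mean"] = g.transform("mean")
    df["uid_amt_std"] = g.transform("std").fillna(0)  # single-tx users: std = 0
    df["uid_amt_z"] = (df["TransactionAmt"] - df["uid_amt_mean"]) / (
        df["uid_amt_std"] + 1e-6
    )
    return df

demo = pd.DataFrame({"uid": ["a", "a", "b"],
                     "TransactionAmt": [10.0, 30.0, 50.0]})
out = uid_amount_stats(demo)
```

`transform` keeps the result row-aligned with the original frame, so the z-score directly answers "is this amount unusual for this user?"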

### 3. Temporal centroid distance
Compute the user's typical time-of-day using cyclical hour_sin/hour_cos means. The Euclidean distance of the current transaction from the centroid = "is this at an unusual time for this user?"
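A sketch of the centroid-distance idea; the helper name and demo data are illustrative:

```python
import numpy as np
import pandas as pd

def time_of_day_distance(df: pd.DataFrame) -> pd.Series:
    """Distance of each transaction from the user's typical hour, on the unit circle."""
    hour = (df["TransactionDT"] / 3600) % 24
    sin = np.sin(2 * np.pi * hour / 24)
    cos = np.cos(2 * np.pi * hour / 24)
    df = df.assign(h_sin=sin, h_cos=cos)
    cen = df.groupby("uid")[["h_sin", "h_cos"]].transform("mean")  # per-user centroid
    return np.sqrt((df["h_sin"] - cen["h_sin"]) ** 2
                   + (df["h_cos"] - cen["h_cos"]) ** 2)

demo = pd.DataFrame({
    "uid": ["a", "a", "a"],
    "TransactionDT": [9 * 3600, 9 * 3600, 21 * 3600],  # two at 09:00, one at 21:00
})
dist = time_of_day_distance(demo)  # the 21:00 outlier is farthest from the centroid
```

The cyclical encoding matters: 23:00 and 01:00 are two hours apart on the circle, not twenty-two.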

### 4. D-feature lifecycle lags
D1 - D2, D1 - D4, D1 - D10, D1 - D15: Inconsistencies between these timestamps indicate synthetic identities or account takeovers.

### 5. Velocity features (sort by [uid, TransactionDT])
Time since last transaction per user. Amount change from previous transaction. High velocity + high amount = fraud signal.
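Sketched with `groupby(...).diff()`; the demo data is illustrative:

```python
import pandas as pd

def velocity_features(df: pd.DataFrame) -> pd.DataFrame:
    """Time since a user's previous transaction and the amount delta."""
    df = df.sort_values(["uid", "TransactionDT"]).copy()
    g = df.groupby("uid")
    df["dt_since_last"] = g["TransactionDT"].diff()  # NaN for a user's first tx
    df["amt_delta"] = g["TransactionAmt"].diff()
    return df

demo = pd.DataFrame({
    "uid": ["a", "a", "b"],
    "TransactionDT": [100, 160, 500],
    "TransactionAmt": [20.0, 80.0, 10.0],
})
out = velocity_features(demo)  # user a's second tx: 60s later, +60.0 amount
```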

### 6. Cross-entity cardinality (nunique)
How many unique addr1 values per card1? How many unique card1 per addr1? How many unique P_emaildomain per uid? High cardinality = suspicious.
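A sketch using `groupby(...).transform("nunique")`; the demo data is illustrative:

```python
import pandas as pd

def cross_cardinality(df: pd.DataFrame, key: str, other: str) -> pd.Series:
    """How many distinct `other` values each `key` value has been seen with."""
    return df.groupby(key)[other].transform("nunique")

demo = pd.DataFrame({
    "card1": [1, 1, 1, 2],
    "addr1": [10, 20, 30, 10],  # card 1 spread across three billing addresses
})
n_addr_per_card = cross_cardinality(demo, "card1", "addr1")
```

The same helper covers all the pairings listed above by swapping `key` and `other`.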

### 7. NaN pattern signature
The binary NaN/not-NaN pattern across D+M columns encodes the transaction type. Compute a bitwise signature or just count NaN per feature group.

### 8. Frequency encoding
For card1, card2, addr1, P_emaildomain, etc. — map each value to its frequency. Rare values (appearing once or twice) are fraud signals.
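A sketch using `value_counts` as the frequency map; the demo data is illustrative:

```python
import pandas as pd

def frequency_encode(s: pd.Series) -> pd.Series:
    """Map each category to how often it occurs; rare values stand out."""
    return s.map(s.value_counts())

demo = pd.Series([1234, 1234, 1234, 9999], name="card1")
freq = frequency_encode(demo)  # the lone 9999 encodes to 1
```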

### 9. Interaction features
- amount_zscore × time_distance (unusual amount at unusual time)
- amount_zscore × C1_ratio (unusual amount with unusual address count)
- amount / (D1 + 1) = spending rate per day since last transaction

### 10. Row-wise missingness features
Count of NaN values across D-columns, M-columns, V-columns per row. Sum/mean of M-column values. The NaN pattern encodes the transaction profile.
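A sketch of a per-group NaN count; selecting columns by name prefix is an illustrative shortcut:

```python
import numpy as np
import pandas as pd

def nan_signature(df: pd.DataFrame, prefix: str) -> pd.Series:
    """NaN count across one feature group (columns sharing a prefix)."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    return df[cols].isna().sum(axis=1)

demo = pd.DataFrame({
    "D1": [1.0, np.nan],
    "D2": [np.nan, np.nan],
    "M1": [1, 0],           # not counted, different prefix
})
d_nan_count = nan_signature(demo, "D")  # per-row NaN count over D1/D2 only
```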

## Important Constraints
- Keep code under 300 lines (Weco backend limit)
- Use n_jobs=4 for any model operations
- `train.py` loads `data/base_train_small.parquet` and `data/base_val_small.parquet` — don't change these paths
- Categoricals are already integer-encoded — treat them as numeric
- Keep the `run_pipeline() -> float` function signature and the `auc_roc: 0.xxxxxx` print format intact

## Avoiding silent target leakage
`isFraud` is the label. If you compute features that aggregate across all columns of the dataframe (e.g. `(df == 0).sum(axis=1)`, row-wise NaN counts over the entire frame), drop `isFraud` and `TransactionID` first. Otherwise the label signal bleeds into the features and produces implausibly high AUC (>0.95) that collapses the moment the fix is applied. Target encoding must use out-of-fold protection: compute encoding on train folds only, never on the full train + val concat.
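An out-of-fold encoder can be sketched like this; the `KFold` setup and the fallback-to-prior for unseen categories are illustrative choices, not the case study's exact code:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(train: pd.DataFrame, col: str, target: str,
                      n_splits: int = 2) -> pd.Series:
    """Encode `col` with the target mean computed only on out-of-fold rows."""
    enc = pd.Series(np.nan, index=train.index)
    for fit_idx, enc_idx in KFold(n_splits=n_splits, shuffle=False).split(train):
        # Means come from the OTHER folds, never the rows being encoded.
        means = train.iloc[fit_idx].groupby(col)[target].mean()
        enc.iloc[enc_idx] = train.iloc[enc_idx][col].map(means).values
    return enc.fillna(train[target].mean())  # unseen categories fall back to prior

demo = pd.DataFrame({"card1": [1, 1, 1, 1], "isFraud": [0, 1, 0, 0]})
enc = oof_target_encode(demo, "card1", "isFraud")
```

Each row's encoding never sees its own label, which is the protection the constraint above asks for; fitting on a train + val concat would violate it twice over.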