
Add fraud-detection example (IEEE-CIS) #140

Open
ZhengyaoJiang wants to merge 3 commits into dev from vk/fraud-detection-example

Conversation

@ZhengyaoJiang
Contributor

Summary

  • Reproducible Weco example on real Vesta payment transactions (IEEE-CIS Fraud Detection Kaggle dataset).
  • Mirrors the published case study (blog, repo): baseline AUC ≈ 0.914, pooled 6-seed mean 0.9305 ± 0.0035 after 200 steps with `gemini-3.1-pro-preview` + the bundled `instructions.md`.
  • Scope: both feature engineering (`build_features`) and model config (`train_and_evaluate`) in `train.py` are optimizable. Weco parses `auc_roc: 0.xxxxxx` from the evaluator.
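The metric contract is simple enough to sketch. A minimal illustration of both sides (the `emit_metric`/`parse_metric` names and the regex are illustrative, not Weco's actual implementation — the only contract is the `auc_roc: 0.xxxxxx` stdout line):

```python
import re

def emit_metric(auc: float) -> str:
    # The evaluator's side: print the one line Weco greps for.
    line = f"auc_roc: {auc:.6f}"
    print(line)
    return line

def parse_metric(stdout: str) -> float:
    # Weco's side, sketched: pull the metric back out of captured stdout.
    m = re.search(r"auc_roc:\s*([0-9.]+)", stdout)
    if m is None:
        raise ValueError("no auc_roc line found in evaluator output")
    return float(m.group(1))
```

Anything else the evaluator prints is ignored; only the formatted metric line matters.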

What's in the example

  • `prepare_data.py` — Kaggle download, label-encode + V-feature correlation pruning, time-based 80/20 split, subsample to 100K/25K parquet files. Uses `python -m kaggle.cli` so the venv's bin/ doesn't need to be on PATH; prints a helpful hint on 403 (rules not accepted / kaggle.json perms).
  • `train.py` — Weco's optimization target. Leakage-safe baseline: drops `isFraud` before any cross-column aggregation.
  • `evaluate.py` — reimports `train.py` each run, prints the metric line.
  • `instructions.md` — the full EDA + techniques prompt from the case study, with a silent-target-leakage guardrail.
  • `README.md` — venv setup (PEP 668 safe), data prep, baseline sanity check, Weco run command, "things to try" ablations, and a pointer to the leakage trap.
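The download-and-hint pattern in `prepare_data.py` looks roughly like this (function names and message wording are illustrative, not the exact source):

```python
import subprocess
import sys

def rules_hint(competition: str) -> str:
    # Hint printed on 403: the two most common root causes.
    return (
        "Kaggle returned 403 Forbidden. Common causes:\n"
        f"  1. Competition rules not accepted: https://www.kaggle.com/c/{competition}/rules\n"
        "  2. ~/.kaggle/kaggle.json missing or wrong permissions (chmod 600)."
    )

def download(competition: str) -> None:
    # `python -m kaggle.cli` works even when the venv's bin/ is not on PATH;
    # the `kaggle` package has no __main__, so `-m kaggle` would fail.
    cmd = [sys.executable, "-m", "kaggle.cli",
           "competitions", "download", "-c", competition]
    try:
        subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as exc:
        if "403" in (exc.stderr or "") + (exc.stdout or ""):
            print(rules_hint(competition))
        raise
```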

Verification

Two rounds of fresh-agent testing caught and fixed three issues: the venv prerequisite on modern (PEP 668) Python installs; `python3` vs `python` on Ubuntu; and the `kaggle` package having no `__main__`, so `kaggle.cli` is the required entry point. The final sanity check blocked on a `403 Forbidden` from the Kaggle API (accepting the competition rules is a per-user prerequisite, called out in the README).

Test plan

  • Accept competition rules at https://www.kaggle.com/c/ieee-fraud-detection
  • `cd examples/fraud-detection && python3 -m venv .venv && source .venv/bin/activate`
  • `pip install -r requirements.txt`
  • `python prepare_data.py` produces `data/base_train_small.parquet` and `data/base_val_small.parquet`
  • `python evaluate.py` prints `auc_roc: 0.91x`
  • `weco run ...` (full command in README) moves AUC into 0.928–0.933 by step ~30

🤖 Generated with Claude Code

ZhengyaoJiang and others added 2 commits April 23, 2026 15:25
Self-contained reproduction of Weco's fraud-detection case study. Downloads
the Kaggle dataset, builds a leakage-safe 100K/25K time-based parquet split,
and exposes train.py as the optimization target (feature engineering +
LightGBM config both modifiable). evaluate.py prints auc_roc for Weco.

instructions.md is the full EDA + techniques prompt from the case study —
column semantics for each feature group (TransactionAmt, C/D/M/V), 10
well-known IEEE-CIS techniques (UID construction, target encoding with OOF,
velocity features, frequency encoding), and a target-leakage guardrail
pointing out the isFraud-in-df aggregation trap.
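One of the listed techniques, OOF target encoding, can be sketched as follows (a minimal illustration, not the prompt's exact wording; fold construction here uses a plain permutation rather than any particular library):

```python
import numpy as np
import pandas as pd

def oof_target_encode(df, col, target="isFraud", n_splits=5, seed=0):
    # Each row's encoding is computed from folds that exclude it, so the
    # row's own isFraud label never leaks into its feature value.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(df)), n_splits)
    out = np.empty(len(df))
    global_mean = df[target].mean()
    for va_idx in folds:
        tr = df.drop(df.index[va_idx])
        means = tr.groupby(col)[target].mean()
        out[va_idx] = df[col].iloc[va_idx].map(means).fillna(global_mean).to_numpy()
    return out
```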

README walks through Kaggle API setup, prepare_data step, baseline sanity
check (~0.914 AUC), and the canonical weco run command
(gemini-3.1-pro-preview, 50 steps, expected trajectory into 0.928-0.933).
Also adds 'things to try' (no-instructions variance blow-up, EDA-only
ablation, scope restriction) and a silent-target-leakage watch-out pointing
to the published case study.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er test

Two fresh-agent test rounds surfaced three issues; all fixed:

- kaggle CLI: the `kaggle` package has no __main__, so `python -m kaggle`
  crashes with ModuleNotFoundError. Correct entry point is `kaggle.cli`.
- venv instruction used `python -m venv`, which fails on Debian/Ubuntu
  systems where only `python3` exists (no python-is-python3). Changed to
  `python3 -m venv`. After activation `python` resolves correctly.
- pip-install fails on modern PEP 668 systems without a venv. README now
  leads with the venv setup before the install step, with a note on why.

Also: prepare_data.py now catches Kaggle CalledProcessError and prints
the two most common root causes (rules not accepted / kaggle.json perms)
with the exact URL to accept the competition rules.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 92cb31d6a4


Comment thread: examples/fraud-detection/train.py (Outdated)
y_val = val_df["isFraud"].values.astype(np.int32)

n_train = len(train_df)
df = pd.concat([train_df, val_df], axis=0, ignore_index=True)

P1: Fit feature aggregations on training data only

build_features concatenates train_df and val_df before creating grouped amount statistics and frequency encodings, so validation rows (future data in this time-based split) directly shape the engineered features used for evaluation. That leaks validation distribution into the pipeline and can systematically inflate the reported AUC that Weco optimizes against. Compute these encodings/aggregations from train_df only, then map them onto val_df with defaults for unseen keys.


Codex flagged that the baseline concatenates train + val before computing
groupby aggregations and frequency encodings, letting val-period
distribution shape train features and letting each val row influence its
own encoded values. Even with isFraud dropped first, this is time-leakage
that inflates val AUC vs. what would be seen at serving time.

Fix: compute all encoders (card1/addr1 amount stats, frequency encoding)
on train_df only; .join/.map onto both splits; fill unseen val keys with
train-global defaults. Refactored per-row features (time, amount) into a
small helper so both splits share that code path without concat.
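The fit-on-train / apply-to-both pattern can be sketched like this (column names follow the IEEE-CIS schema; the helper names and the particular encoders shown are illustrative, not the exact train.py source):

```python
import pandas as pd

def fit_encoders(train_df: pd.DataFrame) -> dict:
    # All statistics come from the training split only.
    return {
        "card1_amt_mean": train_df.groupby("card1")["TransactionAmt"].mean(),
        "card1_freq": train_df["card1"].value_counts(),
        "amt_mean_global": train_df["TransactionAmt"].mean(),
    }

def apply_encoders(df: pd.DataFrame, enc: dict) -> pd.DataFrame:
    # Applied identically to train and val; unseen keys in the validation
    # split fall back to train-global defaults instead of refitting.
    df = df.copy()
    df["card1_amt_mean"] = (df["card1"].map(enc["card1_amt_mean"])
                            .fillna(enc["amt_mean_global"]))
    df["card1_freq"] = df["card1"].map(enc["card1_freq"]).fillna(0)
    return df
```

The key invariant: no `pd.concat([train_df, val_df])` anywhere before the statistics are computed.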

Baseline AUC drops from the previously-reported 0.914 to 0.910 — the
right number, not artificially inflated. Expected Weco trajectory (0.928-
0.933 at 200 steps with full instructions) unchanged in shape; case study
absolute numbers used the leaky baseline so they shift slightly here.

Also expanded instructions.md and README to distinguish target leakage
(isFraud in the dataframe during aggregation) from time leakage (val
distribution in the encoder fit), with the fit-on-train / apply-to-both
pattern spelled out for future encoders Weco proposes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
